RabbitBio / RabbitTClust

RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

missing tips in newick tree #7

Open Djeppschmidt opened 11 months ago

Djeppschmidt commented 11 months ago

Hello,

I really appreciate the Newick output format that you recently introduced!

I think this is a bug in building the tree. As I work with the Newick file, it appears that the tree is not built with separate internal nodes; instead, about half of the nodes are internal nodes labeled with isolate names that should be tips on the tree. For example, I ran RabbitTClust to cluster all Salmonella in the NCBI Pathogen database (~500k isolates) using the following command:

clust-mst -d 0.001 -l -i fasta_input.txt --newick-tree -o sal.mst.clust.0001

This generates a tree with ~270k tips and ~238k internal nodes (it should have ~500k tips).

I also ran a tiny version of this with 8 isolates, which produced 3 tips and 5 internal nodes:

(((/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863221_contigs_skesa.fasta:0.000794,(/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863395_contigs_skesa.fasta:0.016157)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR900926_contigs_skesa.fasta:0.000969,(/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863393_contigs_skesa.fasta:0.001294)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863392_contigs_skesa.fasta:0.013981)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863223_contigs_skesa.fasta:0.000000)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863224_contigs_skesa.fasta:0.020389)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863396_contigs_skesa.fasta;

This makes it impossible to filter the tree by tips, because about half of the isolates appear as internal node labels when I believe they should be tip labels.
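A quick way to confirm counts like these is to classify every label in the Newick string by whether it directly follows a closing parenthesis (an internal node label) or not (a tip). The following is a minimal Python sketch, not part of RabbitTClust: the function name `classify_labels` is hypothetical, the short labels A to H stand in for the eight file paths in the tree above (same topology and branch lengths), and it assumes unquoted labels.

```python
import re

# Tokens: structural characters, branch lengths (":..."), or labels.
TOKEN = re.compile(r"[(),;]|:[^(),;]+|[^(),:;]+")
STRUCTURAL = {"(", ")", ",", ";"}


def classify_labels(newick):
    """Split the labels of a Newick string into tips and internal nodes.

    A label that comes right after ')' names an internal node;
    every other label is a tip. Assumes unquoted labels.
    """
    tips, internal = [], []
    prev = None
    for tok in TOKEN.findall(newick.strip()):
        if tok not in STRUCTURAL and not tok.startswith(":"):
            (internal if prev == ")" else tips).append(tok)
        prev = tok
    return tips, internal


# Same shape as the 8-isolate tree above, with paths shortened to A..H.
example = (
    "(((A:0.000794,(B:0.016157)C:0.000969,(D:0.001294)E:0.013981)"
    "F:0.000000)G:0.020389)H;"
)
tips, internal = classify_labels(example)
print(len(tips), "tips:", tips)
print(len(internal), "labeled internal nodes:", internal)
```

On the shortened example this reports 3 tips (A, B, D) and 5 labeled internal nodes (C, E, F, G, H), matching the counts for the 8-isolate run.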

Is anyone else experiencing this issue, or am I missing something?

Thanks for your help, Dietrich

XiaomingXu1995 commented 11 months ago

Hi Dietrich,

Thanks for your issue!

I am on a business trip this month. I will look into it and let you know when I have made progress.

Best, Xiaoming

avw-adifranco commented 7 months ago

Hi Xiaoming,

I ran into the same issue as Dietrich: I was hoping the tree would output each cluster as a leaf. Instead, most clusters are present as named internal nodes.

Have you had time to look into it? I believe the cause could be the presence of zeros in the distance matrix, since some sequences could be considered subsets of others. Maybe replacing those zeros with a very small distance value would produce what Dietrich and I expect.

If you could point me to where in your code the Newick tree is built, I could look into it further.

Best, Arnaud

XiaomingXu1995 commented 7 months ago

Apologies for the delayed response.

The Newick tree in RabbitTClust is an output format for the minimum spanning tree (MST) generated by clust-mst. Unfortunately, it is not possible to designate all genome nodes as leaf nodes: because the edges of the MST connect the genomes to one another directly, some genomes necessarily appear as internal nodes.
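To illustrate why this happens, here is a small Python sketch that builds an MST with Prim's algorithm and writes it out as a rooted Newick string. It is an assumption about the general approach, not RabbitTClust's actual code: the function names `prim_mst` and `mst_to_newick`, the genome names g0 to g3, and the distance values are all hypothetical.

```python
def prim_mst(dist):
    """Return MST edges (parent, child, weight) for a symmetric matrix.

    Prim's algorithm starting from vertex 0, so every edge points
    away from vertex 0, which then serves as the root.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known link into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not in the tree yet.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def mst_to_newick(edges, names):
    """Write MST edges (rooted at vertex 0) as a Newick string."""
    children = {}
    for p, c, w in edges:
        children.setdefault(p, []).append((c, w))

    def write(node):
        kids = children.get(node, [])
        if not kids:
            return names[node]          # degree-1 genome: a tip
        inner = ",".join(f"{write(c)}:{w}" for c, w in kids)
        return f"({inner}){names[node]}"  # genome with children: internal

    return write(0) + ";"


names = ["g0", "g1", "g2", "g3"]
dist = [
    [0.0, 0.1, 0.3, 0.7],
    [0.1, 0.0, 0.2, 0.6],
    [0.3, 0.2, 0.0, 0.4],
    [0.7, 0.6, 0.4, 0.0],
]
print(mst_to_newick(prim_mst(dist), names))
# prints: (((g3:0.4)g2:0.2)g1:0.1)g0;
```

In this toy case none of the distances between different genomes is zero, yet three of the four genomes end up as labeled internal nodes, simply because the MST chains them together.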

Best, Xiaoming
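For anyone who needs every genome as a tip for downstream filtering, one possible workaround is to post-process the Newick file: each labeled internal node is turned into an unlabeled internal node with an extra zero-length child tip carrying the label. The sketch below is not a RabbitTClust feature; the function name `push_labels_to_tips` is hypothetical, the labels A to H stand in for the file paths in the 8-isolate tree from the original report, and it assumes unquoted labels and no numeric support values on internal nodes.

```python
import re

# A label that directly follows ')' is an internal node label.
INTERNAL_LABEL = re.compile(r"\)([^(),:;]+)")


def push_labels_to_tips(newick):
    """Rewrite '(children)LABEL' as '(children,LABEL:0)'.

    The internal node stays in place (now unlabeled) and keeps its
    branch length; the label moves to a new zero-length tip below it.
    """
    return INTERNAL_LABEL.sub(r",\1:0)", newick)


example = (
    "(((A:0.000794,(B:0.016157)C:0.000969,(D:0.001294)E:0.013981)"
    "F:0.000000)G:0.020389)H;"
)
print(push_labels_to_tips(example))
# prints (one line):
# (((A:0.000794,(B:0.016157,C:0):0.000969,(D:0.001294,E:0):0.013981,F:0):0.000000,G:0):0.020389,H:0);
```

All eight labels are tips in the rewritten string, and the path lengths between genomes are unchanged because the added branches have length zero. The result is still the MST in a different notation, not a phylogeny inferred by a tree-building method.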