RabbitBio / RabbitTClust

RabbitTClust: enabling fast clustering analysis of millions bacteria genomes with MinHash sketches

taxonomy badnumber #3

Closed kenietz closed 1 year ago

kenietz commented 2 years ago

Hi,

first of all thank you for the good program! It works very well and fast! :) Even on huge datasets!

Now on to the issue. I am clustering RefSeq viral sequences, but I see this:

--- clust-mst output ---

./RabbitTClust/clust-mst -l -i viral_paths.list -d 0.01 -t 30 -o vir.mst.clust
sketch by file: the inputFile is: viral_paths.list
set the threshold: 0.010000
set output file: vir.mst.clust
===the number is: 6021
===the badNumber is: 8040
===the totalNumber is: 14061

So I have 14061 genomes, and after clustering I get 5554 representative seqs out of 5913 seqs in the clust file.

From what I have seen in the code, it seems that 'badNumber' is the result of a taxonomy check. But RefSeq was updated, so maybe sequences are now excluded because of outdated taxonomy? Is there a way to update the taxonomy, or to skip that check and just use all sequences? Why is that check needed at all?

Any help or hints will be appreciated!

EDIT: maybe this is not connected to taxonomy. I just opened Sketchinfo.cpp, and there I can see that 'badNumber' is incremented if 'length < 10000'. Is that the minimum genome size? In any case, a lot of sequences are being excluded.

Best regards Dimitar

XiaomingXu1995 commented 2 years ago

Hi, Thank you for your comment.

First of all, sorry that the name "badNumber" misleadingly suggested a taxonomy check. As you have seen, "badNumber" is the number of input genomes with lengths less than 10000.

RabbitTClust is designed for clustering long genome sequences (see the Discussion and Conclusion section of our paper, https://doi.org/10.1101/2022.10.13.512052), so genomes with lengths less than 10000 were ignored.

I have added a parameter "-m" to set the minimum genome length for filtering, which was fixed at 10000 in the previous version. You can compile the latest version and run RabbitTClust with the "-m 0" parameter to use all genomes.

Best Xiaoming Xu
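The length filter described in this reply can be illustrated with a small sketch. This is not RabbitTClust's actual code: the function name `count_by_min_length` and the example lengths are hypothetical, and the only behavior taken from the thread is that genomes with `length < 10000` are counted in "badNumber" and that a minimum length of 0 keeps everything.

```python
def count_by_min_length(genome_lengths, min_len=10000):
    """Split genome lengths into kept and filtered counts.

    Hypothetical helper, not RabbitTClust code. It mirrors the three
    counters printed in the log: number, badNumber, totalNumber.
    """
    # A genome is "bad" when it is strictly shorter than the threshold.
    bad_number = sum(1 for length in genome_lengths if length < min_len)
    total_number = len(genome_lengths)
    number = total_number - bad_number
    return number, bad_number, total_number


# Made-up lengths for illustration only.
lengths = [1200, 9999, 10000, 48502, 152261]
print(count_by_min_length(lengths))             # default threshold of 10000
print(count_by_min_length(lengths, min_len=0))  # analogous to running with "-m 0"
```

The counters in the log above are consistent with this reading: 6021 kept plus 8040 filtered gives the 14061 total.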

kenietz commented 2 years ago

Hi, thank you very much for your reply!

Yes, I figured it out. Before you replied, I found where the 10000 limit was used in the source, changed it manually, and recompiled. Then it worked. However, adding the option to control the minimum length is really great! It makes things easier to control!

Just for information: for bacteria and archaea, say, the 10000 limit is not a problem. But viral datasets include quite a few genomes as short as 1000 nt, and those get filtered out.

Thank you again!

Best regards Dimitar
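To see in advance which genomes a given minimum length would drop, the lengths can be tallied directly from the files in the path list. The following is a minimal sketch under several assumptions that this thread does not confirm: one genome per file, plain uncompressed FASTA, and genome length measured as the total number of sequence characters across all records in the file. The function names `fasta_length` and `short_genomes` are hypothetical.

```python
def fasta_length(path):
    """Total number of sequence characters in a FASTA file (header lines skipped)."""
    total = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith(">"):
                total += len(line)
    return total


def short_genomes(list_path, min_len):
    """Return (path, length) for every listed file shorter than min_len.

    list_path is a text file with one FASTA path per line, in the style
    of the viral_paths.list file used in this thread.
    """
    with open(list_path) as handle:
        paths = [line.strip() for line in handle if line.strip()]
    lengths = [(path, fasta_length(path)) for path in paths]
    return [(path, length) for path, length in lengths if length < min_len]
```

For example, `len(short_genomes("viral_paths.list", 10000))` would give the number of listed genomes below the old fixed limit, which can help in choosing a value for "-m".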

ZekunYin commented 1 year ago

Hi, thanks for your valuable feedback. If you have any questions when using our software, please do not hesitate to contact us! Best, Zekun