Request to update yacht with the latest sourmash_plugin_branchwater features and performance improvements

KoslickiLab / YACHT

A mathematically characterized hypothesis test for organism presence/absence in a metagenome

MIT License


Request to update yacht with the latest sourmash_plugin_branchwater features and performance improvements #124

Open tnmquann opened 1 month ago

tnmquann commented 1 month ago

Hi,

I've been using yacht as a way to reduce false positives in sourmash, and I wanted to ask whether it's possible to update the tool to incorporate the latest features from sourmash_plugin_branchwater. This would be helpful for a couple of reasons:

Currently, the newest version of yacht only supports processing one sample at a time, which becomes time-consuming when working with many samples.
As highlighted in the tutorial, the training process is indeed time-consuming, especially with large databases. I've been training GTDB-R220 (all genomes) for nearly a week without results, whereas training on the genomic representatives version only took me about a morning. This performance gap is significant.

I believe that improvements such as supporting the new RocksDB data format and using manysketch and/or fastmultigather could help reduce processing times and allow multiple samples to be handled simultaneously.

Thanks for the great tool, and I'm looking forward to potential improvements in future releases!

dkoslicki commented 1 month ago

Thanks for the suggestion, @tnmquann! We (@mahmudhera and @chunyuma) have recently been working on this exact issue, but from a different direction: the reference database formation step in yacht train contains an inherently quadratic step, in that all genomes need to be compared to all others to identify those that are within the ANI threshold. Taking a different algorithmic approach than anything in branchwater, we've been able to reduce the training time on a dataset of ~2.7M genomes from a month to about 3 days on a 128-core server. It will take a while, but we will eventually make that an official part of YACHT.
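To illustrate why an all-against-all comparison scales quadratically, here is a minimal Python sketch. It is not YACHT's implementation and says nothing about the faster approach mentioned above: the function names are hypothetical, and Jaccard similarity on small integer sets is only a stand-in for the ANI estimate used during training.

```python
from itertools import combinations


def num_pairs(n: int) -> int:
    """Number of unordered genome pairs in an all-vs-all comparison."""
    return n * (n - 1) // 2


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; a stand-in for an ANI estimate (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(sketches: dict, threshold: float) -> list:
    """Naive scan: compare every genome to every other one."""
    return [
        (x, y)
        for x, y in combinations(sorted(sketches), 2)
        if jaccard(sketches[x], sketches[y]) >= threshold
    ]


# Toy data (made up for illustration): g1 and g2 share most elements.
toy = {"g1": {1, 2, 3, 4}, "g2": {1, 2, 3, 5}, "g3": {7, 8, 9}}
print(similar_pairs(toy, threshold=0.5))

# Pair count for a database of about 2.7M genomes.
print(num_pairs(2_700_000))
```

For roughly 2.7M genomes, the naive scan would need about 3.6 trillion pairwise comparisons, which is why avoiding the full scan matters far more than speeding up any single comparison.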

As for "only supporting one sample at a time": since running yacht on one sample is independent of running it on any other, doesn't something like GNU parallel or xargs -P work? Doing it within yacht run itself wouldn't save much time, apart from the small amount needed to load the reference/training database.
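For anyone who prefers a script over GNU parallel or xargs -P, the same idea can be sketched in Python: launch one independent process per sample and cap how many run at once. This is a rough sketch, not part of YACHT. The helper names are hypothetical, and the yacht run flags in the template (--json, --sample_file, --out) and the file name config.json are assumptions to be checked against yacht run --help.

```python
import shlex
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_sample(command_template: str, sample: str) -> int:
    """Run one command for one sample and return its exit code."""
    # Quote the sample so it stays a single argument after shlex.split.
    cmd = shlex.split(command_template.format(sample=shlex.quote(sample)))
    return subprocess.run(cmd, check=False).returncode


def run_all(command_template: str, samples: list, jobs: int = 4) -> dict:
    """Run the command for every sample, at most `jobs` at a time."""
    # Threads only wait on child processes, so a thread pool is enough here.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = pool.map(lambda s: run_sample(command_template, s), samples)
        return dict(zip(samples, codes))


if __name__ == "__main__":
    # Hypothetical command line: adjust the flags and paths to your setup.
    template = (
        "yacht run --json config.json "
        "--sample_file {sample} --out {sample}.xlsx"
    )
    # Sample signature files are taken from the command-line arguments.
    print(run_all(template, sys.argv[1:]))
```

Each sample still loads the reference database on its own, so this only parallelizes independent runs. A non-zero exit code in the returned dictionary marks a sample that needs a rerun.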

tnmquann commented 1 month ago

Hi @dkoslicki, thank you for letting me know about the upcoming release. It’s exciting to hear about the algorithmic improvements to reduce training time, and I look forward to seeing them in action when they're ready.

For running multiple samples, I’m currently using GNU parallel to process them simultaneously, as you suggested. However, I just had a sudden thought: would there be any significant time savings if the database were loaded once for all queries, similar to how the fastmultigather module operates, and multithreading were then used to process multiple samples at the same time? Just a curiosity that popped up while working with YACHT.

Thanks again, and really excited for the next release.

dkoslicki commented 1 month ago

We have experimented with loading the database once and using multithreading to process multiple samples, and found that the gains were negligible (on the order of seconds). This might be helpful when you have a massive reference database, which typically occurs with a very high ANI value (e.g., 0.99995), but in such cases a more targeted approach seems better (focusing on a specific clade or clades).