Open tnmquann opened 1 month ago
Thanks for the suggestion @tnmquann! We (@mahmudhera and @chunyuma) have recently been working on this exact issue, but from a different direction: the reference database formation step in yacht train contains an inherently quadratic step, in that all genomes need to be compared to all others to identify those that are within the ANI threshold. Taking a different algorithmic approach than anything in branchwater, we've been able to reduce the training time on a dataset of ~2.7M genomes from a month to about 3 days on a 128-core server. It will take a while, but we will eventually make that an official part of YACHT.
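To give a sense of why an all-versus-all comparison gets expensive at this scale, here is a minimal Python sketch of the naive baseline. It is not YACHT's code and not the new approach mentioned above; `similarity` and `threshold` are hypothetical stand-ins for an ANI estimate and the ANI cutoff.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def count_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def pairs_within_threshold(
    items: Sequence[str],
    similarity: Callable[[str, str], float],  # hypothetical ANI-like score
    threshold: float,
) -> List[Tuple[str, str]]:
    """Naive baseline: compare every item against every other item."""
    return [
        (a, b)
        for a, b in combinations(items, 2)
        if similarity(a, b) >= threshold
    ]


# Roughly 2.7M genomes means on the order of 3.6 trillion comparisons.
print(count_pairs(2_700_000))  # 3644998650000
```

Doubling the number of genomes roughly quadruples the number of pairs, which is why this step dominates training time on large reference sets.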
As for "only supporting one sample at a time": since running yacht on one sample is independent of running it on any other sample, doesn't something like gnu parallel or xargs -P work? Doing it in yacht run itself wouldn't actually save much time at all, apart from the small amount of time needed to load the reference/training database.
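For anyone who would rather drive this from a script than from gnu parallel or xargs -P, the same idea can be sketched in Python: launch one independent process per sample and cap how many run at once. This is only a sketch under stated assumptions. `run_samples_in_parallel` and `yacht_cmd` are hypothetical helper names, and the flags and file names inside `yacht_cmd` are assumptions that should be checked against `yacht run --help` for your installed version.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def run_samples_in_parallel(
    samples: Sequence[str],
    build_cmd: Callable[[str], List[str]],
    max_workers: int = 4,
) -> Dict[str, int]:
    """Run one independent process per sample; return each sample's exit code."""

    def run_one(sample: str) -> int:
        # Every sample gets its own process, so each run loads the
        # reference database again (the small cost discussed above).
        return subprocess.run(build_cmd(sample)).returncode

    # Threads only wait on child processes here, so a thread pool is enough.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(run_one, samples))
    return dict(zip(samples, codes))


def yacht_cmd(sample: str) -> List[str]:
    # ASSUMPTION: flag names and the config file name are placeholders,
    # not confirmed in this thread; verify with `yacht run --help`.
    return [
        "yacht", "run",
        "--json", "reference_config.json",
        "--sample_file", sample,
        "--out", sample + ".xlsx",
    ]
```

A call such as `run_samples_in_parallel(["s1.sig.zip", "s2.sig.zip"], yacht_cmd, max_workers=2)` would then return a mapping from sample name to exit code, which makes failed samples easy to spot and rerun.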
Hi @dkoslicki , thank you for letting me know about the upcoming release, and it’s exciting to hear about the algorithmic improvements to reduce training time. I’ll be looking forward to seeing that in action when it’s ready.
For running multiple samples, I’m currently using gnu parallel to process them simultaneously, as you suggested. However, I just had a sudden thought: would there be any significant time savings if the database were loaded once for all queries, similar to how the fastmultigather module operates, with multithreading then used to process multiple samples at the same time? Just a curiosity that popped up while working with YACHT.
Thanks again, and really excited for the next release.
We have experimented with loading the database once and using multithreading to process multiple samples, and found that the gains were negligible (on the order of seconds). This might be helpful when you have a massive reference database, which typically occurs with a very high ANI value (e.g. 0.99995), but in such cases a more targeted approach seems better (focusing on a specific clade or clades).
Hi,
I've been using yacht as a way to reduce false positives in sourmash, and I wanted to ask if it's possible to update the tool to incorporate the latest features from sourmash_plugin_branchwater? This would be helpful for a couple of reasons:
I believe incorporating improvements like support for the new RocksDB data format and the use of manysketch and/or fastmultigather could help reduce processing times and allow multiple samples to be handled simultaneously.
Thanks for the great tool, and I'm looking forward to potential improvements in future releases!