MrOlm / drep

Rapid comparison and dereplication of genomes

mash pre-cluster sketch size? #137

Open jianshu93 opened 2 years ago

jianshu93 commented 2 years ago

Dear dRep team,

This is confusing to me when using mash sketch size 1000:

def run_mash_on_genome_chunks(genome_chunks, mash_exe, sketch_folder, MASH_folder, logdir, **kwargs):
    dry = kwargs.get('dry', False)
    p = kwargs.get('processors', 6)
    MASH_s = kwargs.get('MASH_sketch', 1000)
    multi_round = kwargs.get('multiround_primary_clustering', True)

If you check Table 2 of the FastANI paper, a sketch size of 1000 performs poorly on nearly all datasets when compared against traditional BLAST-based ANI and FastANI. A sketch size of at least 10^4, or even 10^5, is needed for the pre-cluster ANI to be close to the FastANI or traditional ANI value. Even with 10^5 (Figure 1(a)), below 80% ANI Mash is only an approximation and still not close to the real ANI values. Is there a reason to use a sketch size of 1000, which works only for very distantly related genomes? For pre-clustering at any ANI value above 80%, 1000 is far from enough. It would be nice if sketch size and k-mer size options could be passed to Mash.
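To make the sketch size concern concrete, here is a minimal numerical sketch. It is not dRep or Mash code. It uses the published Mash distance formula D = -(1/k) ln(2j/(1+j)), assumes Mash's default k-mer size of 21, and approximates sampling noise in the number of shared hashes as binomial. The function names are hypothetical, and the model covers sampling noise only, not other sources of disagreement between Mash and alignment-based ANI.

```python
import math

def jaccard_from_ani(ani, k=21):
    """Jaccard index implied by an ANI value under the Mash distance formula."""
    e = math.exp(-k * (1.0 - ani))
    return e / (2.0 - e)

def mash_ani(shared, sketch_size, k=21):
    """Mash-style ANI estimate (1 - D) from the number of shared sketch hashes."""
    if shared <= 0:
        return 0.0  # no shared hashes: the distance saturates
    j = shared / sketch_size
    return 1.0 + math.log(2.0 * j / (1.0 + j)) / k

def ani_band(true_ani, sketch_size, k=21):
    """Approximate one-standard-deviation band of the estimate (binomial assumption)."""
    j = jaccard_from_ani(true_ani, k)
    mean = j * sketch_size
    sd = math.sqrt(sketch_size * j * (1.0 - j))
    return mash_ani(mean - sd, sketch_size, k), mash_ani(mean + sd, sketch_size, k)

for s in (1_000, 10_000, 100_000):
    low, high = ani_band(0.85, s)
    expected = jaccard_from_ani(0.85) * s
    print(f"s={s}: ~{expected:.0f} shared hashes, ANI band {low:.4f} to {high:.4f}")
```

Under these assumptions, only a small number of the 1000 hashes are expected to be shared at 85% ANI, so the estimate rests on few observations and the band narrows as the sketch size grows.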

Thanks,

Jianshu

MrOlm commented 2 years ago

Hi Jianshu,

1) That sketch size is only used for Mash, not fastANI. In dRep, the goal of Mash is to provide a quick pre-clustering, so the accuracy doesn't matter very much. That small sketch size is chosen to make this first step as fast as possible, since speed is the goal of the primary clustering.

2) You can adjust this value to be whatever you like using the -ms parameter.
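As an illustration of point 2, the sketch below assembles a `dRep dereplicate` command with a larger Mash sketch size. The output directory, genome paths, and thresholds are hypothetical placeholders, and the flag spellings (`-pa`, `-sa`, `-ms`, `--S_algorithm`) should be checked against `dRep dereplicate -h` for your installed version.

```python
import shlex

def build_drep_command(out_dir, genomes, primary_ani, secondary_ani, mash_sketch):
    """Assemble (but do not run) a dRep dereplicate command line."""
    return [
        "dRep", "dereplicate", out_dir,
        "-g", *genomes,
        "-pa", str(primary_ani),      # primary (Mash) clustering threshold
        "-sa", str(secondary_ani),    # secondary (exact ANI) threshold
        "-ms", str(mash_sketch),      # Mash sketch size, default 1000
        "--S_algorithm", "fastANI",
    ]

# Hypothetical inputs, for illustration only.
cmd = build_drep_command(
    "drep_out", ["genomes/a.fasta", "genomes/b.fasta"],
    primary_ani=0.85, secondary_ani=0.90, mash_sketch=10_000,
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.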

Best, Matt

jianshu93 commented 2 years ago

Hello Matt,

Thanks for the quick response. What if I want to pre-cluster at 85% ANI and then run the exact ANI step at 90%? With a sketch size of 1000, Mash will never approximate 85% well; the true value will be 88% or so (a small sketch size tends to underestimate ANI, so a nominal 85% ANI pre-cluster threshold could correspond to a larger true ANI value). So two genomes that are actually around 90% ANI could be put into different pre-clusters; the exact fastANI comparison would then never see this pair, and the dereplication may not be what the user expects. Do you see my point? For very high ANI dereplication, like 95%, there is no problem, because the pre-cluster will never reach that resolution. This only arises when we want to dereplicate at lower ANI values, such as 90% or 85%.
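The failure mode described above can be shown with a toy example. This is a simplified sketch, not dRep's implementation: it uses single-linkage grouping for the primary step, and the genome names and Mash estimates are hypothetical values chosen for illustration. The point is only that a pair split apart in the primary step is never passed to the secondary comparison.

```python
from itertools import combinations

def primary_clusters(genomes, mash_est, threshold):
    """Group genomes whose estimated ANI meets the threshold (single linkage)."""
    parent = {g: g for g in genomes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        if mash_est[frozenset((a, b))] >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return list(clusters.values())

def secondary_pairs(clusters):
    """Pairs that would receive an exact ANI comparison: within-cluster pairs only."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

genomes = ["A", "B", "C"]
# Hypothetical estimates: suppose A and B are truly ~90% ANI,
# but the small-sketch Mash estimate comes out at 84%.
mash_est = {
    frozenset(("A", "B")): 0.84,
    frozenset(("A", "C")): 0.70,
    frozenset(("B", "C")): 0.71,
}

for threshold in (0.85, 0.80):
    pairs = secondary_pairs(primary_clusters(genomes, mash_est, threshold))
    print(threshold, frozenset(("A", "B")) in pairs)
```

With the primary threshold at 0.85 the A/B pair is never compared exactly; lowering the threshold (or improving the estimate with a larger sketch) brings it back into the secondary step.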

Thanks,

Jianshu

MrOlm commented 2 years ago

Hi Jianshu,

Ah, I see. I understand now. In that circumstance it would certainly make sense to increase the -ms parameter to a higher value, but I don't really want to change the default value, in order to keep the program run-time down.

Best, Matt

jianshu93 commented 2 years ago

Hi Matt,

True, in most cases users want to dereplicate at higher ANI, so speed is more important. In my case I want to cluster at 85% ANI, so the pre-cluster should be at 80% or so. Even with a sketch size of 10^4, Mash is still much faster than FastANI, even though the overall process will take a long time. So yes, this is just a reminder that this can happen and that we should be cautious. Perhaps the documentation could say that users who want a lower pre-cluster ANI value should increase the sketch size. Does that sound reasonable? I got strange dereplication results compared with using FastANI only at 85%.

Thanks,

Jianshu

MrOlm commented 2 years ago

I see, this does make sense and sounds reasonable. I'll look into adding a warning like this in the next dRep update.
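One possible shape for such a warning is sketched below. This is a hypothetical check, not code from dRep: the function name and both cutoffs (`ani_cutoff`, `min_sketch`) are assumptions picked for illustration, loosely following the 90% and 10^4 figures mentioned in this thread.

```python
import warnings

def check_sketch_size(primary_ani, sketch_size, ani_cutoff=0.90, min_sketch=10_000):
    """Warn when a low primary ANI threshold is paired with a small Mash sketch.

    Both cutoffs are illustrative assumptions, not values used by dRep.
    Returns True if a warning was issued.
    """
    if primary_ani < ani_cutoff and sketch_size < min_sketch:
        warnings.warn(
            f"Primary clustering at ANI {primary_ani} with Mash sketch size "
            f"{sketch_size} may separate genomes that should be compared; "
            "consider increasing the sketch size with -ms.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False

check_sketch_size(0.85, 1000)   # issues a UserWarning
check_sketch_size(0.95, 1000)   # silent: high threshold, default sketch is fine
```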