dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License

aminoacid distance to AAI? #58

Open jianshu93 opened 2 years ago

jianshu93 commented 2 years ago

Hello Daniel,

For nucleotide sequences, once the Jaccard similarity J has been estimated by MinHash (e.g. probminhash), we can follow the Mash paper and apply a log transformation, -1/k * log(2J/(1+J)), to get a distance from which ANI can be approximated. What if the Jaccard is computed on amino acid/protein sequences instead? Should we make some adjustment to the transformation to approximate AAI (average amino acid identity)?

Thanks,

Jianshu

dnbaker commented 2 years ago

Hi Jianshu,

You should be able to use the same equation that converts k-mer similarity to ANI for AAI as well, substituting the relevant statistics (the Jaccard similarity and k for the amino acid k-mers).

Specifically:

1 + log(2*J/(1+J)) / k
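For reference, this is the Mash distance rearranged as an identity. A short derivation, assuming the simple Poisson mutation model used in the Mash paper, where d is the per-position mutation rate, identity is roughly 1 - d, and log is the natural logarithm:

```latex
\begin{align*}
\frac{2J}{1+J} &\approx e^{-kd}
  && \text{(fraction of $k$-mers left unmutated)} \\
d &\approx -\frac{1}{k}\,\ln\frac{2J}{1+J} \\
\text{identity} \approx 1 - d &= 1 + \frac{1}{k}\,\ln\frac{2J}{1+J}
\end{align*}
```

The same algebra applies to amino acid k-mers; only J and k change.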

For Python code, you might perform something like:

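import numpy as np
k = 7  # amino acid k-mer length used for sketching (placeholder value)
amino_jaccards = np.array([0.9, 0.5, 0.1])  # placeholder; set the vector of Jaccard similarities, parsing or otherwise
est_amino_identity = 1. + np.log(2 * amino_jaccards / (1. + amino_jaccards)) / k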
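A slightly more defensive sketch of the same transformation may be useful when some pairs share no k-mers at all, since J = 0 sends the logarithm to negative infinity. The function name `jaccard_to_identity`, the input check, and the clamping to [0, 1] are assumptions made for illustration; they are not part of Dashing 2.

```python
import numpy as np

def jaccard_to_identity(jaccard, k):
    """Estimate identity from Jaccard similarity: 1 + ln(2J / (1 + J)) / k."""
    j = np.asarray(jaccard, dtype=float)
    if np.any((j < 0) | (j > 1)):
        raise ValueError("Jaccard similarities must lie in [0, 1]")
    # J == 0 gives log(0) = -inf; silence the warning and clamp afterwards.
    with np.errstate(divide="ignore"):
        identity = 1.0 + np.log(2.0 * j / (1.0 + j)) / k
    # Assumption: report very small or zero J as identity 0 rather than a negative value.
    return np.clip(identity, 0.0, 1.0)

# Hypothetical inputs: identical sets, half-overlapping sets, disjoint sets.
example = jaccard_to_identity([1.0, 0.5, 0.0], k=7)
```

Here J = 1 maps to an identity of 1, and J = 0 is clamped to 0 instead of propagating negative infinity into downstream tables.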

This transformation is really all you need. Also, in my experiments, weighted Jaccard (probminhash or bagminhash) can yield somewhat more accurate ANI estimates than set-based Jaccard, albeit at the cost of more time and memory; depending on the nature of the data, it might be worth trying the weighted extensions.
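To make the set-based versus weighted distinction concrete, here is a minimal sketch that computes both exactly on toy k-mer counts. It uses the standard weighted Jaccard definition (sum of minimum counts over sum of maximum counts); it does not use any sketching, and the helper names and sequences are hypothetical, not Dashing 2 code. Sketching methods estimate this or closely related quantities without storing full count tables.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def set_jaccard(a, b):
    """Shared distinct k-mers divided by all distinct k-mers."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0

def weighted_jaccard(a, b):
    """Sum of per-k-mer minimum counts divided by sum of maximum counts."""
    keys = a.keys() | b.keys()
    denom = sum(max(a[x], b[x]) for x in keys)
    return sum(min(a[x], b[x]) for x in keys) / denom if denom else 0.0

# Toy protein-like strings (hypothetical): same distinct 2-mers, different multiplicities.
x = kmer_counts("MKKKKL", 2)  # MK:1, KK:3, KL:1
y = kmer_counts("MKKL", 2)    # MK:1, KK:1, KL:1

set_j = set_jaccard(x, y)            # 3 shared / 3 distinct = 1.0
weighted_j = weighted_jaccard(x, y)  # (1 + 1 + 1) / (1 + 3 + 1) = 0.6
```

The set-based value cannot see that KK is repeated in one sequence, while the weighted value can, which is one reason the two may lead to different identity estimates on repetitive data.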

Thanks,

Daniel

jianshu93 commented 2 years ago

Thanks Daniel. This is very helpful.

Jianshu