Open jianshu93 opened 2 years ago
Hi Jianshu,
You should be able to use the same equation that converts k-mer Jaccard similarity to ANI for AAI as well, substituting the amino-acid statistics (the Jaccard similarity J and the k-mer length k computed on protein sequences).
Specifically:
1 + log(2*J/(1+J)) / k
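For context, here is a sketch of how this relates to the Mash distance, assuming the standard Mash definition; the symbol D is introduced here only for illustration, and log is the natural logarithm (which is what np.log computes):

```latex
D = -\frac{1}{k}\,\ln\!\left(\frac{2J}{1+J}\right),
\qquad
\text{identity} \approx 1 - D = 1 + \frac{1}{k}\,\ln\!\left(\frac{2J}{1+J}\right)
```

As a quick check, J = 1/3 gives 2J/(1+J) = 1/2, so with k = 7 the estimate is 1 + ln(0.5)/7, roughly 0.901.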
For Python code, you might do something like:
import numpy as np
amino_jaccards = np.asarray(jaccard_values)  # jaccard_values (placeholder name): the Jaccard similarities you parsed or computed
est_amino_identity = 1. + np.log(2 * amino_jaccards / (1. + amino_jaccards)) / k  # k: amino-acid k-mer length
This transformation is really all you need. Also, in my experiments, weighted Jaccard (probminhash or bagminhash) can yield somewhat more accurate ANI estimates than set-based Jaccard, at the cost of more time and memory; depending on the nature of your data, the weighted extensions might be worth trying.
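To make the set-based versus weighted distinction concrete, here is a minimal sketch that computes both Jaccard variants exactly from k-mer counts and applies the transformation above. It does not use probminhash or bagminhash; those are sketching methods, whereas this computes the quantities directly on toy input. All function names are hypothetical, and returning 0 when there are no shared k-mers (and clamping negative values to 0) is a choice made here, not something taken from either library.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def set_jaccard(a, b):
    """Set-based Jaccard: shared distinct k-mers / all distinct k-mers."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def weighted_jaccard(a, b):
    """Weighted Jaccard: sum of per-k-mer min counts / sum of max counts."""
    keys = set(a) | set(b)
    num = sum(min(a[x], b[x]) for x in keys)  # Counter returns 0 for absent k-mers
    den = sum(max(a[x], b[x]) for x in keys)
    return num / den if den else 0.0


def jaccard_to_identity(j, k):
    """Apply 1 + log(2J / (1 + J)) / k, clamped to be non-negative."""
    if j <= 0.0:
        return 0.0  # no shared k-mers: the log is undefined, so report 0 (a choice)
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    k = 3  # toy amino-acid k-mer length, chosen only for illustration
    p1 = kmer_counts("MKVLAAGIVGLLLAQ", k)  # made-up protein fragments
    p2 = kmer_counts("MKVLAAGIVGLLMAQ", k)
    for name, fn in (("set", set_jaccard), ("weighted", weighted_jaccard)):
        j = fn(p1, p2)
        print(name, round(j, 4), round(jaccard_to_identity(j, k), 4))
```

The two variants only differ when some k-mers occur more than once, so on short or low-repeat sequences they will give nearly the same estimate.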
Thanks,
Daniel
Thanks, Daniel. This is very helpful.
Jianshu
Hello Daniel,
For the nucleotide Jaccard similarity J, estimated by MinHash (e.g. probminhash), we can follow the Mash paper and apply a log transformation, D = -1/k * log(2J/(1+J)), to approximate ANI as 1 - D. What if it is the Jaccard similarity of amino acid/protein sequences? We should make some adjustment to it to approximate AAI (average amino acid identity), right?
Thanks,
Jianshu