dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

Dendrogram #75

Open JChristopherEllis opened 3 years ago

JChristopherEllis commented 3 years ago

Can you create a dendrogram from the dist results?

Also, could you recommend parameters for large fungal genome comparison?

dnbaker commented 3 years ago

Hi,

Sure, you can do that.

You'd start with a distance or similarity matrix, and then feed that into a hierarchical clustering algorithm. Good options could include scipy's hierarchical clustering (https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html) or HDBSCAN, both of which can work on distance matrices.

For parameter selection, the choice of k will depend on how similar the genomes are. A k of 16-19 seems to work well for generating pairwise distances across all fungal genomes in RefSeq, but if you're working with many closely related strains you may want something more like 30-100.

An example workflow with SciPy's hierarchical clustering might look like this:

import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt
from scipy.spatial.distance import squareform

x = ... # Parse distance matrix from file somehow
# linkage expects a condensed (1-D) distance matrix; if x is square,
# convert it with scipy.spatial.distance.squareform
if x.ndim > 1:
    x = squareform(x)

L = sch.linkage(x)  # default method is 'single'; method='average' gives UPGMA
dn = sch.dendrogram(L)
plt.show()

You can then export the dendrogram or visualize it with matplotlib. (Calling plt.show() after creating the dendrogram should display it.)
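One way to export the tree is to walk the linkage matrix and write it out as Newick, which most tree viewers accept. This is a minimal sketch, not part of dashing: the helper name `linkage_to_newick`, the toy 3x3 matrix, and the labels are all hypothetical, and branch lengths are simply taken as differences in merge height. It assumes at least two samples.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform


def linkage_to_newick(Z, labels):
    """Serialize a SciPy linkage matrix as a Newick string (hypothetical helper)."""
    root = sch.to_tree(Z)

    def build(node, parent_height):
        # Branch length = height of the parent merge minus height of this node
        branch = parent_height - node.dist
        if node.is_leaf():
            return f"{labels[node.id]}:{branch:.6f}"
        left = build(node.get_left(), node.dist)
        right = build(node.get_right(), node.dist)
        return f"({left},{right}):{branch:.6f}"

    left = build(root.get_left(), root.dist)
    right = build(root.get_right(), root.dist)
    return f"({left},{right});"


# Toy symmetric distance matrix standing in for real dist output
square = np.array([
    [0.0, 0.1, 0.4],
    [0.1, 0.0, 0.5],
    [0.4, 0.5, 0.0],
])
Z = sch.linkage(squareform(square), method="average")
newick = linkage_to_newick(Z, ["A", "B", "C"])
print(newick)
```

The recursion is fine for a few hundred genomes; for very deep trees an iterative traversal would be safer.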

The downside is that SciPy's hierarchical clustering only works with symmetric distances, though you should be able to use containment distance with HDBSCAN. Of course, you can convert any similarity measure (containment, Jaccard) into a distance by using 1 - x, where x is the similarity, or you can use the Mash formula to convert a Jaccard into a distance (-log((2 * x) / (1 + x)) / k).
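Both conversions can be sketched with NumPy. The function names here are hypothetical, the Mash distance is taken as -ln(2j / (1 + j)) / k, and clipping inputs to [0, 1] and capping the result at 1.0 (so that a Jaccard of 0 stays finite for clustering) are assumptions of this sketch.

```python
import numpy as np


def similarity_to_distance(sim):
    """Turn a similarity in [0, 1] (Jaccard or containment) into 1 - similarity."""
    sim = np.clip(np.asarray(sim, dtype=float), 0.0, 1.0)
    return 1.0 - sim


def jaccard_to_mash_distance(jaccard, k):
    """Mash-style distance: -ln(2j / (1 + j)) / k, capped at 1.0 (assumption)."""
    j = np.clip(np.asarray(jaccard, dtype=float), 0.0, 1.0)
    with np.errstate(divide="ignore"):
        # j == 0 gives log(0) = -inf, so the distance becomes +inf before capping
        d = -np.log(2.0 * j / (1.0 + j)) / k
    # Adding 0.0 normalizes -0.0 to 0.0 when j == 1
    return np.minimum(d, 1.0) + 0.0


jaccards = np.array([1.0, 0.5, 0.0])
print(similarity_to_distance(jaccards))
print(jaccard_to_mash_distance(jaccards, k=21))
```

Either result can be passed on to a clustering routine once it is arranged as a symmetric matrix with a zero diagonal.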

Spectral Clustering, for instance, will use affinities rather than distances.

I hope this helps, and let me know if you have any further questions or problems. Thanks,

Daniel

mihkelvaher commented 3 years ago

Quicktree also performs quite well:

# Overwrite the first line with the number of genomes (quicktree reads PHYLIP-format matrices)
sed -i "1s/.*/$FILECOUNT/" "$dashingDistanceMatrix"
quicktree -in m "$dashingDistanceMatrix" > "$newick" # NJ-tree, https://github.com/khowe/quicktree
nw_reroot "$newick" > final.nwk # quick and dirty rooting, http://cegg.unige.ch/newick_utils