Open JChristopherEllis opened 3 years ago
Hi,
Sure, you can do that.
You'd start with a distance or similarity matrix, and then feed that into a hierarchical clustering algorithm. Good options could include scipy's hierarchical clustering (https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html) or HDBSCAN, both of which can work on distance matrices.
For parameter selection, the k will depend on how similar the genomes are. 16-19 seems to be good for generating pairwise distances across all fungal genomes in RefSeq, but if you're working with many related strains you may want something more like 30-100.
An example workflow you might follow with SciPy's hierarchical clustering:
import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt
x = ...  # Parse distance matrix from file somehow
# If square, convert to the condensed form that scipy.cluster.hierarchy expects
if x.ndim > 1:
    from scipy.spatial.distance import squareform
    x = squareform(x)
L = sch.linkage(x)  # default method is single linkage
dn = sch.dendrogram(L)  # draws onto the current matplotlib axes
You can then export the dendrogram or visualize it with matplotlib. (Calling plt.show() after creating the dendrogram should display it.)
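If it helps, here is a minimal sketch of the export step on a toy 4x4 matrix. The genome names and distances are made up for illustration, average linkage is just one possible choice, and `linkage_to_newick` is a hypothetical helper (not part of SciPy) that walks the tree returned by `sch.to_tree`. Branch lengths are taken as differences in linkage height, which is only one of several conventions.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix; names and values are illustrative only.
names = ["genomeA", "genomeB", "genomeC", "genomeD"]
D = np.array([
    [0.00, 0.10, 0.50, 0.60],
    [0.10, 0.00, 0.55, 0.65],
    [0.50, 0.55, 0.00, 0.20],
    [0.60, 0.65, 0.20, 0.00],
])


def linkage_to_newick(Z, leaf_names):
    """Hypothetical helper: turn a SciPy linkage matrix into a Newick string."""
    root = sch.to_tree(Z)

    def build(node, parent_height):
        # Branch length = height of the parent merge minus this node's height.
        branch = parent_height - node.dist
        if node.is_leaf():
            return f"{leaf_names[node.id]}:{branch:.4f}"
        children = ",".join(build(c, node.dist) for c in (node.left, node.right))
        return f"({children}):{branch:.4f}"

    top = ",".join(build(c, root.dist) for c in (root.left, root.right))
    return f"({top});"


condensed = squareform(D)                      # square -> condensed form
L = sch.linkage(condensed, method="average")   # UPGMA-style average linkage
labels = sch.fcluster(L, t=2, criterion="maxclust")  # cut into at most 2 clusters
newick = linkage_to_newick(L, names)
print(labels)
print(newick)
```

The Newick string can be written to a file and opened in most tree viewers, and `fcluster` gives flat cluster labels if you want groups rather than a tree.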
The downside is that SciPy only works with symmetric distances, though you should be able to use containment distance with HDBSCAN. Of course, you can convert any similarity measure (containment, Jaccard) into a distance by using 1 - x for the similarity, or you can use the Mash formula to convert a Jaccard into a distance (-log((2 * x) / (1 + x)) / k).
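A small sketch of those two conversions, in case it is useful. The function names are hypothetical, and clamping the Mash distance to the range [0, 1] (so a Jaccard of 0 maps to 1) is a convention I'm assuming here, not something the formula itself requires. Note the leading minus sign: the log term is negative for any Jaccard below 1.

```python
import numpy as np


def similarity_to_distance(sim):
    """Simple 1 - x conversion for a similarity in [0, 1]."""
    return 1.0 - np.asarray(sim, dtype=float)


def mash_distance(jaccard, k):
    """Mash-style distance: -ln(2j / (1 + j)) / k, clamped to [0, 1]."""
    j = np.asarray(jaccard, dtype=float)
    with np.errstate(divide="ignore"):
        # j == 0 gives log(0) = -inf, so d becomes +inf and is clamped below.
        d = -np.log(2.0 * j / (1.0 + j)) / k
    return np.clip(d, 0.0, 1.0)


# Illustrative Jaccard values only.
jaccards = np.array([1.0, 0.5, 0.0])
print(similarity_to_distance(jaccards))
print(mash_distance(jaccards, k=21))
```

Either function can be applied to a whole similarity matrix before passing it to `squareform` and `linkage`.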
Spectral Clustering, for instance, will use affinities rather than distances.
I hope this helps, and let me know if you have any further questions or problems. Thanks,
Daniel
Quicktree also performs quite well
sed -i "1s/.*/$FILECOUNT/" "$dashingDistanceMatrix"
quicktree -in m "$dashingDistanceMatrix" > "$newick" # NJ-tree, https://github.com/khowe/quicktree
nw_reroot "$newick" > final.nwk # quick and dirty rooting, http://cegg.unige.ch/newick_utils
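For anyone building the matrix in Python rather than patching the first line with sed, a rough sketch of writing a PHYLIP-style square matrix follows. I'm assuming the layout quicktree's `-in m` mode reads is a first line holding the number of sequences, then one row per genome (name followed by its distances). `format_phylip` and the genome names are hypothetical, and strict PHYLIP readers may limit name length, so check the quicktree README before relying on this.

```python
import numpy as np


def format_phylip(names, D):
    """Hypothetical helper: render a square distance matrix as PHYLIP-style text."""
    D = np.asarray(D, dtype=float)
    if D.shape != (len(names), len(names)):
        raise ValueError("matrix must be square and match the number of names")
    lines = [str(len(names))]  # first line: number of sequences
    for name, row in zip(names, D):
        # One row per genome: the name, then its distances to every genome.
        lines.append(name + " " + " ".join(f"{v:.6f}" for v in row))
    return "\n".join(lines) + "\n"


# Illustrative values only.
names = ["genomeA", "genomeB", "genomeC"]
D = [[0.0, 0.1, 0.4],
     [0.1, 0.0, 0.5],
     [0.4, 0.5, 0.0]]
text = format_phylip(names, D)
print(text)
```

Writing `text` to a file should give something usable in place of the sed-patched matrix, under the format assumption above.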
Can you create a dendrogram from the dist results?
Also, could you recommend parameters for large fungal genome comparison?