Open hw449 opened 5 years ago
I am not aware that Markov Clustering can do soft clustering, even in theory. Can you share your adjacency matrix so we can test it?
Thanks. The file containing the edge information is too large (713 MB) to share with you. Here is my code. It is very short and simple. Could you please help me check whether there is a problem with it? I didn't have this problem last year.
import pandas as pd
import markov_clustering as mc
import networkx as nx
import numpy as np

data = pd.read_csv('BLAST_results/allv3_to_allv3_reduced', sep='\t')
edges_with_weights = [
    (data['qgeneid'][i], data['sgeneid'][i], data["sbitscore"][i])
    for i in range(len(data))
]

G = nx.Graph()
G.add_weighted_edges_from(edges_with_weights)
matrix = nx.to_scipy_sparse_matrix(G)
clusters = mc.get_clusters(mc.run_mcl(matrix, inflation=1.1))

nodes = list(G.nodes())
gene_families = []
for i, tup in enumerate(clusters):
    for item in tup:
        gene_families.append([i, nodes[item]])
gene_families = pd.DataFrame(gene_families, columns=['family_id', 'gene_id'])
gene_families.to_csv('gene_families.csv')
The algorithm can generate situations where a node belongs to multiple clusters. See the description by the original algorithm author here (specifically section 9.3).
I need to check if I am handling these cases correctly.
It would be really helpful to have access to the graph that you are performing the clustering on. Could you try to export the graph using
nx.write_weighted_edgelist(G, path_to_file)
and then zip up the file. Maybe that would be small enough to share?
Sure. It is now 33.4 MB! How can I share it with you?
I observed an increase in the number of nodes belonging to multiple clusters when inflation was set to a higher value.
If the algorithm really is assigning a node to multiple clusters, do we consider that bad behavior and fix it? If it is something MCL does naturally, why modify it?
Also I'll take a look at the author's description and see what I can gain from it.
Soft clustering sometimes causes problems in downstream analysis when hard clustering is required. It would be better if we could choose between soft and hard clustering. Thanks.
OK, so from what I've read, it's quite easy to fix. The original author simply assigned the node to the first cluster it appeared in, and only to that cluster, and gave a warning that this had happened. An option to keep the overlap should be added too. I think I'll handle this one if @GuyAllard is OK with that, and submit the PR ASAP.
@hw449 can you email me your graph? I think I have fixed the issue and would like to test it.
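The first-cluster rule described above can be sketched in plain Python. This is only an illustration, not the code from the pull request: the helper name `drop_overlap` is hypothetical, and it assumes the clusters come as a list of tuples of node indices, as in the code posted earlier in this thread.

```python
def drop_overlap(clusters):
    """Keep each node only in the first cluster it appears in.

    Hypothetical helper for illustration; `clusters` is assumed to be
    a list of tuples of node indices.
    """
    seen = set()
    hard_clusters = []
    for cluster in clusters:
        # Drop nodes that were already claimed by an earlier cluster.
        kept = tuple(node for node in cluster if node not in seen)
        seen.update(kept)
        if kept:  # skip clusters that the filter emptied completely
            hard_clusters.append(kept)
    return hard_clusters


soft = [(0, 1, 2), (2, 3), (4,)]
print(drop_overlap(soft))  # [(0, 1, 2), (3,), (4,)]
```

Because the rule depends on cluster order, a different ordering of the same clusters can give a different hard assignment.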
I sent the graph to your gmail. Thanks.
Moonire - if you have time to address this, it would be a big help!
I have pushed a solution and tested it on @hw449's graph, and it works like a charm. Simply execute the following line and it should be okay.
clusters = mc.get_clusters(mc.run_mcl(matrix, inflation=2), keep_overlap=False)
Thanks! So how can I install this newest version? Still using "pip install"?
For that, you'd have to wait until @GuyAllard merges my pull request. Until then, you can download my repo and replace the mcl.py file in python\Libs\sitepackage\markov_clustering with mine; that will do the trick.
I ran your latest mcl.py just now and saw the warning "to unable soft clustering set keep_overlap to True". I was confused, since I think soft clustering refers to a situation where different clusters can overlap with each other (i.e. share common nodes). Based on my understanding, if I want to disable soft clustering, I should set "keep_overlap" to False rather than to True.
That was poor English on my part! I'll correct it.
I tried to use the following code to avoid soft clustering:
clusters = mc.get_clusters(mc.run_mcl(matrix, inflation=2), keep_overlap=False)
However, I got the following error:
TypeError: get_clusters() got an unexpected keyword argument 'keep_overlap'
Please advise me on how to solve this problem.
Same error! Any suggestions?
@lcd522 You have to patch the file yourself until the authors push the fix.
I tried, but I still get "get_clusters() got an unexpected keyword argument 'keep_overlap'".
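That TypeError means the `get_clusters` function Python actually called does not define a `keep_overlap` parameter. One possible cause is that the patched file is not the copy being imported. A quick way to check a function's signature is sketched below. The helper `accepts_keyword` and the two stand-in functions are hypothetical and exist only to keep the example self-contained.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` declares a parameter called `name`
    or accepts arbitrary keyword arguments (**kwargs)."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-ins for an unpatched and a patched function (illustration only).
def old_get_clusters(matrix):
    return []


def patched_get_clusters(matrix, keep_overlap=False):
    return []


print(accepts_keyword(old_get_clusters, "keep_overlap"))      # False
print(accepts_keyword(patched_get_clusters, "keep_overlap"))  # True
```

With the real package, the same check would be `accepts_keyword(mc.get_clusters, "keep_overlap")`, and printing `mc.__file__` shows which installed copy Python is importing.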
After run_mcl and get_clusters, I found that one node belongs to two clusters. Is this method doing a soft clustering?
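To see how many nodes are affected, the cluster list can be scanned for nodes that occur more than once. The sketch below assumes the clusters are a list of tuples of node indices, as in the code in this thread; the function name `overlapping_nodes` is made up for the example.

```python
from collections import Counter


def overlapping_nodes(clusters):
    """Map each node found in more than one cluster to its cluster count.

    Hypothetical helper; `clusters` is assumed to be a list of tuples
    of node indices.
    """
    # set(cluster) guards against a node being listed twice in one cluster.
    counts = Counter(node for cluster in clusters for node in set(cluster))
    return {node: n for node, n in counts.items() if n > 1}


clusters = [(0, 1, 2), (2, 3), (3, 4, 2)]
shared = overlapping_nodes(clusters)
print(len(shared), "nodes belong to more than one cluster:", shared)
```

In this toy input, node 2 appears in three clusters and node 3 in two. An empty result means the clustering is already hard.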