GuyAllard / markov_clustering

markov clustering in python
MIT License

Some nodes belong to more than one cluster. #15

Open hw449 opened 5 years ago

hw449 commented 5 years ago

After run_mcl and get_clusters, I found that one node belongs to two clusters. Is this method doing a soft clustering?

Moonire commented 5 years ago

I am not aware that Markov Clustering can do soft clustering, even in theory. Can you share your adjacency matrix so we can test it?

hw449 commented 5 years ago

Thanks. The file containing the edge information is too large (713 MB) to share with you. Here is my code; it is very short and simple. Could you please help me check whether there is a problem with it? I didn't have this problem last year.

import pandas as pd
import markov_clustering as mc
import networkx as nx
import numpy as np

# load edge information from a csv file (qgeneid, sgeneid, and sbitscore are two nodes and a weight, respectively)
data = pd.read_csv('BLAST_results/allv3_to_allv3_reduced', sep='\t')
edges_with_weights = [(data['qgeneid'][i], data['sgeneid'][i], data["sbitscore"][i]) for i in range(len(data))]

# clustering
G = nx.Graph()
G.add_weighted_edges_from(edges_with_weights)
matrix = nx.to_scipy_sparse_matrix(G)
clusters = mc.get_clusters(mc.run_mcl(matrix, inflation=1.1))

# write results to a csv file
nodes = list(G.nodes())
gene_families = []
for i, tup in enumerate(clusters):
    for item in tup:
        gene_families.append([i, nodes[item]])
gene_families = pd.DataFrame(gene_families, columns=['family_id', 'gene_id'])
gene_families.to_csv('gene_families.csv')
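A quick way to confirm the overlap is to count how many clusters each node index appears in. The following is a minimal sketch, assuming `clusters` is a list of tuples of node indices (the shape the loop above iterates over); `find_overlapping_nodes` and `example_clusters` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def find_overlapping_nodes(clusters):
    """Map each node that appears in more than one cluster to its cluster ids."""
    membership = defaultdict(list)
    for cluster_id, members in enumerate(clusters):
        for node in members:
            membership[node].append(cluster_id)
    # Keep only the nodes that were seen in at least two clusters.
    return {node: ids for node, ids in membership.items() if len(ids) > 1}


# Made-up clusters for illustration: node 2 sits in clusters 0 and 1.
example_clusters = [(0, 1, 2), (2, 3), (4,)]
overlaps = find_overlapping_nodes(example_clusters)
print(overlaps)
```

An empty dictionary would mean the clustering is already a hard partition.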

GuyAllard commented 5 years ago

The algorithm can generate situations where a node belongs to multiple clusters. See the description by the original algorithm author here (specifically section 9.3).

I need to check if I am handling these cases correctly.
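To illustrate how overlap can arise, here is a toy sketch in plain Python. It assumes clusters are read off a converged matrix by treating every node with a nonzero diagonal entry as an attractor and taking the nonzero columns of that attractor's row as one cluster. The function name, the tolerance, and the 3x3 matrix are all made up for illustration and are not taken from the package.

```python
def clusters_from_matrix(matrix, tol=1e-9):
    """Toy cluster extraction from a converged matrix given as a list of rows."""
    clusters = set()
    for i, row in enumerate(matrix):
        if row[i] > tol:  # node i is an attractor
            # Every column with flow into this attractor joins its cluster.
            members = tuple(j for j, value in enumerate(row) if value > tol)
            clusters.add(members)
    return sorted(clusters)


# Hypothetical converged matrix (each column sums to 1).
# Nodes 0 and 2 are attractors; node 1 splits its flow evenly between them.
converged = [
    [1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
overlapping = clusters_from_matrix(converged)
print(overlapping)
```

In this example node 1 ends up in both clusters because it is attracted equally to two different attractors.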

It would be really helpful to have access to the graph that you are performing the clustering on. Could you try to export the graph using

nx.write_weighted_edgelist(G, path_to_file)

and then zip up the file. Maybe that would be small enough to share?

hw449 commented 5 years ago

Sure. It is now 33.4 MB! How can I share it with you?

hw449 commented 5 years ago

I observed an increase in the number of nodes belonging to multiple clusters when inflation was set to a higher value.

Moonire commented 5 years ago

If the algorithm really is assigning a node to multiple clusters, should we treat that as bad behavior and fix it? If it's something MCL does naturally, why modify it?

Also I'll take a look at the author's description and see what I can gain from it.

hw449 commented 5 years ago

Soft clustering sometimes causes problems in downstream analysis when hard clustering is required. It would be better if we could choose between soft and hard clustering. Thanks.

Moonire commented 5 years ago

OK, so from what I've read it's quite easy to fix. The original author simply assigned the node to the first cluster it appeared in, and only to that cluster, and issued a warning that this was the case. An option to keep the overlap should be added too. I think I'll handle this one if @GuyAllard is OK with that, and submit the PR ASAP.
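The first-cluster rule described here can be sketched as a post-processing step on a list of cluster tuples. This is an illustration only, not the code from the pull request; `drop_overlap` is a hypothetical helper and the sample clusters are made up.

```python
import warnings


def drop_overlap(clusters):
    """Keep each node only in the first cluster it appears in."""
    seen = set()
    result = []
    removed = 0
    for members in clusters:
        # Drop nodes already claimed by an earlier cluster.
        kept = tuple(node for node in members if node not in seen)
        removed += len(members) - len(kept)
        seen.update(kept)
        if kept:  # skip clusters that became empty
            result.append(kept)
    if removed:
        warnings.warn(
            f"{removed} overlapping assignment(s) removed; "
            "each node was kept in the first cluster it appeared in."
        )
    return result


# Made-up clusters: node 2 appears in the first two clusters.
hard_clusters = drop_overlap([(0, 1, 2), (2, 3), (4,)])
print(hard_clusters)
```

Because the outcome depends on cluster order, sorting the clusters first (for example by size) would change which cluster keeps a shared node.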

Moonire commented 5 years ago

@hw449 can you email me your graph? I think I have fixed the issue and would like to test it.

hw449 commented 5 years ago

I sent the graph to your gmail. Thanks.

GuyAllard commented 5 years ago

Moonire - if you have time to address this, it would be a big help!

Moonire commented 5 years ago

I have pushed a solution and tested it on @hw449's graph, and it works like a charm. Simply execute the following line and it should be okay.

clusters = mc.get_clusters(mc.run_mcl(matrix, inflation=2), keep_overlap=False)

hw449 commented 5 years ago

Thanks! So how can I install this newest version? Still using "pip install"?

Moonire commented 5 years ago

For that you'd have to wait until @GuyAllard merges my pull request. Until then, you can download my repo and replace the mcl.py file in python\Lib\site-packages\markov_clustering with mine; that will do the trick.

hw449 commented 5 years ago

I ran your latest mcl.py just now and saw the warning "to unable soft clustering set keep_overlap to True". I was confused, since I think soft clustering refers to a situation where different clusters can overlap with each other (i.e. share common nodes). Based on my understanding, if I want to disable soft clustering, I should set "keep_overlap" to False, rather than to True.

Moonire commented 5 years ago

That was poor English on my part! I'll correct it.

dhrubajyotiborah commented 3 years ago

I tried to use the following code to avoid soft clustering:

clusters = mc.get_clusters(mc.run_mcl(matrix, inflation=2), keep_overlap=False)

However, I got the following error:

TypeError: get_clusters() got an unexpected keyword argument 'keep_overlap'

Please advise me on how to solve this problem.

lcd522 commented 1 year ago

Same error! Any suggestions?

codykingham commented 7 months ago

@lcd522 You have to patch the file yourself until the authors push the fix.

xelleze commented 5 months ago

I tried, but this is what I also get: "get_clusters() got an unexpected keyword argument 'keep_overlap'"
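The TypeError reported above suggests that the installed `get_clusters` does not accept `keep_overlap`. One way to check this before calling it is to inspect the function signature with the standard library. The sketch below uses two stand-in functions rather than the real package, so `get_clusters_released` and `get_clusters_patched` are hypothetical.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter would also swallow the keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-ins for an unpatched and a patched version of the function.
def get_clusters_released(matrix):
    return []


def get_clusters_patched(matrix, keep_overlap=False):
    return []


print(accepts_keyword(get_clusters_released, "keep_overlap"))
print(accepts_keyword(get_clusters_patched, "keep_overlap"))
```

With the real package, passing `mc.get_clusters` as `func` would show whether the installed copy has the patched signature.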