BioMedBigDataCenter / VENAS


Regarding Cluster_id in nodeTable.csv #3

Open vinitamehlawat opened 2 years ago

vinitamehlawat commented 2 years ago

Hi @qianjiaqiang

Thank you so much for making this fantastic tool available to us. I tried it on my SARS-CoV-2 data. Running VENAS produced pi_pos_all.fasta, node_all.txt, net_all.txt, freq_all.txt, and three .csv files. I am a little puzzled about importing metadata into my graph:

  1. What do the different cluster ids in nodeTable.csv mean? If I colour by cluster, will it help me understand the transmission pattern in my data?
  2. The following is the edgeTable for my data.
Source Target
5 3
5 246
5 97
5 12
5 26
5 20
5 21
5 237
5 70
5 88
5 159
5 52
5 82
5 92
92 197
5 152
3 7
3 57
3 60
246 247

Is this to say that the key paths are 5, 92, 3, and 246 and that further transmission is taking place from these sources?
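One way to check which nodes act as hubs in the edgeTable above is to count how often each node appears in the Source column. This is only a minimal sketch: the thread does not say whether Source to Target encodes a transmission direction, so the counts describe the table, not necessarily the direction of spread.

```python
from collections import Counter

# Edge list copied from the edgeTable above as (Source, Target) pairs.
edges = [
    (5, 3), (5, 246), (5, 97), (5, 12), (5, 26), (5, 20), (5, 21),
    (5, 237), (5, 70), (5, 88), (5, 159), (5, 52), (5, 82), (5, 92),
    (92, 197), (5, 152), (3, 7), (3, 57), (3, 60), (246, 247),
]

# Count how many edges list each node in the Source column.
source_count = Counter(source for source, _ in edges)

# Print nodes from most to least frequent as a Source.
for node, count in source_count.most_common():
    print(node, count)
```

For this table, node 5 is the Source of 15 of the 20 edges, node 3 of three, and nodes 92 and 246 of one each.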

  3. I previously asked about getting the country/city/state metadata from pi_pos_all.fasta. I have about 15000 sequences in input.ma, and pi_pos_all.fasta contains 4000, but net.csv has 839 columns, which means I have to grep the metadata from the first 839 lines of `pi_pos_all.fasta`. However, that only gives information about 3 countries, whereas pi_pos_all.fasta contains more than a dozen countries.

If you could just clarify these points, I would greatly appreciate your time and effort.

Thank you Vinita

qianjiaqiang commented 2 years ago

Yeah, I will invite our colleague to update VENAS's README to clarify those points. Some updates are needed since the VENAS paper was accepted recently.
2 or 3 days? :)

qianjiaqiang commented 2 years ago

@vinitamehlawat README is updated.

vinitamehlawat commented 2 years ago

Hi @qianjiaqiang

Thank you very much for providing such a thorough explanation; my data now makes sense. I'll try the network again and let you know if I have any further questions.

Thanks Vinita

vinitamehlawat commented 2 years ago

Hi @qianjiaqiang

I reran VENAS on my data. When I ran the third command, python3 main_path_example.py, in my terminal, it gave this output:

`python3 main_path_example.py

75 96 210 286 182 407 41 229 277 77 541 411 95 420`

I realised that all these numbers belong to cluster_id 0. You explained in README.md that 'ClusterId indicates the classification to which the node belongs'. It would be really helpful if you could elaborate on what exactly classification means. Is it related to sequences that have the same ePIS, or is there some other reason they end up in different clusters?

Thanks Vinita

qianjiaqiang commented 2 years ago

Don't worry about this output. You can comment out line 163 in . The code on this line prints out the "small_node".

In the program, keyNode means the central node of a community generated by the Louvain algorithm. By filtering keyNodes with a threshold, we get filterNodes, which are the TRUE cluster centers. A small_node is a keyNode that is in neither filterNodes nor transNode (the nodes on main paths). Finally, each small_node is assigned the same color as the keyNode it is directly linked to.
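The relationship between keyNode, filterNode, and small_node described above can be sketched roughly as follows. This is not the VENAS source code. It assumes the communities have already been computed (for example by a Louvain implementation), that a keyNode is the highest-degree node of its community, and that the threshold applies to node degree. The function names and the `min_degree` parameter are hypothetical.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Build an undirected adjacency map from (a, b) edge pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def classify_key_nodes(neighbors, communities, min_degree, trans_nodes=frozenset()):
    """Return (key_nodes, filter_nodes, small_nodes)."""
    def degree(node):
        return len(neighbors[node])

    # keyNode: the node with the most neighbors in each community.
    key_nodes = [max(sorted(c), key=degree) for c in communities]
    # filterNode: a keyNode that passes the threshold (assumed: a degree cutoff).
    filter_nodes = [n for n in key_nodes if degree(n) >= min_degree]
    # small_node: a keyNode that is neither a filterNode nor on a main path.
    small_nodes = [n for n in key_nodes
                   if n not in filter_nodes and n not in trans_nodes]
    return key_nodes, filter_nodes, small_nodes

def color_small_nodes(neighbors, filter_nodes, small_nodes):
    """Give each small_node the cluster id of a directly linked cluster center."""
    cluster_of = {node: i for i, node in enumerate(filter_nodes)}
    colors = dict(cluster_of)
    for node in small_nodes:
        for nb in sorted(neighbors[node]):
            if nb in cluster_of:
                colors[node] = cluster_of[nb]
                break
    return colors

# Toy graph: node 1 is a large hub, node 10 is a small hub linked to node 1.
edges = [(1, 2), (1, 3), (1, 4), (1, 10), (10, 11)]
communities = [{1, 2, 3, 4}, {10, 11}]
neighbors = build_neighbors(edges)
key, filt, small = classify_key_nodes(neighbors, communities, min_degree=3)
colors = color_small_nodes(neighbors, filt, small)
print(key, filt, small, colors)
```

In the toy graph, node 10 is the center of its own small community but fails the cutoff, so it ends up with the same cluster id as node 1, the center it is directly linked to.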

qianjiaqiang commented 2 years ago

"The same color": you can treat it as the same cluster id in the final result.

vinitamehlawat commented 2 years ago

Hi @qianjiaqiang

Thank you so much for your prompt reply. As you suggested, I have colored my network by cluster_id and set a bigger size for the keyNodes, which have value = 1000 in nodeTable.csv. However, I have a few more questions:

1). In nodeTable.csv my cluster_ids run continuously from 0 to 19, but after 19 the next cluster_id is 34. Is there something specific behind this pattern?

2). You mentioned the Louvain algorithm. In my data, cluster_id 2 contains more than 6 different countries, and there is one keyNode which is, let's say, Singapore. Does Singapore having value 1000 mean that mutations are being transmitted from Singapore to the other countries, which are shown at a smaller size in cluster_id 2?

Please accept my apologies for troubling you yet again.

Thanks Vinita

qianjiaqiang commented 2 years ago

Yeah, value 1000 is used to help us draw the graph in Gephi. Value 1000 == filterNode == cluster center node.

i) No problem. When the program assigns a color (cluster) to each node, the numbers are sometimes not continuous.

ii) Hmm, one node represents one virus sequence. Value doesn't mean anything more than a weight in Gephi.

In my opinion, a filterNode (cluster center, value 1000) may imply a critical point of infection. If you are interested in the transmission, you can inspect the transNodes X, Y, Z. You can print the return_path variable (main_path_example.py).

cluster 0 --- transNode X --- transNode Y --- cluster 1
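To illustrate how a chain like the one above can be read off a network, here is a small sketch that finds a shortest path between two cluster centers with breadth-first search and treats the interior nodes as transNodes. This is a stand-in technique for illustration only. The thread does not state how VENAS computes its main paths, so the actual `return_path` contents may differ. The node labels are made up.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search for a shortest path in an undirected graph."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    previous = {start: None}  # maps each visited node to the node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back from the goal to the start to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nb in neighbors.get(node, []):
            if nb not in previous:
                previous[nb] = node
                queue.append(nb)
    return None  # the two nodes are not connected

# Hypothetical toy network with two cluster centers joined through X and Y.
edges = [("center0", "X"), ("X", "Y"), ("Y", "center1"), ("center0", "A")]
path = shortest_path(edges, "center0", "center1")
trans_nodes = path[1:-1]  # interior nodes between the two centers
print(path, trans_nodes)
```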

vinitamehlawat commented 2 years ago

Hi @qianjiaqiang

Here I am attaching a PNG of the network I made, colored by cluster_id (I used Cytoscape). In this image, brown is cluster_id 2 and purple is cluster_id 5, and I have labelled the nodes with country names. It would be great if you could explain what a whole cluster means and on what basis it is formed. If the Louvain algorithm detects the different communities, what exactly does Singapore, the keyNode in this cluster, tell us: that it is a critical point of infection for these countries, or that different mutations are travelling from the major Singapore node to the other nodes?

image1

The second image I am pasting here is from Gephi. The light pink color is cluster_id 2. How should I explain that it contains several countries?

image2

If you could just simply explain a little bit what's going on in these networks, I'd really appreciate it.

Thank you Vinita

qianjiaqiang commented 2 years ago

One node represents one unique sequence (found in one, two, three, or more countries/regions). If a node consists of only one country, e.g. Singapore, then it is from Singapore.

In Fig1, the key node is the largest brown/purple node (because within its community this node connects to more nodes than the others). You can also trace the trans-path from 'return_path' on Fig1 manually. A trans-path starts at one key node and ends at another key node. For Fig2, you can add country information for each node to the node csv file, and Gephi can display it.
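Adding a country column to the node table could look roughly like the sketch below. The column names (`Id`, `cluster_id`, `value`), the sample rows, and the `countries_by_node` mapping are all assumptions for illustration. The real nodeTable.csv header and the way countries are looked up per node may differ. A node found in several countries gets them joined with a semicolon.

```python
import csv
import io

# Made-up stand-in for nodeTable.csv; replace with open("nodeTable.csv", newline="").
node_table = io.StringIO("Id,cluster_id,value\n5,2,1000\n3,2,1\n92,5,1\n")

# Hypothetical mapping from node id to the countries where that sequence was found.
countries_by_node = {"5": ["Singapore"], "3": ["India", "Singapore"]}

reader = csv.DictReader(node_table)
rows = []
for row in reader:
    # Nodes without known metadata get an empty country field.
    row["country"] = ";".join(countries_by_node.get(row["Id"], []))
    rows.append(row)

# Write the extended table; an in-memory buffer is used here instead of a file.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["country"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The extended file can then be imported as a node table so that the country column is available as a label or color attribute.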

vinitamehlawat commented 2 years ago

ahh now this whole picture is clear to me, Thank you so much @qianjiaqiang for your kind help.

qianjiaqiang commented 2 years ago

[image: 154208999-80af1825-5197-4a73-9b9a-270ac83ac786]

vinitamehlawat commented 2 years ago

Thank you so much @qianjiaqiang for this excellent assistance