BioMedBigDataCenter / VENAS


Regarding Clusters #5

Open vinitamehlawat opened 2 years ago

vinitamehlawat commented 2 years ago

Hi @qianjiaqiang

I have some questions regarding the cluster definition:

You previously mentioned that clusters are generated by the Louvain algorithm, but in a genetic sense, what exactly is a cluster?

Is it a collection of exactly identical sequences, like: ATGCATGCATGC ATGCATGCATGC ATGCATGCATGC

OR is it ATGGCATGGC AAGGCATGGC AAGGCATGGC ATGGCATGGC (similar sequences with one bp change)

OR could it be

ATTTCCGGT AAAACCCCA (sequences with more variability between them)

The thing is that I am getting a high number of clusters, but I am not able to interpret the clusters in genetic terms.

EXAMPLE: There is one cluster that is dominated by the BA.1.1 lineage, but 3 sequences from the B.1 lineage are in this same cluster as well (I have used -r=1 and -b=0 to retain all my sequences).

It would be great if you could explain this.

Thank you very much, Vinita

lingyunchao commented 2 years ago

@vinitamehlawat

VENAS uses the ePISs (effective parsimony-informative sites) to represent each sequence. Identical sequences are shown as a single node on the network, and sequences that differ by only one bp are shown as two nodes if each type occurs at least twice.
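To make the "occurs at least twice" rule concrete, here is a minimal sketch in Python. It applies the plain textbook definition of a parsimony-informative site (a column where at least two different bases each appear in at least two sequences) and then collapses sequences to their bases at those sites. The function names are hypothetical, the extra "effective" filtering that VENAS applies is not reproduced, and gaps or ambiguous bases are not handled.

```python
from collections import Counter


def parsimony_informative_sites(seqs, min_count=2):
    """Return 0-based columns where at least two different bases
    each occur in at least `min_count` sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        frequent = [base for base, n in counts.items() if n >= min_count]
        if len(frequent) >= 2:
            sites.append(i)
    return sites


def genome_types(seqs, sites):
    """Collapse each sequence to its bases at `sites` and count how many
    sequences share each resulting type (one type = one network node)."""
    return Counter("".join(s[i] for i in sites) for s in seqs)


# The four sequences from the question: two variants, each seen twice.
seqs = ["ATGGCATGGC", "AAGGCATGGC", "AAGGCATGGC", "ATGGCATGGC"]
sites = parsimony_informative_sites(seqs)
types = genome_types(seqs, sites)

# A variant seen only once is not informative, so everything collapses
# to a single type.
singleton = ["ATGGCATGGC", "AAGGCATGGC", "ATGGCATGGC", "ATGGCATGGC"]
singleton_sites = parsimony_informative_sites(singleton)
singleton_types = genome_types(singleton, singleton_sites)
```

In this sketch the question's second example yields one informative site and two genome types, while a variant carried by a single sequence yields no informative site and therefore one node.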

VENAS uses the neighbor-joining method to construct the network, trying to connect the sequences with the smallest differences to form an undirected acyclic graph. The links between nodes represent differences or variations between viral genomes, and may also reveal transmission routes when enough samples have been sequenced.
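As an illustration of "connect the types with the smallest differences into an undirected acyclic graph", the sketch below builds a minimum spanning tree over Hamming distances with Kruskal's algorithm. This is a swapped-in technique chosen for brevity and is not the VENAS implementation; the function names and the three-base genome types are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(nodes):
    """Kruskal's algorithm over all pairwise Hamming distances.
    `nodes` must be unique, equal-length strings.
    Returns a list of (node_a, node_b, distance) edges."""
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    edges = sorted((hamming(a, b), a, b) for a, b in combinations(nodes, 2))
    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge does not create a cycle
            parent[root_a] = root_b
            tree.append((a, b, dist))
    return tree


# Hypothetical genome types (bases at the informative sites only).
genome_types = ["ATG", "ATC", "TTC", "TAC"]
tree = minimum_spanning_tree(genome_types)
```

Each edge in `tree` carries the number of differing sites between the two types it links, which corresponds to the variation a link represents on the network.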

Louvain is a disjoint community detection method that clusters the VENAS network into topologically linked subdomains, which represent different evolution clades containing many closely connected genome types. Such segmentation enabled us to subjectively identify the topological clades with “tight” intraclade connectivity and “sparse” interclade connectivity, which reflect the relationship of different genome types among viral communities formed during natural transmission.
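Louvain works by searching for a partition that maximizes modularity, a score that is high when links fall mostly inside communities and rarely between them. The sketch below does not run Louvain itself; it only computes the modularity of a given partition on a small, hypothetical tree-shaped network (two hubs, each with three leaves, joined by one link), so the "tight inside, sparse between" idea can be checked by hand. All node names and the `modularity` helper are assumptions made for illustration.

```python
def modularity(edges, community_of):
    """Newman modularity Q of a partition of an unweighted, undirected graph.
    `edges` is a list of (a, b) pairs; `community_of` maps node -> label."""
    m = len(edges)
    intra = {}       # number of edges with both ends in the community
    degree_sum = {}  # sum of node degrees per community
    for a, b in edges:
        for node in (a, b):
            label = community_of[node]
            degree_sum[label] = degree_sum.get(label, 0) + 1
        if community_of[a] == community_of[b]:
            label = community_of[a]
            intra[label] = intra.get(label, 0) + 1
    return sum(
        intra.get(label, 0) / m - (degree_sum[label] / (2 * m)) ** 2
        for label in degree_sum
    )


# Two star-shaped groups of genome types joined by a single link.
edges = [
    ("hub1", "a"), ("hub1", "b"), ("hub1", "c"),
    ("hub2", "d"), ("hub2", "e"), ("hub2", "f"),
    ("hub1", "hub2"),
]

by_star = {"hub1": 1, "a": 1, "b": 1, "c": 1,
           "hub2": 2, "d": 2, "e": 2, "f": 2}
one_cluster = {node: 1 for node in by_star}

q_by_star = modularity(edges, by_star)          # 5/14, about 0.357
q_one_cluster = modularity(edges, one_cluster)  # 0.0
```

Splitting the network at the single link between the hubs scores higher than keeping everything in one cluster, which is the kind of split a modularity-maximizing method prefers. Note that the score depends only on network topology, so a cluster groups genome types that are closely linked on the network rather than sequences sharing a lineage label.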

Pangolin uses a decision tree to compute the PANGO lineage. For sequences with incomplete features, especially those at leaf nodes of the VENAS network, Pangolin may not be able to assign an accurate lineage because the model has not been trained on the new features. So the 3 sequences from the B.1 lineage you mentioned may actually be close to the BA.1.1 lineage.