JianiC / RSV_Epitope

MDS with T-cell epitope as input for antigenic map #5

Open JianiC opened 3 years ago

JianiC commented 3 years ago
Screen Shot 2020-10-06 at 10 18 23 AM

Ancestral Sequence reconstruction

Benefits for Beast https://groups.google.com/g/beast-users/c/P4_buh3u_5A

A pilot test with the Smith et al. data:

  1. The EPICC score was calculated using only the shared epitopes.
  2. The n*n matrix was visualized with MDS, and the compressed MDS coordinates were used for k-means clustering. Tried K=10, based on different methods for finding the optimal number of clusters. (Image: Smith_epicc_mds)
  3. The result was paired with a phylogenetic tree built from the same sequences.

    I do not really observe monophyletic clusters on the phylogenetic tree, but there seems to be some genetic diversity accumulating within each immune cluster?

    smith_mds_tree

  4. The distance to the reference sequence (BI_16190_68, used in the Smith paper) was measured on the MDS map as the straight-line distance between points. MDS is used to convert similarity to dissimilarity. With the EPICC estimate itself, the latest isolates are less similar to the reference, but the pattern is not the same after using MDS to get the distance? (Image: smith_mds_distance)

    Ref: EPICC (similarity) estimate against BI_16190_68. (Image: simith_epicc_share_BI68)

    Trying to find the optimal number of clusters for k-means. (Images: smith_eibow, smith_K_silhouette, smith_K_gap)
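
For reference, the straight-line distance in step 4 can be sketched as below. This is only an illustration in Python (the thread itself uses R): the coordinates and the strain names other than BI_16190_68 are made up, and `distances_to_reference` is a hypothetical helper, not something from the repo.

```python
import math


def distances_to_reference(coords, reference):
    """Straight-line (Euclidean) distance from every strain to the reference."""
    ref_point = coords[reference]
    return {name: math.dist(point, ref_point) for name, point in coords.items()}


# Hypothetical 2-D MDS coordinates; only the reference name comes from the thread.
coords = {
    "BI_16190_68": (0.0, 0.0),
    "strain_A": (3.0, 4.0),
    "strain_B": (-1.0, 0.0),
}
dist_to_ref = distances_to_reference(coords, "BI_16190_68")
```

Because this distance is measured in the compressed 2-D map, it does not have to rank strains the same way as the raw EPICC similarity does.
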
JianiC commented 3 years ago

Edit:

  1. Normalize the input: shared epitope content * 2 / (epitope content i + epitope content j), so the input is the proportion of T-cell immunity shared between two strains.
  2. Change the equation used to calculate distance. There are many ways to convert a similarity or dissimilarity into a distance. The default was mds <- data %>% dist() %>% cmdscale(k=2), where dist() computes the Euclidean distance between rows (the shortest distance between two points).

Now I treat the similarity proportion like a correlation matrix: corr = sqrt(data) and dist = sqrt(2*(1-corr)), because the percentage of shared variance is represented by the square of the correlation coefficient, r^2.

  3. I also tried a smaller number of clusters. There still seem to be some problems, so other clustering methods may need to be considered. (Images: MDS_1020, pair_tree1020, new distance1020)
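
A minimal sketch of the normalization and the correlation-style distance described above, assuming the normalized similarity falls between 0 and 1. The function names and the example numbers are hypothetical (they are not real EPICC output), and Python is used only for illustration.

```python
import math


def normalized_similarity(shared, content_i, content_j):
    """Proportion of shared epitope content: 2 * shared / (content_i + content_j)."""
    return 2.0 * shared / (content_i + content_j)


def correlation_distance(similarity):
    """Treat similarity as r^2: corr = sqrt(similarity), dist = sqrt(2 * (1 - corr))."""
    corr = math.sqrt(similarity)
    return math.sqrt(2.0 * (1.0 - corr))


# Hypothetical epitope content values for two strains.
sim = normalized_similarity(shared=18.0, content_i=40.0, content_j=32.0)
dist = correlation_distance(sim)
```

With this conversion, identical strains (similarity 1) end up at distance 0 and strains with no shared content (similarity 0) at distance sqrt(2).
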

I can at least observe the gradual evolution of T-cell immunity now, similar to the Smith paper.

Next: try with RSV, and ask for comments on these calculations.

JianiC commented 3 years ago

New version

Simplify the distance calculation: dist = 1 - similarity. (Image: k-mean_cluster_T)
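
To make the simplified pipeline concrete, here is a rough pure-Python sketch of dist = 1 - similarity followed by classical (Torgerson) MDS, which is the method behind cmdscale in R. It assumes a symmetric similarity matrix with 1 on the diagonal. The eigenvectors are found by shifted power iteration just to avoid dependencies; the function name, the similarity values, and the iteration count are my own choices, not from the repo.

```python
import math
import random


def classical_mds(dist, k=2, iters=500, seed=0):
    """Classical MDS of a symmetric distance matrix (sketch, small matrices only)."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]  # equals column means (symmetric input)
    grand_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Shift by a Gershgorin bound so power iteration finds the largest
    # eigenvalue of B rather than the largest in absolute value.
    shift = max(sum(abs(x) for x in row) for row in b)
    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for axis in range(k):
        v = [rng.uniform(0.5, 1.5) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i]
                 for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(max(eigval, 0.0))  # negative eigenvalues are dropped
        for i in range(n):
            coords[i][axis] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Hypothetical normalized similarities between three strains.
similarity = [
    [1.0, 0.8, 0.5],
    [0.8, 1.0, 0.7],
    [0.5, 0.7, 1.0],
]
dist_matrix = [[1.0 - s for s in row] for row in similarity]
coords = classical_mds(dist_matrix, k=2)
```

In this toy example the three distances (0.2, 0.3, 0.5) fit on a straight line, so the map reproduces them exactly. With real data some eigenvalues can be negative, and the 2-D map is then only an approximation of 1 - similarity.
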

In the Smith paper, they also manually adjusted the clusters determined by k-means to make them match the phylogeny.

pair-tree mds_distance

From a T-cell immunity MDS plot, colored by the antigenic clusters defined in the Smith paper:

smith_T_mds_antigenetic_grup

JianiC commented 3 years ago

3D map clusters

Still cannot separate all of the clusters, but it seems to be helpful.

Screen Shot 2020-11-02 at 4 11 23 PM
JBahl commented 3 years ago

Hmm - can you add a fourth dimension? Year of isolation?

JianiC commented 3 years ago

Not a real 4D plot, but color (shown in the legend) is used for the fourth dimension.

Screen Shot 2020-11-02 at 11 51 40 PM
JianiC commented 3 years ago

Alternatively, try building the MDS with Euclidean distances.

Difference compared with the n*n matrix:

With the n*n comparison, the distance was calculated simply as 1 - similarity. To correct what I said earlier: what I have done is not MDS, it is just PCA. Maybe for my RSV research, try using clade-representative sequences in RSV?

Here, the cross-immunity to each sequence was taken as a feature, and the relative distance between strains was calculated using Euclidean distance (the minimal length between each pair of points). The locations of the vaccine strains were added.
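
A small sketch of the difference between the two distances, in Python for illustration. Each row of the similarity matrix is treated as a feature vector and the Euclidean distance is taken between rows, which is what dist() does by default in R. The matrix values and the helper name are hypothetical.

```python
import math


def row_distance_matrix(features):
    """Euclidean distance between every pair of rows (feature vectors)."""
    n = len(features)
    return [[math.dist(features[i], features[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical cross-immunity profiles: row i = similarity of strain i to each strain.
profiles = [
    [1.0, 0.8, 0.5],
    [0.8, 1.0, 0.7],
    [0.5, 0.7, 1.0],
]
row_dist = row_distance_matrix(profiles)               # distance between profiles
one_minus = [[1.0 - s for s in row] for row in profiles]  # direct 1 - similarity
```

The two matrices are not the same: for the first two strains the direct conversion gives 0.2, while the row-based Euclidean distance depends on how both strains relate to every other strain.
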

Screen Shot 2020-11-10 at 12 09 04 AM Screen Shot 2020-11-10 at 12 11 56 AM

New issue: the T-immunity distance does not follow a linear clock signal.
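
One simple way to put a number on "does not follow a linear clock" is to regress distance on year of isolation and look at R^2. The sketch below assumes that approach (it is not necessarily how the plots were made); the years and distances are invented, and `linear_fit` is a hypothetical helper.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical isolation years and distances to the reference that level off.
years = [1968, 1975, 1982, 1989, 1996]
distance = [0.0, 0.30, 0.45, 0.50, 0.52]
slope, intercept, r_squared = linear_fit(years, distance)
```

An R^2 well below 1, or residuals that curve systematically, would support the observation that the distance saturates rather than growing at a constant rate.
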

Screen Shot 2020-11-10 at 12 13 25 AM Screen Shot 2020-11-10 at 12 14 14 AM

To address the comments from Friday's meeting:

K-means: to further evaluate the quality of the clustering

  1. Elbow method: minimize the total within-cluster sum of squares.
  2. Average silhouette method: maximize the average silhouette value.
  3. Gap statistic: compare the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data.
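
The first two criteria above can be sketched for a given cluster assignment as follows (the gap statistic is left out because it also needs simulated reference data). This is illustrative Python with made-up points and labels; the function names are hypothetical.

```python
import math


def within_cluster_ss(points, labels):
    """Total within-cluster sum of squared distances to each cluster centroid."""
    total = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total


def mean_silhouette(points, labels):
    """Average silhouette width; points in singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D MDS coordinates with one sensible and one poor labelling.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
```

Running both functions over the labels produced by k-means for each candidate K gives the curves used for the elbow and silhouette plots.
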

Density-based clustering rationale: k-means is severely affected by the presence of noise and outliers in the data. But for MDS, k-means should be used for classification, because k-means is also based on Euclidean distance.
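
For comparison, here is a minimal sketch of one density-based method, DBSCAN. The thread does not say which density-based algorithm was considered, so DBSCAN is an assumption on my part, and the points, eps, and min_pts values are made up. It shows how an outlier is labelled as noise instead of being forced into a cluster as k-means would do.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, -1 meaning noise."""
    n = len(points)
    labels = [None] * n  # None = not visited yet

    def neighbors(i):
        # Includes the point itself, as in the usual min_pts convention.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # only core points keep expanding the cluster
                queue.extend(nbrs)
    return labels


# Hypothetical 2-D coordinates: two tight groups and one far outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Here the outlier at (50, 50) is marked as noise, while the two tight groups are recovered without having to choose K in advance.
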