HetzDra / turboGliph

R implementation of GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots), an algorithm developed by Glanville et al. to identify specificity groups in the T cell receptor repertoire based on local (motif sharing) and global (Hamming distance) similarities.

Different clusters (tags) contain the same TCR sequence #6

Open XiangweiZhai opened 1 year ago

XiangweiZhai commented 1 year ago

Hi, thank you so much for developing this comprehensive and efficient package! As I understand it, when several CDR3b sequences are assigned to the same cluster and share the same tag, it is because they have similar protein structures and can recognize the same antigen. So it should be impossible for a given sequence to have multiple structures and appear in different clusters. The results of turbo_gliph() match this expectation, but the results of gliph2() do not.

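library(dplyr)    # provides %>% and select()
library(stringr)  # provides str_detect()

res_gliph2 <- turboGliph::gliph2(cdr3_sequences = gliph_input_data, n_cores = 10)
gliph2Properties <- res_gliph2$cluster_properties
# Flag every cluster whose member list contains the sequence of interest
seqMatch <- str_detect(gliph2Properties$members, "CANSPTSSTTSYEQYF")
gliph2Properties[seqMatch, ] %>% select(type, tag, cluster_size, members)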

    type          tag cluster_size                                                                                                                members
7    local     STTS_7_8            5                                   CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF CASSSTSSTTSYEQYF CASSTTSSTTSYEQYF
70   local    TTSY_4_20            5                                   CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF CASSSTSSTTSYEQYF CASSTTSSTTSYEQYF
71   local    TSST_4_20            7 CANSPTSSTTSYEQYF CASSPTSSTHSYEQYF CASSPTSSTPSYAQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF CASSSTSSTTSYEQYF CASSTTSSTTSYEQYF
72   local    SSTT_4_20            5                                   CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF CASSSTSSTTSYEQYF CASSTTSSTTSYEQYF
77   local    PTSS_4_20            5                                   CANSPTSSTTSYEQYF CASSPTSSTHSYEQYF CASSPTSSTPSYAQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
120 global %PTSSTTSYE_S            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
256 global S%TSSTTSYE_P            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
375 global SP%SSTTSYE_T            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
397 global SPT%STTSYE_S            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
399 global SPTS%TTSYE_S            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
400 global SPTSS%TSYE_T            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
401 global SPTSST%SYE_T            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
402 global SPTSSTT%YE_S            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
403 global SPTSSTTS%E_Y            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF
404 global SPTSSTTSY%_E            3                                                                     CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF

Why does the single sequence "CANSPTSSTTSYEQYF" appear in 15 different clusters, each with its own tag? The same thing happens with other sequences: "CNARGQAITEKLFF", "CASSPWGQTASSYNEQFF", "CASSIRSAYEQYF", ...
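
For anyone who wants to see how widespread this is across a whole result, here is a minimal sketch in base R that counts how many clusters each sequence belongs to. It assumes that `members` is a single string of space-separated CDR3b sequences, as the printed output above suggests. The toy data frame is only a stand-in for `res_gliph2$cluster_properties`, built from three of the rows shown above; the names `member_list`, `membership`, `clusters_per_seq` and `tags_for_seq` are illustrative and not part of turboGliph.

```r
# Toy stand-in for res_gliph2$cluster_properties (assumption: the real
# `members` column is a space-separated string, as in the printed output).
cluster_properties <- data.frame(
  type = c("local", "local", "global"),
  tag = c("STTS_7_8", "PTSS_4_20", "%PTSSTTSYE_S"),
  members = c(
    "CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF CASSSTSSTTSYEQYF CASSTTSSTTSYEQYF",
    "CANSPTSSTTSYEQYF CASSPTSSTHSYEQYF CASSPTSSTPSYAQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF",
    "CANSPTSSTTSYEQYF CASSPTSSTTSYEQHF CASSPTSSTTSYEQYF"
  ),
  stringsAsFactors = FALSE
)

# Split each members string into its individual sequences.
member_list <- strsplit(cluster_properties$members, " ", fixed = TRUE)

# Long table with one row per (tag, sequence) pair.
membership <- data.frame(
  tag = rep(cluster_properties$tag, lengths(member_list)),
  cdr3 = unlist(member_list),
  stringsAsFactors = FALSE
)

# Number of clusters each sequence belongs to, most widely shared first.
clusters_per_seq <- sort(table(membership$cdr3), decreasing = TRUE)
print(clusters_per_seq)

# Tags of every cluster that contains one sequence of interest.
tags_for_seq <- membership$tag[membership$cdr3 == "CANSPTSSTTSYEQYF"]
print(tags_for_seq)
```

On the real output, replacing the toy data frame with `res_gliph2$cluster_properties` should give the per-sequence cluster counts directly, provided the column layout matches the assumption above.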

BenSolomon commented 9 months ago

Hi @XiangweiZhai - This is most likely expected behavior of GLIPH2. As discussed in the GLIPH2 publication: "In this new version, first, a TCR can be assigned to more than one cluster." Two tests you could try to confirm this are: