HetzDra / turboGliph

R implementation of GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots), an algorithm developed by Glanville et al. to identify specificity groups in the T cell receptor repertoire based on local (motif sharing) and global (Hamming distance) similarities.

cluster tag fields #9

Open ndsimons opened 1 day ago

ndsimons commented 1 day ago

Hello, I recently ran turboGliph::gliph2 and I'm looking at the outputs - in $cluster_list there is a list of cluster tags like 'YQK_4_17'. I get that YQK is the motif, but what are 4 and 17 in this example?

Thanks very much!

HetzDra commented 1 day ago

Hey there,

thank you for your interest in our package.

The quick answer to your question is the following excerpt from the vignette that comes with this package:

As shown, several pieces of information are provided for each cluster:

...

  • tag : tag of the cluster. Composed of the motif, the first and the last N-terminal starting position of the motif for local similarities. For global similarities, it is composed of the global structure, the V gene if necessary and all unique amino acids at the variable position. Information are separated by underscores.

...

In your example, this means that the cluster “YQK_4_17” contains all CDR3 sequences in which the motif “YQK” starts between position 4 and position 17 of the CDR3 sequence. Sequences that contain the motif at a later starting position, for example position 19, do not belong to this cluster.
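To make the reading of such a tag concrete, here is a minimal sketch in Python. It is written purely for illustration and is not code from turboGliph. The function names are hypothetical, the CDR3 sequences are toy examples, and the sketch assumes that positions are 1-based, that both bounds are inclusive, and that the motif is a literal string.

```python
def parse_local_tag(tag):
    """Split a local-cluster tag such as 'YQK_4_17' into motif, first, last."""
    motif, first, last = tag.split("_")
    return motif, int(first), int(last)


def motif_start_positions(cdr3, motif):
    """Return every 1-based position at which `motif` starts in `cdr3`."""
    positions = []
    start = cdr3.find(motif)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = cdr3.find(motif, start + 1)
    return positions


def matches_tag(cdr3, tag):
    """Check whether the motif starts inside the position range given by the tag.

    Assumption: both bounds of the range are inclusive.
    """
    motif, first, last = parse_local_tag(tag)
    return any(first <= p <= last for p in motif_start_positions(cdr3, motif))


# Toy sequences, not real data.
early = "CASSYQKGGF"            # motif starts at position 5
late = "C" + "A" * 17 + "YQK"   # motif starts at position 19

print(parse_local_tag("YQK_4_17"))
print(matches_tag(early, "YQK_4_17"))
print(matches_tag(late, "YQK_4_17"))
```

Under these assumptions the first toy sequence falls inside the range of the tag and the second one does not, which mirrors the position 19 example above.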

If you are interested in why this information is needed, I will give you a more detailed explanation below:

The GLIPH algorithms restrict cluster membership for both local and global similarities. For local similarity, a motif is much more informative if not only the motif itself but also its position within the CDR3 sequence is conserved. For this reason, the parameter “motif_distance_cutoff” of gliph and gliph2 controls how far apart the starting positions of the motif in two sequences may be for the pair to still count as a local similarity. By default, this cutoff is three amino acids. A motif at the start of a CDR3 region can play a different role in antigen recognition than the same motif at the end of the region. For example, if a motif in a data set always starts either at position 5 or at position 19, the sequences with the motif at position 5 are grouped in a cluster “YQK_5_5” and the sequences with the motif at position 19 in a cluster “YQK_19_19”. These clusters presumably also differ in biological significance. The positions in the tag keep clusters with the same motif but different positions apart.
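The positional restriction can be sketched as a pairwise check. The following Python snippet is an illustration only and not the package's implementation. The function names are hypothetical, the sequences are toy examples, and treating the cutoff as inclusive (a distance of exactly 3 still counts) is an assumption; only the parameter name “motif_distance_cutoff” and its default of three come from the explanation above.

```python
def motif_start_positions(cdr3, motif):
    """Return every 1-based position at which `motif` starts in `cdr3`."""
    positions = []
    start = cdr3.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = cdr3.find(motif, start + 1)
    return positions


def shares_local_motif(cdr3_a, cdr3_b, motif, motif_distance_cutoff=3):
    """Check whether two sequences carry `motif` at similar starting positions.

    Assumption: the pair counts as locally similar if the starting positions
    differ by at most `motif_distance_cutoff` (inclusive comparison).
    """
    starts_a = motif_start_positions(cdr3_a, motif)
    starts_b = motif_start_positions(cdr3_b, motif)
    return any(
        abs(a - b) <= motif_distance_cutoff
        for a in starts_a
        for b in starts_b
    )


# Toy sequences, not real data.
seq_pos5 = "CASSYQKGGF"             # motif starts at position 5
seq_pos6 = "CASSLYQKGF"             # motif starts at position 6
seq_pos19 = "C" + "A" * 17 + "YQK"  # motif starts at position 19

print(shares_local_motif(seq_pos5, seq_pos6, "YQK"))   # positions 5 and 6
print(shares_local_motif(seq_pos5, seq_pos19, "YQK"))  # positions 5 and 19
```

With the default cutoff, the sequences with the motif at positions 5 and 6 pass the check, while the pair at positions 5 and 19 does not, so the latter two would not be linked through this motif.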

For global similarities, depending on the settings of the algorithm, cluster membership can additionally be restricted by the V-gene similarity between sequences or by the nature and similarity of the amino acid that is NOT shared. All of this information is summarized for the user in the tag of the cluster.

I hope I was able to help you answer your question.

Best regards, HetzDra