HetzDra / turboGliph

R implementation of GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots), an algorithm developed by Glanville et al to identify specificity groups in the T cell receptor repertoire based on local (motif sharing) and global (hamming distance) similarities.
18 stars 4 forks source link

Question about gliph2 output #4

Open canuckafar opened 2 years ago

canuckafar commented 2 years ago

Hello and thanks for turboGliph! Here are a couple of issues that I have noted and am trying to understand. I use the included package dataset for reproducibility.

library(turboGliph)
utils::data("gliph_input_data")
res <- gliph2(cdr3_sequences = gliph_input_data,
                   sim_depth = 100, n_cores = parallel::detectCores())

cluster_size and unique_CDR3b

Consider the output contained in the res list of global clusters.

> head(res$global_enrichment)
  cluster_tag cluster_size unique_CDR3b num_in_ref fisher.score aa_at_position TRBV                                  CDR3b
1      %AGVNE            2            2          2 9.356598e-04              A                   CSYAAGVNEQFF CSYAAGVNEQYF
2      %DFYNE            3            3          1 7.907404e-06              E      CSVEDFYNEQFF CSVEDFYNEQLF CSVEDFYNEQSF
3    %DISGYNE            2            2          0 1.585917e-04              E               CSVEDISGYNEQFF CSVEDISGYNEQYF
4      %DSQNE            2            2          0 1.585917e-04              S                   CASSDSQNEQFF CASSDSQNEQYF
5  %EETGEGGHE            2            2          0 1.585917e-04              S           CASSEETGEGGHEQFF CASSEETGEGGHEQYF
6   %EGQVAPGE            2            2          0 1.585917e-04              S             CASSEGQVAPGEQYF CASSEGQVAPGELFF

Consider now the 2nd pattern listed %DFYNE (cluster_tag). The output suggests that there are three unique patients within the cluster (cluster_size), 3 unique CDR3b sequences within the cluster (unique_CDR3b), and something called num_in_ref that I am not sure what it represents.

Now if I look for pattern %DFYNE within the input data, by using "." as a wildcard instead of "%", I retrieve the following sequences.

> match1 <- str_which(gliph_input_data$CDR3b, ".DFYNE") 
> gliph_input_data[match1, ]
            CDR3b TRBV TRBJ CDR3a TRAV TRAJ      patient counts    HLA antigen.species antigen
1737 CSVEDFYNEQFF TRBV TRBJ  <NA> <NA> <NA> subject-1mc4      3 HLA-A2             EBV   BMLF1
1738 CSVEDFYNEQFL TRBV TRBJ  <NA> <NA> <NA> subject-1mc4      3 HLA-A2             EBV   BMLF1
1739 CSVEDFYNEQFS TRBV TRBJ  <NA> <NA> <NA> subject-1mc4      3 HLA-A2             EBV   BMLF1
1740 CSVEDFYNEQLF TRBV TRBJ  <NA> <NA> <NA> subject-1mc4      3 HLA-A2             EBV   BMLF1
1741 CSVEDFYNEQSF TRBV TRBJ  <NA> <NA> <NA> subject-1mc4      3 HLA-A2             EBV   BMLF1

>unique( gliph_input_data[match1,]$CDR3b )
[1] "CSVEDFYNEQFF" "CSVEDFYNEQFL" "CSVEDFYNEQFS" "CSVEDFYNEQLF" "CSVEDFYNEQSF"

This suggests that there are 5 unique TCRs that should have fit in the cluster (not 3). The unique TCRs are all derived from the same patient which means that cluster_size should have been 1 (not 3), for this cluster. Am I missing something here?

Now let's look at another row (row 9 ) of the gliph2 output. I selected this row because of its high num_in_ref value.

> res$global_enrichment[9, ]
  cluster_tag cluster_size unique_CDR3b num_in_ref fisher.score aa_at_position TRBV                     CDR3b
9      %GQGYE            2            2          9  0.008088243              S      CAWSGQGYEQFF CAWSGQGYEQYF

Again, let's find the pattern matches in the input data.

> match2 <- str_which(gliph_input_data$CDR3b, ".GQGYE") 
> gliph_input_data[match2, ]
              CDR3b TRBV TRBJ CDR3a TRAV TRAJ       patient counts    HLA antigen.species antigen
885  CASSQGGQGYEQYF TRBV TRBJ  <NA> <NA> <NA>  subject-1637      3 HLA-A2             flu      M1
1010   CAWSGQGYEQFF TRBV TRBJ  <NA> <NA> <NA>  subject-1637      3 HLA-A2             flu      M1
1011   CAWSGQGYEQYF TRBV TRBJ  <NA> <NA> <NA>  subject-1637      3 HLA-A2             flu      M1
1244  CACAPGQGYEQYF TRBV TRBJ  <NA> <NA> <NA> subject-1mc13      3 HLA-A2             EBV   BMLF1

> unique( gliph_input_data[match2,]$CDR3b )
[1] "CASSQGGQGYEQYF" "CAWSGQGYEQFF"   "CAWSGQGYEQYF"   "CACAPGQGYEQYF" 

Here cluster_size is 2 which is correct in the output. Also 'unique_CDR3b` is 2, but should be 4.

num_in_ref

Looking now at num_in_ref, I am not sure how to interpret it base on the value 1 for the first case above and 9 for the second case. I see nothing in the gliph_input_data that can help. I thought this might represent frequencies of the TCRs in the source data but if you sum counts in the data (which are intriguingly all 3), they cannot explain this variable.

Appreciate any time you may have to look into these issues/questions.

HetzDra commented 2 years ago

Dear @canuckafar,

thank you for your interest in using and understanding turboGliph.

Up front, the GLIPH algorithms are of a complex nature and the functions in turboGliph have accordingly many input parameters to customize the analysis of your data. In the folder vignettes on our github page there is a HTML file that breaks down the complex output of the functions and explains the meaning of the individual lists, data frames and columns. The input parameters of the functions are explained on the help pages of each function. However, we plan to update the vignette soon to present the input parameters there briefly as well, thus providing a complete overview of the usage of the package in one file. But now about clarifying your questions.

Difference between cluster_size and unique_CDR3b

See the following excerpt from the vignette (chapter 6.2.2):

cluster_size : number of sequences in this cluster unique_CDR3b : number of unique sample sequences with this structure

The parameter unique_CDR3b specifies how many unique sequences there are with the current structure (for global patterns I speak of structures, for local patterns of motifs) in your samples. In a cluster, the function gliph2 finally merges all sequences with this structure from your sample, but this time together with additional information, like V-gene or patient ID. Therefore, if an identical sequence occurs more than once in your sample, it may appear more than once in the cluster. The parameter cluster_size now specifies how many members the cluster has.

Output parameter num_in_ref

This parameter specifies how many sequences from the reference database contain the respective pattern. With this value the statistics are calculated afterwards.

Discrepancy for the structure %DFYNE

The difference of three detected unique TCRs by the algorithm compared to the 5 TCRs in the sample is due to the input parameter accept_sequences_with_C_F_start_end. This specifies that only complete CDR3b sequences with C as the first amino acid and F as the last amino acid are used for analysis. The sequences CSVEDFYNEQFL and CSVEDFYNEQFS do not fulfill this criterion, which is why only the three sequences mentioned in the column `res$global_enrichment`$CDR3b are counted.

Discrepancy for the structure %GQGYE

Here it is now relevant that we are dealing with global similarities. During the analysis, the first three and the last three amino acids of a sequence are not considered. In the case of a global similarity, the remaining amino acid sequence with a variable position is now defined as the structure. Only sequences that contain exactly the structure except for the variable position have a global similarity. In your example, the sequences CASSQGGQGYEQYF (structure SQGGQGYE) and CACAPGQGYEQYF (structure APGQGYE) do not correspond exactly to the structure %GQGYE, since the sequences are too long.

canuckafar commented 2 years ago

Thank you for the terrific explanations. This is extremely helpful. Many of these types of details are not found in the original GLIPH and GLIPH2 publications, so this is really great.

canuckafar commented 2 years ago

I have encountered an error when using turboGliph with my dataset. Any idea what the error message below is telling me?

First, here are the structures of my datasets:

> head(tcr_gliph_tum)
# A tibble: 6 × 11
  Sample            CDR3b           TRBV     TRBJ    counts HLA   Arm      Mouse Organ      Organ_short patient                  
  <chr>             <chr>           <chr>    <chr>    <dbl> <chr> <chr>    <dbl> <chr>      <fct>       <chr>                    
1 Mouse1_left_flank CASSEWGASYAEQFF TRBV13-1 TRBJ2-1   1451 H-2KB symphony     1 flank_left FL          Mouse1:symphonyflank_left
2 Mouse1_left_flank CASSQGYSDYTF    TRBV12-2 TRBJ1-2   1237 H-2KB symphony     1 flank_left FL          Mouse1:symphonyflank_left
3 Mouse1_left_flank CASSQEGDTNSPLYF TRBV2-1  TRBJ1-6   1029 H-2KB symphony     1 flank_left FL          Mouse1:symphonyflank_left
4 Mouse1_left_flank CASSDKGQDTQYF   TRBV13-3 TRBJ2-5    931 H-2KB symphony     1 flank_left FL          Mouse1:symphonyflank_left
5 Mouse1_left_flank CASSLVTANSDYTF  TRBV3-1  TRBJ1-2    690 H-2KB symphony     1 flank_left FL          Mouse1:symphonyflank_left
6 Mouse1_left_flank CASSRDRGDTQYF   TRBV17-1 TRBJ2-5    665 H-2KB symphony     1 flank_left FL          Mouse1:symphonyflank_left

> head(reference_list_mouse$cd4$refseqs)
# A tibble: 6 × 3
  CDR3b          TRBV      TRBJ    
  <chr>          <chr>     <chr>   
1 CAAGDWGGAETLYF mTRBV31   mTRBJ2-3
2 CAAGGAQDTQYF   mTRBV13-1 mTRBJ2-5
3 CAAGTGNTEVFF   mTRBV13-1 mTRBJ1-1
4 CAAQGAVNERLFF  mTRBV13-1 mTRBJ1-4
5 CAAQRGQNTGQLYF mTRBV13-1 mTRBJ2-2
6 CACDREGEVFF    mTRBV1    mTRBJ1-1

Now the call to the gliph2 function.

> gl2 <- gliph2(cdr3_sequences = tcr_gliph_tum, 
+               result_folder = '../data/meta/gliph2',
+               refdb_beta = data.frame(reference_list_mouse$cd4$refseqs),
+               n_cores = 32)
Notification: cdr3_sequences is a data frame and the column named "CDR3b" is considered as cdr3 beta sequences.
Notification: Column of reference database named 'CDR3b' is considered as cdr3 sequences.
Warning: Reference database must have more sequences than input data for the significance analysis to make sense and work.
Part 1: Searching for local similarities.
154 significantly enriched motifs found in sample set.
Part 1 cpu time: 10.65799 secs 

Part 2: Searching for global similarities.
17645 clusters based on global similarity found in sample set.
Part 2 cpu time: 29.37489 secs 

Part 3: Clustering sequences.
Part 3 cpu time: 59.75831 secs 

Part 4: Scoring convergence groups
Error in { : task 1 failed - "missing value where TRUE/FALSE needed"

Of note, this dataset (tcr_gliph_tum) is a subset of a much larger dataset did work on the gliph2 website. However, I wanted to use functions in your package, like plotting, and the output from the website is very different in structure to that of turboGliph, hence the desire to use your function.

HetzDra commented 2 years ago

Dear @canuckafar,

I am sorry that an error occurred in the package during your use. After extensive analysis, I was able to reproduce your listed error message by introducing NA values in the HLA or patient information. This error has now been fixed and the function is able to handle NA values.

Can you please confirm if NA values were present in your input and the analysis is now fully working again?

The package has already undergone extensive testing. If you receive further error messages, please contact me so that I can fix them.

Best, Jan

canuckafar commented 2 years ago

Jan,

First, you do not need to be sorry about this kind of thing. I can only imagine how many hours it took to build this software that you provide to people like me for free.

Second, you were correct about missing HLA and patient IDs being the issue. I removed the input rows with missing data and gliph2 ran great. For my dataset it took about 40 min on a 32 core PC to get the results. The network plot is now the item that I am working on. Seems to take quite a white to generate for my dataset. Will try restricting the size of the clusters.

Out of curiosity, is there a way to calculate a similarity or dissimilarity matrix between the clusters so that I can get a proper network plot?

BI

canuckafar commented 2 years ago

Jan,

One issue with GLIPH2 (in general, not your implementation) is that it can find many TCR clusters, the vast majority of which of are dubious relevance to biology. Chiou et al. have suggested that clusters that might be more relevant are those that:

  1. Contain more unique clonotypes (TCR sequences)
  2. Occur in more unique individuals (patients or samples)
  3. Show more V-gene bias
  4. Contain clonotypes that are expanded in tumor versus normal tissue

I think this is reasonable advice. In my project, even when setting cluster_min_size = 5 in turboGliph::gliph2, I have 12,777 clusters.

> gl2 <- gliph2(cdr3_sequences = tcr_gliph, 
              result_folder = '../data/meta/gliph2_turbo',
              cluster_min_size = 5,
              refdb_beta = data.frame(reference_list_mouse$cd4$refseqs),
              n_cores = 32)

> summary(gl2$cluster_properties$cluster_size)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.000   5.000   7.000   9.054  10.000 271.000 

> hist(gl2$cluster_properties$cluster_size, breaks = 100, las = 1, probability = T,
     xlab = 'Cluster size', xlim = c(0, 100), main = NA)
> lines(density(gl2$cluster_properties$cluster_size), col = 'black')

The distribution of cluster sizes is shown below.

Screenshot from 2022-10-02 17-27-09

When I try to plot my data with plot_network, R becomes unresponsive, even with 32 cores allocated to the process and setting the minimum cluster size to 25.

> plot_network(gl2, cluster_min_size = 25, n_cores = 32)

I have a couple of additional questions and suggestions for you if you don't mind:

1) Any advice on getting plot_network to work better with a large dataset like mine?

2) Is there a way to subset the output of gliph2 to meet the 4 criteria noted above (or other criteria that a user might want) prior to plotting? For example, a function select_gliph that could select clusters that meet certain criteria from the cluster_properties list like cluster_size, fisher.score, vgene.score, etc....? This would be very helpful to use for downstream analyses and for subsetting prior to passing the data to plot_network. It would be great if it worked like dplyr, using my data above as an example

gl2_subset <- gl2 %>%
   select_gliph(vgene.score < 0.05, cluster_size >10, fisher.score < 0.001)
plot_network(gl2_subset)

3) Is there a way to add a property to the cluster_properties that is "unique_patients" that counts the number of unique patients/samples in the cluster? For example, a cluster that has 10 unique T cell clonotypes (by aa sequence) all arising from one patient's blood sample is different than a cluster that has 10 unique T cell clonotypes arising from 5 unique patient blood samples. This gets to the public vs private clonotype thing. This is different than comparing to the reference dataset too, which might not contain TCRs related to the those in the sample. For instance, I work in oncology. The T cells that I sequence from within cancers are likely quite different than those in the blood of patients with COVID or other infections. Since reference databases of TCR sequences and antigens are often from infection-related research, their applicability to oncology is unclear. Therefore, when I compare clonotypes from tumor-infiltrating T cells to reference T cells from infection patients, I likely am not going to find much similarity and thus num_in_ref will be small. But between my samples, all from similar patients or even from multiple sites within a single patient, there may be shared/public immune repertoires that might explain clinical features. For example, if responders to treatment had high levels of T cells from a particular cluster but non-responders did not, this might tell us about what the target tumor antigen is. Anyway, just a suggestion that would be pretty easy to implement I think.

BI