Open canuckafar opened 2 years ago
Dear @canuckafar,
thank you for your interest in using and understanding turboGliph.
Up front, the GLIPH algorithms are of a complex nature and the functions in turboGliph have accordingly many input parameters to customize the analysis of your data. In the folder vignettes on our github page there is a HTML file that breaks down the complex output of the functions and explains the meaning of the individual lists, data frames and columns. The input parameters of the functions are explained on the help pages of each function. However, we plan to update the vignette soon to present the input parameters there briefly as well, thus providing a complete overview of the usage of the package in one file. But now about clarifying your questions.
cluster_size
and unique_CDR3b
See the following excerpt from the vignette (chapter 6.2.2):
cluster_size : number of sequences in this cluster unique_CDR3b : number of unique sample sequences with this structure
The parameter unique_CDR3b
specifies how many unique sequences there are with the current structure (for global patterns I speak of structures, for local patterns of motifs) in your samples. In a cluster, the function gliph2 finally merges all sequences with this structure from your sample, but this time together with additional information, like V-gene or patient ID. Therefore, if an identical sequence occurs more than once in your sample, it may appear more than once in the cluster. The parameter cluster_size
now specifies how many members the cluster has.
num_in_ref
This parameter specifies how many sequences from the reference database contain the respective pattern. With this value the statistics are calculated afterwards.
The difference of three detected unique TCRs by the algorithm compared to the 5 TCRs in the sample is due to the input parameter accept_sequences_with_C_F_start_end. This specifies that only complete CDR3b sequences with C as the first amino acid and F as the last amino acid are used for analysis. The sequences CSVEDFYNEQFL and CSVEDFYNEQFS do not fulfill this criterion, which is why only the three sequences mentioned in the column `res$global_enrichment`$CDR3b
are counted.
Here it is now relevant that we are dealing with global similarities. During the analysis, the first three and the last three amino acids of a sequence are not considered. In the case of a global similarity, the remaining amino acid sequence with a variable position is now defined as the structure. Only sequences that contain exactly the structure except for the variable position have a global similarity. In your example, the sequences CASSQGGQGYEQYF (structure SQGGQGYE) and CACAPGQGYEQYF (structure APGQGYE) do not correspond exactly to the structure %GQGYE, since the sequences are too long.
Thank you for the terrific explanations. This is extremely helpful. Many of these types of details are not found in the original GLIPH and GLIPH2 publications, so this is really great.
I have encountered an error when using turboGliph with my dataset. Any idea what the error message below is telling me?
First, here are the structures of my datasets:
> head(tcr_gliph_tum)
# A tibble: 6 × 11
Sample CDR3b TRBV TRBJ counts HLA Arm Mouse Organ Organ_short patient
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <fct> <chr>
1 Mouse1_left_flank CASSEWGASYAEQFF TRBV13-1 TRBJ2-1 1451 H-2KB symphony 1 flank_left FL Mouse1:symphonyflank_left
2 Mouse1_left_flank CASSQGYSDYTF TRBV12-2 TRBJ1-2 1237 H-2KB symphony 1 flank_left FL Mouse1:symphonyflank_left
3 Mouse1_left_flank CASSQEGDTNSPLYF TRBV2-1 TRBJ1-6 1029 H-2KB symphony 1 flank_left FL Mouse1:symphonyflank_left
4 Mouse1_left_flank CASSDKGQDTQYF TRBV13-3 TRBJ2-5 931 H-2KB symphony 1 flank_left FL Mouse1:symphonyflank_left
5 Mouse1_left_flank CASSLVTANSDYTF TRBV3-1 TRBJ1-2 690 H-2KB symphony 1 flank_left FL Mouse1:symphonyflank_left
6 Mouse1_left_flank CASSRDRGDTQYF TRBV17-1 TRBJ2-5 665 H-2KB symphony 1 flank_left FL Mouse1:symphonyflank_left
> head(reference_list_mouse$cd4$refseqs)
# A tibble: 6 × 3
CDR3b TRBV TRBJ
<chr> <chr> <chr>
1 CAAGDWGGAETLYF mTRBV31 mTRBJ2-3
2 CAAGGAQDTQYF mTRBV13-1 mTRBJ2-5
3 CAAGTGNTEVFF mTRBV13-1 mTRBJ1-1
4 CAAQGAVNERLFF mTRBV13-1 mTRBJ1-4
5 CAAQRGQNTGQLYF mTRBV13-1 mTRBJ2-2
6 CACDREGEVFF mTRBV1 mTRBJ1-1
Now the call to the gliph2 function.
> gl2 <- gliph2(cdr3_sequences = tcr_gliph_tum,
+ result_folder = '../data/meta/gliph2',
+ refdb_beta = data.frame(reference_list_mouse$cd4$refseqs),
+ n_cores = 32)
Notification: cdr3_sequences is a data frame and the column named "CDR3b" is considered as cdr3 beta sequences.
Notification: Column of reference database named 'CDR3b' is considered as cdr3 sequences.
Warning: Reference database must have more sequences than input data for the significance analysis to make sense and work.
Part 1: Searching for local similarities.
154 significantly enriched motifs found in sample set.
Part 1 cpu time: 10.65799 secs
Part 2: Searching for global similarities.
17645 clusters based on global similarity found in sample set.
Part 2 cpu time: 29.37489 secs
Part 3: Clustering sequences.
Part 3 cpu time: 59.75831 secs
Part 4: Scoring convergence groups
Error in { : task 1 failed - "missing value where TRUE/FALSE needed"
Of note, this dataset (tcr_gliph_tum) is a subset of a much larger dataset did work on the gliph2 website. However, I wanted to use functions in your package, like plotting, and the output from the website is very different in structure to that of turboGliph, hence the desire to use your function.
Dear @canuckafar,
I am sorry that an error occurred in the package during your use. After extensive analysis, I was able to reproduce your listed error message by introducing NA values in the HLA or patient information. This error has now been fixed and the function is able to handle NA values.
Can you please confirm if NA values were present in your input and the analysis is now fully working again?
The package has already undergone extensive testing. If you receive further error messages, please contact me so that I can fix them.
Best, Jan
Jan,
First, you do not need to be sorry about this kind of thing. I can only imagine how many hours it took to build this software that you provide to people like me for free.
Second, you were correct about missing HLA and patient IDs being the issue. I removed the input rows with missing data and gliph2 ran great. For my dataset it took about 40 min on a 32 core PC to get the results. The network plot is now the item that I am working on. Seems to take quite a white to generate for my dataset. Will try restricting the size of the clusters.
Out of curiosity, is there a way to calculate a similarity or dissimilarity matrix between the clusters so that I can get a proper network plot?
BI
Jan,
One issue with GLIPH2 (in general, not your implementation) is that it can find many TCR clusters, the vast majority of which of are dubious relevance to biology. Chiou et al. have suggested that clusters that might be more relevant are those that:
I think this is reasonable advice. In my project, even when setting cluster_min_size = 5
in turboGliph::gliph2
, I have 12,777 clusters.
> gl2 <- gliph2(cdr3_sequences = tcr_gliph,
result_folder = '../data/meta/gliph2_turbo',
cluster_min_size = 5,
refdb_beta = data.frame(reference_list_mouse$cd4$refseqs),
n_cores = 32)
> summary(gl2$cluster_properties$cluster_size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.000 5.000 7.000 9.054 10.000 271.000
> hist(gl2$cluster_properties$cluster_size, breaks = 100, las = 1, probability = T,
xlab = 'Cluster size', xlim = c(0, 100), main = NA)
> lines(density(gl2$cluster_properties$cluster_size), col = 'black')
The distribution of cluster sizes is shown below.
When I try to plot my data with plot_network
, R becomes unresponsive, even with 32 cores allocated to the process and setting the minimum cluster size to 25.
> plot_network(gl2, cluster_min_size = 25, n_cores = 32)
I have a couple of additional questions and suggestions for you if you don't mind:
1) Any advice on getting plot_network
to work better with a large dataset like mine?
2) Is there a way to subset the output of gliph2
to meet the 4 criteria noted above (or other criteria that a user might want) prior to plotting? For example, a function select_gliph
that could select clusters that meet certain criteria from the cluster_properties list like cluster_size, fisher.score, vgene.score, etc....? This would be very helpful to use for downstream analyses and for subsetting prior to passing the data to plot_network
. It would be great if it worked like dplyr, using my data above as an example
gl2_subset <- gl2 %>%
select_gliph(vgene.score < 0.05, cluster_size >10, fisher.score < 0.001)
plot_network(gl2_subset)
3) Is there a way to add a property to the cluster_properties that is "unique_patients" that counts the number of unique patients/samples in the cluster? For example, a cluster that has 10 unique T cell clonotypes (by aa sequence) all arising from one patient's blood sample is different than a cluster that has 10 unique T cell clonotypes arising from 5 unique patient blood samples. This gets to the public vs private clonotype thing. This is different than comparing to the reference dataset too, which might not contain TCRs related to the those in the sample. For instance, I work in oncology. The T cells that I sequence from within cancers are likely quite different than those in the blood of patients with COVID or other infections. Since reference databases of TCR sequences and antigens are often from infection-related research, their applicability to oncology is unclear. Therefore, when I compare clonotypes from tumor-infiltrating T cells to reference T cells from infection patients, I likely am not going to find much similarity and thus num_in_ref
will be small. But between my samples, all from similar patients or even from multiple sites within a single patient, there may be shared/public immune repertoires that might explain clinical features. For example, if responders to treatment had high levels of T cells from a particular cluster but non-responders did not, this might tell us about what the target tumor antigen is. Anyway, just a suggestion that would be pretty easy to implement I think.
BI
Hello and thanks for turboGliph! Here are a couple of issues that I have noted and am trying to understand. I use the included package dataset for reproducibility.
cluster_size and unique_CDR3b
Consider the output contained in the res list of global clusters.
Consider now the 2nd pattern listed %DFYNE (
cluster_tag
). The output suggests that there are three unique patients within the cluster (cluster_size
), 3 unique CDR3b sequences within the cluster (unique_CDR3b
), and something callednum_in_ref
that I am not sure what it represents.Now if I look for pattern %DFYNE within the input data, by using "." as a wildcard instead of "%", I retrieve the following sequences.
This suggests that there are 5 unique TCRs that should have fit in the cluster (not 3). The unique TCRs are all derived from the same patient which means that
cluster_size
should have been 1 (not 3), for this cluster. Am I missing something here?Now let's look at another row (row 9 ) of the gliph2 output. I selected this row because of its high
num_in_ref
value.Again, let's find the pattern matches in the input data.
Here
cluster_size
is 2 which is correct in the output. Also 'unique_CDR3b` is 2, but should be 4.num_in_ref
Looking now at
num_in_ref
, I am not sure how to interpret it base on the value 1 for the first case above and 9 for the second case. I see nothing in thegliph_input_data
that can help. I thought this might represent frequencies of the TCRs in the source data but if you sumcounts
in the data (which are intriguingly all 3), they cannot explain this variable.Appreciate any time you may have to look into these issues/questions.