PMID:28423320

mikessh commented 6 years ago

Chen G, Yang X, Ko A, Sun X, Gao M, Zhang Y, Shi A, Mariuzza RA, Weng NP. Sequence and Structural Analyses Reveal Distinct and Highly Diverse Human CD8+ TCR Repertoires to Immunodominant Viral Antigens. Cell Rep. 2017 Apr 18;19(3):569-583. doi: 10.1016/j.celrep.2017.03.072.

RenskeVroomans commented 6 years ago

These sequences consistently lack the starting C and final F. Can I simply add these, or should I check each one against the nucleotide sequence via the provided accession number? Also, the sequence data from all donors are pooled, and the frequencies given are computed over these pooled data. This means there are a lot of reads with a frequency lower than 0.1%. Given that they claim to use high-quality read data (and used UMIs), should I also include these low-frequency reads, and if not, what is an appropriate cut-off in this case?

mikessh commented 6 years ago

Looks like I missed that. The C/F issue is not a problem: as this is human, we just add them (mice have C/W in some J segments, if I remember correctly). Yes, let's put everything in the first submission. We should carefully mark paired/single-cell records.
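For illustration, a minimal sketch of adding the anchors back. It assumes every input CDR3 is missing both residues, as reported above, so it prepends and appends unconditionally; the function name and the example sequence are made up for this sketch.

```python
def restore_cdr3_anchors(trimmed_cdr3: str) -> str:
    """Add the conserved leading C and trailing F to a trimmed human CDR3.

    Assumption: the input lacks BOTH anchors. The residues are added
    unconditionally, because a trimmed sequence can legitimately start
    with C or end with F on its own.
    """
    seq = trimmed_cdr3.strip().upper()
    if not seq:
        raise ValueError("empty CDR3 sequence")
    return "C" + seq + "F"


# Hypothetical trimmed TRB CDR3, for illustration only
print(restore_cdr3_anchors("ASSLGQAYEQY"))
```

For mouse records the trailing residue would need to be chosen per J segment rather than hard-coded.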

RenskeVroomans commented 6 years ago

OK, I am almost done with this. I had a hard time figuring out which methods were applied where, so here are some debugging questions:

1. Their biggest table, S3, shows the alpha and beta chain data separately, without donor information. I took this to mean that these are not single-cell sequences. However, for some of the sequences they do indicate pairings between alpha and beta, and I am unsure how they obtained them. There seems to be some match between the pairs listed in S3 and those found by single-cell sequencing in S5, but S3 contains many more pairs. Perhaps I could list these pair indications as a comment, rather than actually pairing them?
2. Sequencing was done as in Shugay et al 2014 --> rna-seq?
3. Their methods are somewhat confusing, as they analyse sequences both from cells sorted directly from PBMCs and from cells that were stimulated with antigen and then sorted. I think the sequencing data only comes from the latter group, and I have therefore listed all sequences as "antigen-loaded-targets,dextramer-sort".
4. I merged data from table S5 and table 1, as only a few sequences in table 1 were not listed in S5. For the sequences in table 1, TCR binding was tested with surface plasmon resonance. Is this a viable option for the methods.verification column?

mikessh commented 6 years ago

1. One can try to perform a frequency-based pairing, i.e. the top TRB and TRA are likely to be paired if they have approximately the same frequency. Better to put this as a comment if there is no direct evidence from single-cell sequencing or some other means of pairing.
2. This is amplicon-seq: 5'RACE with UMI-tagging, then high-throughput sequencing.
3. Seems correct.
4. Please call this method direct. I will add to the specification (README.md) that direct also means direct TCR affinity measurement.
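The frequency-based pairing mentioned in the first answer could be sketched roughly as below. This is only an illustration of the heuristic, not a validated method: the rank-by-rank matching, the `max_ratio` tolerance, and the example clonotypes are all assumptions made for this sketch.

```python
def pair_by_frequency(tra, trb, max_ratio=2.0):
    """Suggest TRA/TRB pairs by matching frequency ranks.

    tra, trb: lists of (cdr3, frequency) tuples for one sample.
    Chains are ranked by descending frequency and matched rank by rank.
    Matching stops at the first rank where the two frequencies differ
    by more than max_ratio (a hypothetical tolerance).
    """
    tra_ranked = sorted(tra, key=lambda rec: rec[1], reverse=True)
    trb_ranked = sorted(trb, key=lambda rec: rec[1], reverse=True)
    pairs = []
    for (a_cdr3, a_freq), (b_cdr3, b_freq) in zip(tra_ranked, trb_ranked):
        low, high = min(a_freq, b_freq), max(a_freq, b_freq)
        if low <= 0 or high / low > max_ratio:
            break  # frequencies diverge; lower ranks are not trustworthy
        pairs.append((a_cdr3, b_cdr3))
    return pairs


# Made-up clonotypes and frequencies, for illustration only
tra = [("CAVNF", 0.40), ("CAGSF", 0.05)]
trb = [("CASSLF", 0.35), ("CASTQF", 0.30)]
print(pair_by_frequency(tra, trb))
```

Pairs suggested this way are only candidates, which is why they belong in the comment field rather than in paired records.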

mikessh commented 6 years ago

I've just discovered some extremely strange artefacts: cysteines inside CDR3 and strange J alignments. I have downloaded the raw data and am re-analyzing it. I should be able to replace the chunk soon (except for the paired sequences).
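A quick way to flag such records, as a minimal sketch: it assumes each CDR3 is an amino acid string that already starts with the conserved cysteine, so any further C counts as internal. The function name and example sequences are invented for illustration.

```python
def has_internal_cys(cdr3: str) -> bool:
    """Return True if a CDR3 contains a cysteine after its first residue.

    Assumption: position 0 is the conserved anchor C, so it is skipped.
    """
    return "C" in cdr3.upper()[1:]


# Made-up sequences, for illustration only
candidates = ["CASSLGQAYEQYF", "CAVCDNYGQNFVF"]
flagged = [seq for seq in candidates if has_internal_cys(seq)]
print(flagged)
```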

RenskeVroomans commented 6 years ago

My apologies, I did not notice this before

mikessh commented 6 years ago

No problem, I missed this too; it is quite hard to check all 10k sequences :)

mikessh commented 6 years ago

After checking, it looks like those Cys codons are mostly in the N-region of TRA CDR3s, so perhaps they are real. They have quite a low frequency (<0.1%), but are supported by a large number of reads (say 100-300 raw reads). I've also checked two donors that gave an extremely diverse repertoire (one had 3k unique clonotypes): none of the top clonotypes appear to be some sort of "public" TCRs, so they are likely to be genuinely antigen-specific.

So perhaps we should not do anything here, or just remove all records with a low (<=10^-4) frequency.
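If we go with the cut-off, the filter could look like this minimal sketch. The record layout (a dict with a `freq` key holding a fraction, not a percentage) is an assumption for illustration and not the actual chunk format.

```python
MIN_FREQ = 1e-4  # records at or below this frequency are dropped


def drop_low_frequency(records, min_freq=MIN_FREQ):
    """Keep only records whose frequency is strictly above min_freq.

    records: list of dicts with a 'freq' key (hypothetical layout).
    """
    return [rec for rec in records if rec["freq"] > min_freq]


# Made-up records, for illustration only
records = [
    {"cdr3": "CASSLGQAYEQYF", "freq": 2e-3},
    {"cdr3": "CAVCDNYGQNFVF", "freq": 1e-4},
    {"cdr3": "CAGSF", "freq": 5e-5},
]
print(drop_low_frequency(records))
```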

PS. The authors' CDR3s (mapped using MIGEC) and the CDR3s I mapped from the raw data using MiXCR appear to be more or less the same.

antigenomics / vdjdb-db

PMID:28423320 #226