antigenomics / vdjdb-db

🗂️ [vdjdb.cdr3.net is up and running] Git-based TCR database storage & management. Submissions welcome!
https://vdjdb.cdr3.net

PMID:28423320 #226

Open mikessh opened 6 years ago

mikessh commented 6 years ago

Chen G, Yang X, Ko A, Sun X, Gao M, Zhang Y, Shi A, Mariuzza RA, Weng NP. Sequence and Structural Analyses Reveal Distinct and Highly Diverse Human CD8+ TCR Repertoires to Immunodominant Viral Antigens. Cell Rep. 2017 Apr 18;19(3):569-583. doi: 10.1016/j.celrep.2017.03.072.

RenskeVroomans commented 6 years ago

These sequences consistently lack the starting C and final F. Can I simply add these, or should I check each one against the nucleotide sequence via the provided accession number? Also, the sequence data from all donors are pooled, and the reported frequencies are computed over these pooled data. As a result, many sequences have a frequency lower than 0.1%. Given that the authors claim to use high-quality read data (and used UMIs), should I also include these low-frequency sequences, and if not, what is an appropriate cut-off in this case?

mikessh commented 6 years ago

Looks like I've missed that. The C/F issue is not a problem: as this is human, we just add them (mice have C/W in some J segments, if I remember correctly). Yes, let's put everything in the first submission. We should carefully mark paired/single-cell records.
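A minimal sketch of restoring the trimmed residues, assuming every input CDR3 in this dataset has had exactly the leading C and the trailing F removed, as described above. The function name `add_conserved_residues` and its parameters are hypothetical and not part of the VDJdb tooling. The residues are added unconditionally, because a trimmed sequence can itself legitimately begin with C or end with F, so a "skip if already present" check could leave some records incomplete.

```python
def add_conserved_residues(cdr3: str, first: str = "C", last: str = "F") -> str:
    """Re-attach the conserved first and last CDR3 residues.

    Assumption: the input was trimmed of exactly one residue at each end.
    For mouse J segments ending in W, pass last="W" (hypothetical usage).
    """
    core = cdr3.strip().upper()
    if not core:
        raise ValueError("empty CDR3 sequence")
    return first + core + last


# Example with a made-up trimmed human TRB CDR3.
restored = add_conserved_residues("ASSLAPGATNEKLF")
print(restored)
```

Records restored this way could still be spot-checked against the nucleotide accessions if any of them look unusual.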

RenskeVroomans commented 6 years ago

OK, I am almost done with this. I had a hard time figuring out which methods were applied where, so here are some debugging questions:

mikessh commented 6 years ago

  1. One can try to perform a frequency-based pairing, i.e. the top TRB and TRA are likely to be paired if they have approximately the same frequency. It is better to put this in as a comment if there is no direct evidence from single-cell data or another means of pairing.
  2. This is amplicon-seq: 5'RACE with UMI tagging, then high-throughput sequencing.
  3. Seems correct.
  4. Please call this method direct. I will add to the specification (README.md) that direct also means direct TCR affinity measurement.
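
The frequency-based pairing idea from point 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: clonotypes are given as `(cdr3, frequency)` tuples, chains are matched rank by rank, and the `max_fold` tolerance and `top_n` limit are arbitrary hypothetical parameters. Every inferred pair carries a comment saying there is no direct pairing evidence.

```python
from typing import List, Tuple

Clonotype = Tuple[str, float]  # (CDR3 amino acid sequence, frequency)

PAIRING_COMMENT = "frequency-based pairing, no direct evidence"


def pair_by_frequency(
    tra: List[Clonotype],
    trb: List[Clonotype],
    max_fold: float = 2.0,  # hypothetical tolerance on the frequency ratio
    top_n: int = 5,         # hypothetical limit: only consider top clonotypes
) -> List[Tuple[str, str, str]]:
    """Pair TRA and TRB clonotypes of the same rank and similar frequency."""
    tra_top = sorted(tra, key=lambda c: c[1], reverse=True)[:top_n]
    trb_top = sorted(trb, key=lambda c: c[1], reverse=True)[:top_n]
    pairs = []
    for (a_cdr3, a_freq), (b_cdr3, b_freq) in zip(tra_top, trb_top):
        if a_freq <= 0 or b_freq <= 0:
            continue  # cannot compare frequencies that are zero or negative
        fold = max(a_freq, b_freq) / min(a_freq, b_freq)
        if fold <= max_fold:
            pairs.append((a_cdr3, b_cdr3, PAIRING_COMMENT))
    return pairs


# Made-up example data: only the top-ranked chains have similar frequency.
tra_example = [("CAVSDF", 0.40), ("CAASGF", 0.05)]
trb_example = [("CASSLF", 0.35), ("CASSQF", 0.30)]
print(pair_by_frequency(tra_example, trb_example))
```

Rank-by-rank matching is the simplest possible choice here; it is not a substitute for single-cell data, which is why the comment field is kept on every pair.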

mikessh commented 6 years ago

I've just discovered extremely strange artefacts: cysteines inside CDR3, strange J alignments. I've downloaded the raw data and am re-analyzing it. I should replace the chunk soon (except for paired sequences).

RenskeVroomans commented 6 years ago

My apologies, I did not notice this before.

mikessh commented 6 years ago

No problems, I've also missed this, quite hard to check all 10k sequences :)

mikessh commented 6 years ago

After checking, it looks like those Cys codons are mostly in the N-region of TRA CDR3s, so perhaps they are real. They have quite low frequency (<0.1%), but are supported by a large number of reads (say 100-300 raw reads). I've also checked two donors that gave an extremely diverse repertoire (one had 3k unique clonotypes): none of the top clonotypes appear to be some sort of "public" TCRs, so they are likely genuinely antigen-specific.
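For anyone repeating this check, a minimal sketch of flagging CDR3s with a cysteine inside the sequence, assuming amino acid CDR3 strings that include the conserved first and last residues. The function name is hypothetical, and this only flags records for manual review; it does not decide whether they are artefacts.

```python
def has_internal_cysteine(cdr3: str) -> bool:
    """Return True if a Cys occurs anywhere except the first and last position."""
    seq = cdr3.strip().upper()
    return "C" in seq[1:-1]


# Made-up example sequences.
for seq in ["CAVCDNYGQNFVF", "CAVRDNYGQNFVF"]:
    print(seq, has_internal_cysteine(seq))
```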

So perhaps we should not do anything here, or just remove all records with low (<=10^-4) frequency.
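
If the second option is chosen, the cut-off could be applied along these lines. This is a sketch only: it assumes each record is a dict with a numeric `"freq"` field, which is a hypothetical layout and not the actual chunk format, and it drops records at or below the 10^-4 threshold mentioned above.

```python
from typing import Dict, List

FREQ_CUTOFF = 1e-4  # records with frequency <= this value are removed


def drop_low_frequency(records: List[Dict], cutoff: float = FREQ_CUTOFF) -> List[Dict]:
    """Keep only records whose frequency is strictly above the cutoff."""
    return [r for r in records if r["freq"] > cutoff]


# Made-up example records.
example = [
    {"cdr3": "CASSLF", "freq": 0.02},
    {"cdr3": "CASSQF", "freq": 1e-4},
    {"cdr3": "CAVSDF", "freq": 5e-5},
]
print(drop_low_frequency(example))
```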

PS. The authors' CDR3s (mapped using MIGEC) and the CDR3s I mapped from the raw data using MiXCR appear to be more or less the same.