The primary goal of VDJdb is to facilitate access to existing information on T-cell receptor antigen specificities, i.e. the ability to recognize certain epitopes in certain MHC contexts.
Our mission is to both aggregate the scarce TCR specificity information available so far and to create a curated repository to store such data.
In addition to routine database updates providing the most up-to-date information, we do our best to ensure data consistency and fight irregularities in TCR specificity reporting with a complex database validation scheme.
This repository hosts the submissions to the database and the scripts to check, fix and build the database itself.
To build the database directly from submissions, go to the `src` directory and run the `groovy -cp . BuildDatabase.groovy` script (requires Groovy).
To query the database for your immune repertoire sample(s) use the vdjmatch software.
A web-based GUI for the database can be found in VDJdb-web repository.
Please cite the database using the most recent paper: Mikhail Goncharov, Dmitry Bagaev, Dmitrii Shcherbinin, Ivan Zvyagin, Dmitry Bolotin, Paul G. Thomas, Anastasia A. Minervina, Mikhail V. Pogorelyy, Kristin Ladell, James E. McLaren, David A. Price, Thi H. O. Nguyen, Louise C. Rowntree, E. Bridie Clemens, Katherine Kedzierska, Garry Dolton, Cristina Rafael Rius, Andrew Sewell, Jerome Samir, Fabio Luciani, Ksenia V. Zornikova, Alexandra A. Khmelevskaya, Saveliy A. Sheetikov, Grigory A. Efimov, Dmitry Chudakov & Mikhail Shugay. VDJdb in the pandemic era: a compendium of T cell receptors specific for SARS-CoV-2. Nature Methods 2022. doi:10.1038/s41592-022-01578-0.
To submit previously published sequences, follow the steps below:
Create an issue (or issues) labeled `paper` and named by the paper's PubMed id, `PMID:XXXXXXX`. Note that if the paper is a meta-study, you can mark it as `meta-paper` and link the issues for its references in a reply to this issue. Also note that when submitting unpublished sequences, you can choose any appropriate issue name and give details on the submitter (name, organization, etc.) in the issue comments.
Create a new branch and add chunk(s) for the corresponding papers, named `PMID_XXXXXXX`. Don't forget to close/reference the corresponding issues in the commit message.
Create a pull request for the branch and check that it passes the CI build. If there are any issues, resolve them by fixing/removing entries as necessary.
The structure of a submission chunk is provided below, but first a couple of notes:
STYLE Try to avoid spaces (e.g. `TRBV7,TRBV5`, not `TRBV7, TRBV5`) and leave fields that have no information blank (don't use any placeholder). Stick to the listed field values at all costs! In case a critical part of your submission doesn't fit the current specification: 1) create an issue in the issues section (and tag it as `maintainance`), 2) provide us with an example (e.g. open a pull request). Do not insert critical information into the comment field.
FORMAT Please ensure that Variable/Joining and MHC names in your submission come from IMGT nomenclature (this does not apply to donor MHC typing fields).
The `BuildDatabase` routine, which is executed during CI tests upon each submission and prior to every database release, implements table format checks, CDR3 sequence checks and fixes (if possible), and VDJdb confidence score assignment (see below).
To view the list of papers that were not yet processed follow here.
An XLS template is available here.
CAUTION Make sure that nothing is messed up (`x/X` frequencies transformed to dates, bad encoding, etc.) when importing from the XLS template. The format of all fields is pre-set to text to prevent this.
Each database submission in the `chunks/` folder should have the following header and columns:
These columns convey full information about TCR:peptide:MHC complex and are mandatory for any submission.
column name | description |
---|---|
cdr3.alpha | TCR alpha CDR3 amino acid sequence. Complete sequence starting with C and ending with F/W should be provided if possible. Trimmed sequences will be fixed at database building stage in case sufficient V/J germline parts are present |
v.alpha | TCR alpha Variable (V) segment id, up to best resolution possible (TRAVX*XX , e.g. TRAV7 , TRAV7*01 , TRAV7*02 ...). Strictly IMGT nomenclature. Can be left blank if unknown. |
j.alpha | TCR alpha Joining (J) segment id |
cdr3.beta | TCR beta CDR3 amino acid sequence |
v.beta | TCR beta V segment id |
j.beta | TCR beta J segment id |
species | TCR parent species (HomoSapiens , MusMusculus ,...) |
mhc.a | First MHC chain allele, to best resolution possible, HLA-X*XX:XX , e.g. HLA-A*02:01 |
mhc.b | Second MHC chain allele (B2M for MHCI) |
mhc.class | MHCI or MHCII |
antigen.epitope | Amino acid sequence of the epitope |
antigen.gene | Parent gene of the epitope sequence (e.g. pp24 ) |
antigen.species | Parent species of the antigen, to the best clade resolution possible (e.g. HIV-1 , HIV-1*HXB2 ) |
reference.id | Pubmed id, doi, etc |
submitter | Name of submitting person/organization |
Notes:
If a record represents a clonotype with either the TCR alpha or beta sequence unknown, the missing CDR3/V/(D)/J fields should be left blank.
V/(D)/J fields can be left blank; however, this disables the CDR3 fixing/verification procedure for that record.
Every record must have at least one non-blank CDR3 field (alpha or beta).
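The rules above can be pre-checked before opening a pull request. The following is a minimal sketch, not the actual `BuildDatabase` validation: it assumes the chunk is a tab-separated text file, and `check_chunk` is a hypothetical helper name introduced here for illustration.

```python
import csv
import io

# Mandatory complex columns, as listed in the table above.
COMPLEX_COLUMNS = [
    "cdr3.alpha", "v.alpha", "j.alpha",
    "cdr3.beta", "v.beta", "j.beta",
    "species", "mhc.a", "mhc.b", "mhc.class",
    "antigen.epitope", "antigen.gene", "antigen.species",
    "reference.id", "submitter",
]


def check_chunk(text):
    """Return a list of problems found in a chunk (assumed tab-separated)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [c for c in COMPLEX_COLUMNS if c not in header]
    if missing:
        return ["missing columns: " + ",".join(missing)]
    problems = []
    for n, row in enumerate(reader, start=1):
        # At least one of the CDR3 alpha/beta fields must be filled.
        if not row["cdr3.alpha"] and not row["cdr3.beta"]:
            problems.append(f"row {n}: both CDR3 fields are blank")
        # STYLE note: no spaces inside segment lists (TRBV7,TRBV5).
        for col in ("v.alpha", "j.alpha", "v.beta", "j.beta"):
            if " " in (row[col] or ""):
                problems.append(f"row {n}: space in {col}")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the CI build will pass.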
Optional columns (i.e. it is not required to fill them, but they should be present in table header) that ensure correct confidence ranking of a given entry. Used to calculate a single confidence score based on various factors, e.g. fraction of a given TCRab sequence among tetramer+ clones sequenced and verification experiments performed.
column name | description |
---|---|
method.identification | tetramer-sort, dextramer-sort, pelimer-sort, pentamer-sort, etc for sorting-based identification. For molecular assays use: antigen-loaded-targets (if T-cell specificity was analysed against cells incubated with antigenic peptide), antigen-expressing-targets (if T-cell specificity was analysed against cells transformed with antigenic organism, protein or peptide, e.g. BCL transformed with EBV). For magnetic cell separation use beads keyword. Add cultured-T-cells or limiting-dilution-cloning if T cells were cultured before sequencing, as in this case method.frequency will have a completely different meaning. Use comma to separate phrases. For cases that use UMI-tagged multimers use tetramer-umi, etc. |
method.frequency | Frequency in isolated antigen-specific population, reported as X/X if possible, e.g. 7/30 if a given V/D/J/CDR3 is encountered in 7 out of 30 tetramer+ clones. Formats X% , X.X% and X.X are also supported. |
method.singlecell | yes if single cell sequencing was performed, blank otherwise |
method.sequencing | Sequencing method: sanger , rna-seq or amplicon-seq |
method.verification | tetramer-stain , dextramer-stain , pelimer-stain , pentamer-stain , etc for methods that include TCR cloning and re-staining with multimers. For magnetic cell separation use beads keyword. restimulation , co-culture , antigen-loaded-targets , antigen-expressing-targets for molecular assays that validate specificity of cloned T-cell receptors. direct in case the affinity of TCRs of specific T-cells to the pMHC is quantified directly in some way. Several comma-separated verification methods can be specified. |
Notes:
In case `method.identification` is left blank, the record is automatically assigned the lowest possible confidence score.
For special cases such as CD8-null tetramers that utilize HLA with mutated residues that abrogate CD8 binding, specify `cd8null-tetramer` in the `method.identification` field rather than using the `mhc.a` field.
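The `method.frequency` formats listed in the table (`X/X`, `X%`, `X.X%` and `X.X`) can be normalized to a single fraction. This is a hedged sketch rather than the parser used by the build scripts: `parse_frequency` is a hypothetical name, and treating a plain decimal such as `0.5` as a fraction (not a percentage) is an assumption.

```python
import re


def parse_frequency(value):
    """Convert a method.frequency string to a fraction, or None if blank/unparseable."""
    value = value.strip()
    if not value:
        return None
    # X/X, e.g. 7/30
    m = re.fullmatch(r"(\d+)/(\d+)", value)
    if m:
        num, den = int(m.group(1)), int(m.group(2))
        return num / den if den else None
    # X% or X.X%
    m = re.fullmatch(r"(\d+(?:\.\d+)?)%", value)
    if m:
        return float(m.group(1)) / 100
    # X.X (assumed to already be a fraction)
    if re.fullmatch(r"\d+(?:\.\d+)?", value):
        return float(value)
    return None
```

Returning `None` for anything unrecognized keeps spreadsheet damage (for example a frequency turned into a date) visible instead of silently coercing it.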
During the database build phase, the information from the columns mentioned above is collapsed to a JSON string and stored in a single `method` column, e.g.:
{
"identification":"tetramer-sort",
"frequency":"5/13",
"sequencing":"sanger",
"verification":"antigen-loaded-targets"
}
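The collapsing step can be sketched as follows. This is an illustration only: `collapse_columns` is a hypothetical helper, and dropping blank fields is an assumption suggested by the example above, which contains no `singlecell` key.

```python
import json


def collapse_columns(row, prefix):
    """Collect the non-blank `prefix.*` fields of a row into a compact JSON string."""
    collapsed = {
        key[len(prefix) + 1:]: value          # strip the "method." part
        for key, value in row.items()
        if key.startswith(prefix + ".") and value != ""
    }
    return json.dumps(collapsed, separators=(",", ":"))


row = {
    "cdr3.beta": "CASSLSRGGNQPQYF",
    "method.identification": "tetramer-sort",
    "method.frequency": "5/13",
    "method.singlecell": "",
    "method.sequencing": "sanger",
    "method.verification": "antigen-loaded-targets",
}
print(collapse_columns(row, "method"))
```

The same helper applies to any other column group that shares a prefix.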
column name | description |
---|---|
meta.study.id | Internal study id |
meta.cell.subset | T-cell subset, free style, e.g. CD8+ , CD4+CD25+ |
meta.subset.frequency | Frequency of a given TCR sequence in specified cell subset, e.g. 5% means that the TCR sequence represents an expanded clone occupying 5% of CD8+ cells |
meta.subject.cohort | Subject cohort, free style, e.g. healthy or HIV+ . If possible, specify to what extent a healthy donor is healthy, e.g. CMV-seronegative . |
meta.subject.id | Subject id (e.g. donor1 , donor2 ,...) |
meta.replica.id | Replicate sample coming from the same donor, also applies for different time points, etc (e.g. 5mo ) |
meta.clone.id | T-cell clone id |
meta.epitope.id | Epitope id (e.g. FL10 ) |
meta.tissue | Tissue used to isolate T-cells: PBMC , spleen , etc. or TCL (T-cell culture) if isolated from re-stimulated T-cells |
meta.donor.MHC | Donor MHC list if available, blank otherwise. IMGT nomenclature (e.g. HLA-A*02:01) is preferable. Allele group names (e.g. A02, B18) are also acceptable (don't use an asterisk in such cases). Use comma to separate alleles. |
meta.donor.MHC.method | Donor MHC typing method if available, blank otherwise |
meta.structure.id | PDB structure ID if exists, or blank. Records having a structural data associated with them will automatically get the highest confidence score. |
comment | Plain text comment, maximum 140 characters |
Note:
While these columns are optional, subject identifier, replica identifier, etc are used when scanning submission for duplicates. Normally duplicate records (with identical complex information columns) are not allowed, but they will not be considered as duplicates in case they have distinct id fields mentioned above.
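A rough sketch of such a duplicate scan is given below. It is not the check used by the build scripts: `find_duplicates` is a hypothetical name, and the exact set of id columns (`ID_COLUMNS`) is an assumption, since the note only names the subject and replica identifiers explicitly.

```python
from collections import Counter

# Complex columns that define the TCR:peptide:MHC complex itself.
COMPLEX_COLUMNS = [
    "cdr3.alpha", "v.alpha", "j.alpha", "cdr3.beta", "v.beta", "j.beta",
    "species", "mhc.a", "mhc.b", "mhc.class",
    "antigen.epitope", "antigen.gene", "antigen.species",
]
# Assumed id fields that make otherwise identical records distinct.
ID_COLUMNS = ["meta.subject.id", "meta.replica.id", "meta.clone.id"]


def find_duplicates(rows):
    """Return indices of rows sharing both complex and id columns with another row."""
    keys = [
        tuple(row.get(col, "") for col in COMPLEX_COLUMNS + ID_COLUMNS)
        for row in rows
    ]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] > 1]
```

Two rows with the same complex but different `meta.subject.id` values get different keys, so they are not reported.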
During the database build phase, the information from the columns mentioned above is collapsed to a JSON string and stored in a single `meta` column, e.g.:
{
"cell.subset":"CD8+",
"subject.cohort":"HSV-2+",
"subject.id":12,
"clone.id":46,
"tissue":"PBMC"
}
Condition metadata:
column name | description |
---|---|
condition.name | natural language terms like T1D , pollen allergy , BRCA or YF vaccination |
condition.id | ICD-11:5A10 for T1D in ICD-11 or OMIM:114480 for breast cancer in OMIM |
condition.type | infection , vaccination , cancer , allergy or autoimmune |
condition.subtype | natural language terms like acute or poor prognosis or grade II |
Association metadata:
column name | description |
---|---|
condition.freq | fraction of samples matching the entry |
condition.count | number of samples matching the entry (can be blank) |
population.freq | fraction of controls matching the entry, or Pgen computed by OLGA/IgOR |
population.count | number of controls matching the entry (can be blank) |
association.pvalue | Association P-value, e.g. enrichment P-value for Fisher's exact test |
association.test | Fisher , TCRNET , ALICE or another statistical method |
Peptide pools, long peptides for T-cell culture expansion, non-peptide ligands
column name | description |
---|---|
antigen.epitope.long | encompassing protein sequence containing the epitope |
antigen.peptide.pool | e.g. MIRA COVID19 TBD |
antigen.nonpeptide | α-GalCer or KRN7000 TBD |
Information for non alpha-beta T-cells, CAR-T, etc
column name | description |
---|---|
v.delta | ID of Variable segment in delta chain |
cdr3.delta | CDR3 of delta chain |
... | ... |
v.heavy.shm | CIGAR string of hypermutations in the heavy chain Variable segment |
... | ... |
At this stage, a series of checks is performed for the CDR3 sequence and the reported V/J segments:
Canonical (starting with `C` and ending with `F/W`) CDR3 sequences: checks if the 5' and 3' germline parts match the corresponding V/J segment sequences.
Sequences lacking the terminal `C/F/W` residues: the missing residues are added. More missing residues can be added in case a relatively large contiguous V/J germline match is present.
Sequences with excessive germline residues (e.g. `FGXG` instead of simply `F` at the CDR3 3' part): the excessive residues are removed.
The main reason behind this is that, while current immune repertoire sequencing (RepSeq) data processing software reports canonical clonotype sequences, a large number of antigen-specific TCR sequences present in the literature are reported inconsistently. The latter greatly complicates annotation of RepSeq data using known antigen-specific TCR sequences.
In case of good V/J germline matching and errors in the CDR3 sequence, the final CDR3 sequence in the database is replaced by its fixed version. The following report of the CDR3 fixer is placed under the `cdr3fix.alpha` and `cdr3fix.beta` columns, e.g.
{
"fixNeeded":true,
"good":false,
"cdr3":"CASSQDVGTGGVFALYF",
"cdr3_old":"CASSQDVGTGGVFALY",
"jFixType":"FixAdd",
"jId":"TRBJ1-6*01",
"jCanonical":true,
"jStart":14,
"vFixType":"FailedBadSegment",
"vId":null,
"vCanonical":true,
"vEnd":-1
}
and
{
"fixNeeded":true,
"good":true,
"cdr3":"CASSLSRGGNQPQYF",
"cdr3_old":"CASSLSRGGNQPQY",
"jFixType":"FixAdd",
"jId":"TRBJ1-5*01",
"jCanonical":true,
"jStart":9,
"vFixType":"NoFixNeeded",
"vId":"TRBV14*01",
"vCanonical":true,
"vEnd":4
}
Field descriptions:
field | description |
---|---|
fixNeeded | true if corrected CDR3 sequence differs from the original one, false otherwise |
good | true if the fix can be applied, false if the fix cannot be applied due to bad V/J entry or no V/J matching |
cdr3 | Fixed CDR3 sequence |
cdr3_old | Original CDR3 sequence |
jFixType | Type of fix applied to CDR3 J germline part |
jCanonical | true if CDR3 ends with F or W, false otherwise |
jId | J segment identifier |
jStart | A 0-based index of first CDR3 amino acid that belongs to J segment |
vFixType | Type of fix applied to CDR3 V germline part |
vCanonical | true if CDR3 starts with C, false otherwise |
vId | V segment identifier |
vEnd | A 0-based index of the last CDR3 amino acid of V segment plus one |
Note:
Possible V and J fix types: `NoFixNeeded`, `FixAdd`, `FixReplace`, `FixTrim`, `FailedReplace` (too many mismatches), `FailedBadSegment` (bad segment entry), `FailedNoAlignment` (no alignment at all).
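As an illustration of how the `vEnd` and `jStart` indices can be used, the sketch below splits a fixed CDR3 into its V germline part, the middle region and its J germline part. It assumes that both indices refer to the fixed `cdr3` sequence and that a negative value (as in the first example above) means the segment was not aligned; `split_cdr3` is a hypothetical helper name.

```python
import json


def split_cdr3(report_json):
    """Split the fixed CDR3 into (V part, middle part, J part) using a cdr3fix report."""
    report = json.loads(report_json)
    cdr3 = report["cdr3"]
    # vEnd is the index of the last V amino acid plus one; negative is treated as "no V part".
    v_end = max(report["vEnd"], 0)
    # jStart is the index of the first J amino acid; negative is treated as "no J part".
    j_start = report["jStart"] if report["jStart"] >= 0 else len(cdr3)
    j_start = max(j_start, v_end)  # guard against overlapping V and J parts
    return cdr3[:v_end], cdr3[v_end:j_start], cdr3[j_start:]


report = '{"cdr3":"CASSLSRGGNQPQYF","jStart":9,"vEnd":4}'
print(split_cdr3(report))  # ('CASS', 'LSRGG', 'NQPQYF')
```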
At the final stage of database processing, TCR:peptide:MHC complexes are assigned with confidence scores. Scores are computed according to reported method entries.
VDJdb scoring is performed by evaluating TCR sequence, identification and verification confidence based on the following criteria:
1. `method.sequencing` and `method.singlecell` (1-3 points)
   - ... (`method.frequency`) - 2 points, otherwise 1
   - ... `0.01` - 2 points, otherwise 0
2. `method.identification` (0-1 point)
   - ... `0.1` according to `method.frequency`)
   - ... `0.5`
   - ... `method.frequency` becomes somewhat ambiguous, check if it is higher than `0.5`
3. ... (`meta.structure.id` is not empty) or some other method that directly evaluates TCR:pMHC binding ... `1.` is set to 3

The final score is then calculated as the minimum of the score from part `1.` and the sum of the scores from part `2.` and part `3.`.
Maximal score is then selected among different records (independent submissions, replicas, etc) pointing to the same unique complex entry (i.e. set of unique complex fields).
score | description |
---|---|
0 | Low confidence/no information - a critical aspect of sequencing/specificity validation is missing |
1 | Moderate confidence - no verification / poor TCR sequence confidence |
2 | High confidence - has some specificity verification, good TCR sequence confidence |
3 | Very high confidence - has extensive verification or structural data |
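The way the partial scores are combined, as described above, can be sketched as follows. The per-part point values are taken as inputs because their assignment depends on the method fields; `final_score` and `best_scores` are hypothetical names, not functions from the build scripts.

```python
def final_score(sequencing, identification, verification):
    """Minimum of the part 1 score and the sum of the part 2 and part 3 scores."""
    return min(sequencing, identification + verification)


def best_scores(records):
    """Keep the maximal score among records pointing to the same unique complex.

    `records` is an iterable of (complex_key, score) pairs, where complex_key
    is any hashable value built from the unique complex fields.
    """
    best = {}
    for complex_key, score in records:
        best[complex_key] = max(score, best.get(complex_key, 0))
    return best
```

For example, a record with 3 points for sequencing, 1 for identification and 0 for verification ends up with `min(3, 1 + 0) = 1`, which corresponds to "moderate confidence" in the table above.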
The final database assembly can be found in the `database/` folder upon execution of the `BuildDatabase.groovy` script:
`vdjdb_full.txt` - combined chunks with TCR alpha/beta records, antigen information, etc. All method and meta information is collapsed into two columns with corresponding names. VDJdb scores and CDR3 fixing information for TCR alpha and beta are given in separate columns. This is the raw version of VDJdb.
`vdjdb.txt` - a collapsed version of the database used for annotation of single-chain TCR sequencing data by VDJdb-standalone software. Each line corresponds to either a TCR alpha or a TCR beta record, as specified by the `gene` column. TCR records coming from the same alpha-beta pair have the same index in the `complex.id` column. If `complex.id` is equal to `0`, the record lacks either TCR alpha or TCR beta chain information. This table is used by VDJdb-standalone and VDJdb-server.
`vdjdb.meta.txt` - metadata for the `vdjdb.txt` table, used by VDJdb-standalone and VDJdb-server.
`vdjdb.slim.txt` - a slim database used for annotation of single-chain TCR sequencing data by VDJdb-standalone software. This is a collapsed version of `vdjdb.txt` containing unique records for each CDR3:antigen pair and comma-separated lists of values for other columns (`*.segm`, `mhc.*`, `complex.id` and `reference.id`). This table can be easily parsed with R and Python/Pandas; it is intended for end users exploring VDJdb.
`vdjdb.slim.meta.txt` - metadata for the `vdjdb.slim.txt` table.
`motif_pwms.txt` and `cluster_members.txt` - position-weight matrices for antigen-specific TCR motifs and representative sets of TCR sequences that constitute them. These tables are computed separately using code from the vdjdb-motifs repository.
Note that some statistics can be generated by running the R markdown templates in the `summary/` folder.
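A minimal sketch of reading the slim table with the Python standard library is shown below. It assumes the file is tab-separated and that the comma-separated list columns are named `v.segm`, `j.segm`, `mhc.a`, `mhc.b`, `complex.id` and `reference.id`; the exact column names should be checked against the header of your copy of `vdjdb.slim.txt`. `read_slim` is a hypothetical helper.

```python
import csv

# Assumed names of the columns that hold comma-separated lists.
LIST_COLUMNS = ("v.segm", "j.segm", "mhc.a", "mhc.b", "complex.id", "reference.id")


def read_slim(handle):
    """Read a slim table from an open text handle, splitting list columns."""
    rows = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for col in LIST_COLUMNS:
            value = row.get(col)
            if value is not None:
                row[col] = value.split(",") if value else []
        rows.append(row)
    return rows
```

With a local build this could be called as `read_slim(open("database/vdjdb.slim.txt", encoding="utf-8"))`.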
First make sure that you clone both the vdjdb-db repo and the vdjdb-motifs repo to the same folder, say `~/vcs`.
Then navigate to `vdjdb-db` and run `bash release.sh`. You can then find the output in the `~/vcs/vdjdb-db/database`, `~/vcs/vdjdb-db/summary` and `~/vcs/vdjdb-motifs` folders. Note that you have to check the `.Rmd` files that will be executed and manually install missing R packages, as well as get the VDJtools binary and place it in the path specified in `~/vcs/vdjdb-motifs/compute_vdjdb_motifs.Rmd`.
The repository contains a `Dockerfile` to simplify the database building process. The `Dockerfile` sets up the environment needed to build the database.
If you have Docker Desktop installed and running on your machine, use the following command to build a local Docker image:
docker build -t vdjdbdb .
NOTE You may need `sudo` to run docker.
To build the database using the newly created local Docker image, create a folder (e.g. `/tmp/output`) and mount it as an external volume when running the image. The image always puts the result in the `/root/output` folder within the Docker container.
NOTE: The host path, e.g. `/tmp/output`, should be absolute.
NOTE: The database building process requires at least 64GB of RAM.
mkdir -p /tmp/output
docker run -v /tmp/output:/root/output vdjdbdb
Pre-built images can be found at DockerHub. N.B. replace `vdjdb` with `mikessh/vdjdb:legacy` if running this image.