airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

How to represent PCR_target_locus from the 10X Genomics pipeline #417

Closed lcorrie closed 3 years ago

lcorrie commented 4 years ago

I am trying to represent T cell data that has gone through the 10X/Cell Ranger pipeline in an AIRR-compliant format. Their pipeline uses PCR amplification, so I want to assign a value to pcr_target_locus. The repertoires contain both TRA and TRB data, so I was wondering whether there is a succinct way to show in the pcr_target_locus field that the repertoire contains both. I was thinking something like "TRA, TRB" might work since it follows the controlled vocabulary, but the examples suggest that there should only be one...

bussec commented 4 years ago

It is correct that pcr_target_locus is a property of PCRTarget and as such can only contain a single string from the controlled vocabulary of loci. However, pcr_target (without the _locus) is a property of the NucleicAcidProcessing and contains an array of PCRTarget records. Therefore multiple loci can be annotated this way.
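As a rough illustration of the structure described above, the sketch below builds a pcr_target array holding one PCRTarget record per locus. The helper name make_pcr_targets and the sample_processing_id value are hypothetical, and the locus vocabulary is written from memory of the AIRR schema, so it should be checked against the schema version in use. The primer-location fields of PCRTarget are omitted here.

```python
# Controlled vocabulary for pcr_target_locus (assumption: verify against
# the AIRR schema version you are targeting).
LOCI = {"IGH", "IGI", "IGK", "IGL", "TRA", "TRB", "TRD", "TRG"}


def make_pcr_targets(loci):
    """Return a list of PCRTarget-like dicts, one per locus (hypothetical helper)."""
    targets = []
    for locus in loci:
        # Each PCRTarget holds exactly one vocabulary term, so a combined
        # string such as "TRA, TRB" is rejected.
        if locus not in LOCI:
            raise ValueError(f"not a single controlled-vocabulary locus: {locus!r}")
        targets.append({"pcr_target_locus": locus})
    return targets


# Hypothetical nucleic-acid-processing fragment with two PCR targets.
sample_processing = {
    "sample_processing_id": "example-1",
    "pcr_target": make_pcr_targets(["TRA", "TRB"]),
}
```

With this shape, a combined value such as "TRA, TRB" fails the vocabulary check, while two separate records pass.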

Having said this, you need to check whether the experimental protocol really performs a targeted amplification in the PCR step, or whether the substrate specificity is conferred by the RT step and the PCR only operates on generic linker sequences. In the latter case there is no pcr_target_locus (unless we change our definition)...

lcorrie commented 4 years ago

Thanks so much for your speedy answer!

From what I understand from the 10X Genomics protocol, they reverse transcribe and amplify all the full length cDNA and then use targeted PCR to enrich for either TCR or Ig sequences depending on which one you want. If I am correct in this assessment then there is a pcr_target_locus field according to your definition and the best approach may be to split the joint TCR data files into separate TRA and TRB repertoires so that they can correctly follow the controlled vocabulary.

I got the information about the 10X Genomics protocol here, mostly from page 17. If you wouldn't mind, @bussec, would you be able to take a quick look and see if you agree with how I have interpreted their protocol? It would be very useful for this paper and future metadata curation to have a consistent way of presenting single cell data like this!

bcorrie commented 4 years ago

@wyattmcdonnell do you want to comment on this as well? Are we getting this correct? We are working on curating the TCR data from the Liao COVID-19 paper in our repository.

wyattmcdonnell commented 4 years ago

Hi @bcorrie, @lcorrie, @bussec--happy to clarify. The assay configuration for the 10x VDJ assay can be found here (scroll all the way to the bottom). The enrichment outer and inner (reverse) primers target the TCR or BCR constant regions, and the forward primer anneals to p5/R1. As for splitting the repertoires... currently public-facing Cell Ranger should make this pretty straightforward. If you wanted to retain single cell-specific repertoires the formatting would be a bit different and I can help you use enclone (available here) to make that happen. This wouldn't be in AIRR format but it would be formatted a bit more sensibly than our current pipeline outputs.

bcorrie commented 4 years ago

@bussec it looks like we will then have a pcr_target_locus, one for each of the alpha and beta chains...

Make sense?

bussec commented 4 years ago

Yes. Also, as this is a nested PCR, you should use the inner primers for annotation. We also should probably put this in the Metadata annotation guidelines.

bcorrie commented 3 years ago

@bussec do you want to make the above change to the metadata annotation guidelines? Otherwise I think we can close this issue.

bcorrie commented 3 years ago

@bussec do you want to make the above changes? I won't use the right language. Would like to close this issue.

bussec commented 3 years ago

@bcorrie Just created a PR for this. The guideline does not mention whether data sets should be split or not; this is something I would need input on from your side.

bcorrie commented 3 years ago

> @bcorrie Just created a PR for this. The guideline does not mention whether data sets should be split or not; this is something I would need input on from your side.

I am not sure the guidelines need to comment on this. In our case, we prefer to separate the TRA and TRB data so that it is possible to split the data out more easily, but one could easily have a repertoire with two pcr_target_locus values and have an associated set of rearrangements with mixed TRA and TRB data. I think both are valid according to the spec.

From our internal curation process, we prefer to split these out, but that is a "lab based" decision and I don't know if there is a right or a wrong here.
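To make the two options concrete, here is a minimal sketch of splitting a mixed set of rearrangements by locus. It assumes each Rearrangement record is a plain dict carrying the `locus` field; the function name split_by_locus and the sample rows are hypothetical, and real records carry many more fields.

```python
from collections import defaultdict


def split_by_locus(rearrangements):
    """Group Rearrangement-like dicts by their `locus` value (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rearrangements:
        groups[row["locus"]].append(row)
    return dict(groups)


# Hypothetical rows standing in for a mixed TRA/TRB rearrangement file.
rows = [
    {"sequence_id": "s1", "locus": "TRA", "cell_id": "c1"},
    {"sequence_id": "s2", "locus": "TRB", "cell_id": "c1"},
    {"sequence_id": "s3", "locus": "TRB", "cell_id": "c2"},
]

by_locus = split_by_locus(rows)
```

Keeping the rows together and filtering on `locus` at query time gives the same per-locus view without creating separate repertoires.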

schristley commented 3 years ago

> > @bcorrie Just created a PR for this. The guideline does not mention whether data sets should be split or not; this is something I would need input on from your side.
>
> I am not sure the guidelines need to comment on this. In our case, we prefer to separate the TRA and TRB data so that it is possible to split the data out more easily, but one could easily have a repertoire with two pcr_target_locus values and have an associated set of rearrangements with mixed TRA and TRB data. I think both are valid according to the spec.
>
> From our internal curation process, we prefer to split these out, but that is a "lab based" decision and I don't know if there is a right or a wrong here.

By split, I assume you mean putting chains from the same cell into different repertoires? I don't think that should be done. The Cell object can only point to a single repertoire_id so all the chains should be in the same repertoire. Likewise, the cell_id in Rearrangement doesn't have a well-defined uniqueness scope, so can you accurately use it to connect the TRA in one repertoire with the TRB in another repertoire?

It seems possible to split IG from TR loci though as presumably each single cell only has one (though there is evidence of weird biology where both are present).
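The pairing concern can be sketched as follows, assuming Rearrangement records are plain dicts with `repertoire_id`, `cell_id`, and `locus` fields. The function name loci_per_cell and all identifiers in the sample data are hypothetical. Because the uniqueness scope of cell_id is not well defined, the sketch keys cells on the (repertoire_id, cell_id) pair rather than on cell_id alone.

```python
from collections import defaultdict


def loci_per_cell(rearrangements):
    """Map (repertoire_id, cell_id) to the set of loci seen for that cell."""
    cells = defaultdict(set)
    for row in rearrangements:
        key = (row["repertoire_id"], row["cell_id"])
        cells[key].add(row["locus"])
    return dict(cells)


# Both chains of one cell stored in the same repertoire.
same_repertoire = [
    {"repertoire_id": "r1", "cell_id": "c1", "locus": "TRA"},
    {"repertoire_id": "r1", "cell_id": "c1", "locus": "TRB"},
]

# The same chains split across two repertoires.
split_repertoires = [
    {"repertoire_id": "r1", "cell_id": "c1", "locus": "TRA"},
    {"repertoire_id": "r2", "cell_id": "c1", "locus": "TRB"},
]

paired = loci_per_cell(same_repertoire)
unpaired = loci_per_cell(split_repertoires)
```

In the first case the cell resolves to one entry holding both loci; in the second, the two chains land under different keys and can only be reconnected by assuming that cell_id is unique across repertoires.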

bcorrie commented 3 years ago

That is a good point... I think in our case, for single cell data this would still be split, but at the SampleProcessing level. Same Repertoire (and repertoire_id) but different SampleProcessing (and therefore different sample_processing_id)...
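A minimal sketch of that layout, assuming a Repertoire is a dict whose `sample` array holds one SampleProcessing-like record per locus. All identifier values and the helper name loci_for are hypothetical, and most required Repertoire fields are left out for brevity.

```python
# One Repertoire, two SampleProcessing-like entries, one PCR target each.
repertoire = {
    "repertoire_id": "rep-1",  # hypothetical identifier
    "sample": [
        {
            "sample_processing_id": "sp-TRA",
            "pcr_target": [{"pcr_target_locus": "TRA"}],
        },
        {
            "sample_processing_id": "sp-TRB",
            "pcr_target": [{"pcr_target_locus": "TRB"}],
        },
    ],
}


def loci_for(repertoire, sample_processing_id):
    """Return the PCR target loci recorded for one sample_processing_id."""
    for sample in repertoire["sample"]:
        if sample["sample_processing_id"] == sample_processing_id:
            return [t["pcr_target_locus"] for t in sample["pcr_target"]]
    return []
```

A rearrangement tagged with repertoire_id "rep-1" and sample_processing_id "sp-TRB" would then resolve to the TRB target while staying in the same repertoire as its TRA partner.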

bcorrie commented 3 years ago

Actually @schristley, @bussec, @javh would the above be correct? If you had a 10X single cell study with both Cell data (cellranger count) and Rearrangement data (cellranger vdj), would you typically have different SampleProcessing objects for the two processes in a single Repertoire? I assume so, but has anyone created a concrete representation of that for a real study? We have done the cellranger vdj pipeline, but not the gene expression side...

schristley commented 3 years ago

> Actually @schristley, @bussec, @javh would the above be correct? If you had a 10X single cell study with both Cell data (cellranger count) and Rearrangement data (cellranger vdj), would you typically have different SampleProcessing objects for the two processes in a single Repertoire?

No, I don't think so. The pcr_target_locus is for the VDJ protocol, not for gene expression (RNA-seq), which generally does not target any specific loci. Likewise, the SampleProcessing object is designed for VDJ experimental protocols. For RNA-seq, there is already a separate standard, MINSEQE.

bcorrie commented 3 years ago

Created #485 to capture discussion around Repertoire metadata for Cells...

bussec commented 3 years ago

@bcorrie Can we accept #473 and close this ticket?