airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

MiAIRR target_substrate clarification (field has been renamed to "template_class") #28

Closed bcorrie closed 6 years ago

bcorrie commented 7 years ago

Again, as a result of our data curators diving into the MiAIRR standard in more detail, another clarification is requested 8-) Apologies if this is confusing given I am not an expert. Nishanth, please jump in if you can make my incoherent ramblings more succinct!!!

The questions I have to pass along concern:

3 / process (nucl. acid) Target substrate String Controlled vocabulary (DNA|RNA)

One of the drivers for the question is that the field is defined as a controlled vocabulary with only two options in the example, DNA or RNA. This implies that there are two, and only two, possibilities in the controlled vocabulary. Is that the case? This causes some confusion on our end, in particular because our curators tend to gather the "DNA Type" from papers, which is whether the study uses gDNA or cDNA. Given that cDNA and gDNA are not possible values for target_substrate and there is no place to capture that information, they have requested some clarification around this...

More specific clarification questions are:

"In what context is a 'target_substrate' a target? Is it the target for sequencing (e.g. cDNA being sequenced) or for calculating viral load?" and a follow-on comment "Is 'target_substrate' meant only for sequencing, or for other processing as well? For example, for sequencing, cDNA or gDNA makes more sense, while for calculating viral load, RNA would be the target_substrate. And for RNA-seq it is DNA that is sequenced even though the source for the cDNA is RNA. So should RNA be the target substrate for that, or DNA? Should the target substrate be defined more specifically as what is actually used to prep the assay-specific sample (in this case, the assay-specific sample being the cDNA library prepared using Illumina adapters for sequencing)?"

Comments?

lgcowell commented 7 years ago

I think originally gDNA and cDNA were the choices, and this must have gotten changed somewhere along the way. I suppose DNA maps to gDNA and RNA maps to cDNA, but gDNA and cDNA are probably more accurate/specific.

On the other question, for MiAIRR, wouldn't we only be considering AIRR-seq and not things like viral load calculation?

bussec commented 7 years ago

I think our original reasoning (gDNA = DNA, cDNA = RNA) should be revisited in the light of a number of recent publications (e.g. by @mstubb, @chudakovdm and PC Wilson), which extract TCR or Ig sequences from whole RNA-seq data. While technically there is of course a reverse-transcription step (i.e. cDNA generation) somewhere in the process, it is not necessarily associated with an actual PCR amplification. Therefore I would propose to extend the vocabulary to: gDNA, cDNA, RNA. Happy to get further input on this from @scharch and @mikessh.

Otherwise I agree with @lgcowell: MiAIRR describes only the template used for AIRR-seq. If a user determines other parameters (like viral load) from the same sample, this would be a property of either the sample or the clinical history.

mikessh commented 7 years ago

Well, this is the strategy we currently use to store sample metadata: this all goes into the "technology/method" field, so we have sanger, amplicon-seq (further clarified as either 5'/3'RACE or multiplex PCR), single-cell and rna-seq.

Note that this information is critical. For example, different V/J usage patterns are expected for RACE and multiplex PCR; RNA-seq can also result in biases in the final CDR3 length distribution (spectratype), as we have to assemble contigs in this case, especially for short reads.

P.S. I think that the RNA/DNA field is more relevant to the possibility of expression biases and the number of noncoding sequences one would expect in a sample.

chudakovdm commented 7 years ago

I would add that, furthermore, RNA-seq can be 5'RACE-based (e.g. SMARTer kit) or random. And yes, sequencing length is critical for RNA-seq. Still, the main division with respect to starting material is RNA versus DNA, since the former depends on expression, not much for TCRs but dramatically for B cells. RNA- and DNA-based IG profiling are two different worlds. Also, the amount available, sample types and extraction procedures all depend on RNA vs DNA. Amount of starting material again means different things for RNA vs DNA. I would start from sample type (PBMC etc., tissues, sorted cells and which and how, single cells, paraffin blocks, etc.). Next, just the RNA vs DNA division, quantity (and maybe quality). Next, method of library prep (targeted 5'RACE or multiplex amplicons, with or without UMI / RNA-seq, and which RNA-seq, and what about ribosomal RNA depletion / exome-seq / etc.). Next goes method of library sequencing. Best, Dmitriy

bussec commented 6 years ago

Note that the field/key under discussion has been renamed to template_class in commit 0e34ecddce9ee6d36f2c69b408af2ab42e8b7b34.

bussec commented 6 years ago

We will have a fairly exhaustive list of terms for the library_generation_method field. Therefore template_class should only describe the original material extracted from the cell. The distinction between cDNA and RNA will be resolved in the library_generation_* fields.
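
For curators who collect a "DNA Type" of gDNA or cDNA from papers, the resolution above could be applied roughly as in the following Python sketch. It assumes the gDNA = DNA, cDNA = RNA mapping mentioned earlier in this thread and the two-value DNA|RNA vocabulary. The function and constant names are hypothetical and are not part of the AIRR standard or any AIRR library.

```python
# Hypothetical helpers: only the field name template_class, the DNA|RNA
# vocabulary, and the gDNA/cDNA mapping are taken from this thread.
TEMPLATE_CLASS_VOCAB = {"DNA", "RNA"}

# template_class describes the material originally extracted from the cell,
# so cDNA (reverse-transcribed from RNA) maps to RNA and gDNA maps to DNA.
CURATED_TO_TEMPLATE_CLASS = {
    "gDNA": "DNA",
    "cDNA": "RNA",
    "DNA": "DNA",
    "RNA": "RNA",
}


def to_template_class(curated_value: str) -> str:
    """Map a curated "DNA Type" value to a template_class value."""
    value = curated_value.strip()
    if value not in CURATED_TO_TEMPLATE_CLASS:
        raise ValueError(f"cannot map {curated_value!r} to template_class")
    return CURATED_TO_TEMPLATE_CLASS[value]


def is_valid_template_class(value: str) -> bool:
    """Check a value against the two-term controlled vocabulary."""
    return value in TEMPLATE_CLASS_VOCAB
```

With this sketch, `to_template_class("cDNA")` gives `"RNA"`, while `is_valid_template_class("cDNA")` is false. How the cDNA was produced would be recorded in the library_generation_* fields instead.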

Closing this issue now. Please re-open in case you are not happy with this solution.