airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Demultiplexing of data with internal barcodes #98

Closed. Opened by bussec; closed 1 year ago

bussec commented 6 years ago

If I remember correctly, we agreed that experimental protocols using

As this data would be submitted as data set 4 (raw reads), I assume that the reads are not modified (i.e. the barcodes will not be removed). Is this correct?

schristley commented 6 years ago

When demultiplexing with pRESTO or VDJPipe, the barcodes are commonly removed during the process, though the tools have the option to leave them in.

bussec commented 6 years ago

OK, then I will add a note in the documentation saying that all multiplexing with known barcodes is considered a negligible source of assignment error [1] and that these barcodes therefore SHOULD be removed from the raw reads, but CAN be reported as separate read files. And we will of course remove the barcodes from our reads...

[1] "Barcode spreading" and similar artifacts have been shown to arise in the wet lab

javh commented 6 years ago

I wasn't part of the original discussion, so I'm not sure what minimal standards decided. Dunno if this helps, but...

My understanding of what SRA requires is that they want the data deposited to be as "raw" as possible, with any data processing documented in the "Experiment Pipeline" section of the SRA experiment.

For example, if demultiplexed by the core using Illumina's software, report that in the pipeline, like so: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1383470

(Click Show Experiment Pipeline, which just lists Casava 1.8.)

So, I think if you have multiple biological samples in a single sequencing run, you'll need to demultiplex them somehow and document the steps so that a single BioSample doesn't contain multiple biologically distinct samples. But, you can put as many experiments/runs into a BioSample as you want, so my preference would be to only submit demultiplexed files as necessary to divide BioSamples.

However, IIRC, SRA recommends treating all mice of the same strain and treatment as the same BioSample but separate experiments, even if they are actually separate individual mice.

Example: https://www.ncbi.nlm.nih.gov/sra?term=SAMN03653678

bussec commented 6 years ago

Following a discussion regarding physical data representation, @laserson, @schristley and I touched on this topic again. To recap, the general assumptions/procedures are:

  1. An individual run (FASTQ file/SRR record) is derived from a single subject + sample + library
  2. Runs that do not fulfill these criteria must be demultiplexed before submission
  3. Barcodes are removed upon demultiplexing but CAN be reported as separate runs
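
As a rough illustration of points 2 and 3, here is a minimal Python sketch. It is not how pRESTO or VDJPipe work internally. It assumes a fixed-length barcode at the 5' end of each read and exact matching; the barcode sequences, the sample names and the `demultiplex` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample map; real barcodes come from the protocol.
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}
BARCODE_LEN = 4  # assumed fixed length, barcode at the 5' end of the read


def demultiplex(reads, barcodes=BARCODES, barcode_len=BARCODE_LEN):
    """Split (name, seq, qual) reads by an exact-match 5' barcode.

    For every sample, the trimmed reads and the removed barcodes are kept
    in separate lists, so the barcodes can be reported as their own file.
    Reads with an unknown barcode are kept unmodified under "undetermined".
    """
    out = defaultdict(lambda: {"reads": [], "barcodes": []})
    for name, seq, qual in reads:
        bc = seq[:barcode_len]
        sample = barcodes.get(bc)
        if sample is None:
            out["undetermined"]["reads"].append((name, seq, qual))
            continue
        # Trim the barcode from both the sequence and its quality string.
        out[sample]["reads"].append((name, seq[barcode_len:], qual[barcode_len:]))
        out[sample]["barcodes"].append((name, bc, qual[:barcode_len]))
    return dict(out)
```

Each sample's `reads` list would then become one run, and the matching `barcodes` list could be written out as the optional separate barcode file.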

DataRep will currently not attempt to represent data with higher degrees of complexity.

While the demultiplexing will modify the data, we consider this not to be a likely source of error as long as the multiplexing barcodes are physically attached to the reported amplicons. While this procedure will be borderline for experiments that require "deep" demultiplexing (especially true for single-cell applications), it should still be feasible.

However, we have to keep an eye on this issue, as techniques like CITE-seq and index sorting are starting to be used for multiplexing and can classify reads based on external data.

schristley commented 6 years ago

@bussec, to clear up some confusion in my mind, does "internal barcode" include the multiplexing of different samples in the sequencing run, or does it only refer to situations like single-cell and UMI?

laserson commented 6 years ago

Also note that the rearrangement format does include a place to annotate a cell index, which would correspond to such a barcode.

bussec commented 6 years ago

@schristley, "internal barcodes" refers to BCs that are not resolved by the sequencer (in contrast to the "external" BCs that are typically ligated to the DNA during library prep and are used for the first layer of multiplexing). The most frequent use case for internal BCs would be cell IDs and UMIs, but it's mainly the cell IDs that pose a problem, as UMIs are primarily used for quantification and error-correction and not for multiplexing.
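
To make the distinction concrete, here is a minimal Python sketch of how a read carrying internal barcodes might be taken apart. The layout (cell ID first, then UMI, then the insert), the lengths and all function names are assumptions made for illustration only and do not correspond to any specific protocol.

```python
from collections import defaultdict

# Hypothetical read layout: [cell ID][UMI][insert]; lengths are assumptions.
CELL_LEN = 8
UMI_LEN = 4


def split_internal_barcodes(seq, cell_len=CELL_LEN, umi_len=UMI_LEN):
    """Return (cell_id, umi, insert) for one read sequence."""
    cell_id = seq[:cell_len]
    umi = seq[cell_len:cell_len + umi_len]
    insert = seq[cell_len + umi_len:]
    return cell_id, umi, insert


def group_by_cell(seqs):
    """Group inserts by cell ID, then by UMI within each cell.

    The cell ID acts as the multiplexing key; the UMI only separates
    molecules within a cell (for counting and error correction).
    """
    cells = defaultdict(lambda: defaultdict(list))
    for seq in seqs:
        cell_id, umi, insert = split_internal_barcodes(seq)
        cells[cell_id][umi].append(insert)
    return cells
```

In this sketch only the cell ID decides which group a read lands in, which is why it is the cell IDs rather than the UMIs that matter for the demultiplexing question.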

scharch commented 1 year ago

closed as stale