benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
460 stars 141 forks source link

Evaluation on antigen receptor data #755

Closed rspreafico-vir closed 4 years ago

rspreafico-vir commented 5 years ago

Hi there,

congrats on this amazing tool! I was wondering whether you have had the chance of (or planning to) evaluating DADA2 on T-cell/B-cell receptor data. Antigen receptor amplicons are conceptually very much like 16S amplicons: one amplicon with much diversity into it. PacBio sequencing of antigen receptors is very appealing, but plagued by error rates. Since DADA2 now supports PacBio reads, I was wondering whether you had the chance to test TCR/BCR PacBio data (I don't have any to provide). This would expand the scope of DADA2 substantially to an entirely new and hot field.

Thanks

benjjneb commented 5 years ago

I'm definitely interested, but no at this time we don't have any concrete evaluation on antigen receptor data in progress or planned. Do you know of any publicly available relevant datsets? Or even a decent recent paper that gives an idea of what the current state of antigen receptor sequencing is?

rspreafico-vir commented 5 years ago

There isn't much AFAIK, part of the reason being that long-read error rates lead to poor trust in data. So if a computational method substantially removes or even eliminates true technical noise, there would be greater adoption.

This seems a good pre-print with Nanopore long-read sequencing of antigen receptors: https://www.biorxiv.org/content/10.1101/424945v1.full - it is done in single cells and there are also Illumina reads, thereby providing a "ground truth" thanks to Illumina-based error correction, as well as inter-Nanopore read error correction. One could extract the Nanopore data in bulk and see whether upon computational correction this approaches the ground truth.

NEB has an abstract on looking at ONT antigen-receptor data: http://cancerimmunolres.aacrjournals.org/content/7/2_Supplement/B046

There is a conference poster from PacBio as well: https://www.pacb.com/wp-content/uploads/SO-AGBT-2017-T-cell-Receptor-Profiling-Using-PacBio-Sequencing-of-SMARTer-Libraries.pdf

benjjneb commented 5 years ago

If anyone has some suggestions on good and recent antigen receptor sequencing papers using short-read techs, we are also interested in that. What tools are currently used to process that sort of data right now? Especially with respect to separating PCR/sequencing errors from biological variation.

rspreafico-vir commented 5 years ago

In the short-read world, often the amplicon is truly short - less than 50 nt. That is because with < 50 nt you can capture the full CDR3, which is the region of most variation in the TCR. Something like V4 for 16S. So with such a short amplicon, typically the assumption is no sequencing errors. If you want to get more insight, I would reach out to Adaptive Biotech. They are super responsive and their ImmunoSEQ assay is by far the leader in the TCR/BCR antigen receptor sequencing with short reads.

rspreafico-vir commented 4 years ago

Here you can find 2x250 MiSeq data: https://presto.readthedocs.io/en/version-0.5.11/workflows/Greiff2014_Workflow.html. I am planning to try dada2 on data similar to this dataset as I am not truly satisfied with TCR/BCR-specific packages. The only question I might have is whether the sequence variation introduced by genuine biological processes such as BCR hypermutation won't be erased by dada2, being mistakenly taken as errors.

benjjneb commented 4 years ago

The only question I might have is whether the sequence variation introduced by genuine biological processes such as BCR hypermutation won't be erased by dada2, being mistakenly taken as errors.

In principle, that shouldn't happen. In practice there is an effective sensitivity limit to detecting variants by dada2, e.g. must be present in at least two error-free reads, that will lead to very low abundance stuff not always being detected.

benjjneb commented 4 years ago

I ran through the example 2x250 MiSeq dataset suggested above and available at https://presto.readthedocs.io/en/latest/workflows/Greiff2014_Workflow.html

Here is the code, with basic results commented:

# https://presto.readthedocs.io/en/latest/workflows/Greiff2014_Workflow.html

library(dada2); packageVersion("dada2")
setwd("~/Desktop/antigen")
system("~/sratoolkit.2.8.1-2-mac64/bin/fastq-dump --split-files -X 25000 ERR346600")

fnF <- "ERR346600_1.fastq"
fnR <- "ERR346600_2.fastq"
plotQualityProfile(c(fnF, fnR))
# Read 1 is the lower quality, seems like R1/R2 weere swapped

filtF <- file.path("filtered", fnF)
filtR <- file.path("filtered", fnR)
filterAndTrim(fnF, filtF, fnR, filtR, 
              trimLeft=c(24, 24), # Remove NNNN + 20nt primers. There is 1/19 22nt forward primers though
              truncLen=c(215, 246), maxEE=2, verbose=TRUE)
# Kept ~90% of the reads

drpF <- derepFastq(filtF, verbose=TRUE) 
# Encountered 9480 unique sequences from 22405 total sequences read.
drpR <- derepFastq(filtR, verbose=TRUE) 
# Encountered 8708 unique sequences from 22405 total sequences read.
# Plenty of replicated sequences, looks good for running DADA2

ddF <- dada(drpF, selfConsist=TRUE, multi=TRUE)
ddR <- dada(drpR, selfConsist=TRUE, multi=TRUE)
plotErrors(ddF, nominalQ=TRUE)
plotErrors(ddR, nominalQ=TRUE)
# Error profiles look reasonable.

mm <- mergePairs(ddF, drpF, ddR, drpR, verbose=TRUE)
# 20827 paired-reads (in 581 unique pairings) successfully merged out of  (in 798 pairings) input.
# 95.5% of reads successfully merged, solid

bim <- isBimeraDenovo(mm, verbose=TRUE) # Identified 38 bimeras out of 581 input sequences.
sum(getUniques(mm)[bim]) # 277
# Very few chimeras

unq <- getUniques(mm)[!bim]
sq <- getSequences(unq)
table(nchar(sq))
# Decent looking length distro, focused in the 332-268 length range

dada2:::pfasta(sq[1:10])
# https://blast.ncbi.nlm.nih.gov/Blast.cgi against nt
# Yes, these look like Mus musculus VDJ sequences

I only considered basic diagnostics on whether the results look reasonable, and I didn't do any real validation of the biological validity of the results, since I know very little about what to expect from VDJ sequencing.

But in short, this all looks good from the DADA2 side. Given these results, I don't see why you couldn't use DADA2 on this kind of data successfully. It seems to fit nicely into the type of data DADA2 was designed for, amplicon data w/ lots of diversity in it that is not known a priori.

rspreafico-vir commented 4 years ago

Thanks @benjjneb! I will take a look as soon as I have some time. It does sound promising!

benjjneb commented 4 years ago

Closing this as it appears to be working already. Happy to hear more feedback on the use of DADA2 with antigen data though. It seems like an interesting and useful application that hasn't been thoroughly explored yet.