bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
983 stars 355 forks source link

Feature Request: TruSeq custom amplicon workflow #1837

Closed rjsicko closed 4 years ago

rjsicko commented 7 years ago

I want to use bcbio to process TruSeq Custom Amplicon (TSCA) data. However, I notice the performance of bcbio's variant2 workflow is not great (comparison to MiSeq reporter workflow below).

I compared bcbio calls for NA12878 to the MSR workflow and found it performs better than bcbio variant2 workflow (untrimmed or primer trimmed reads**, bwa aligned, ensemble calling with freebayes/gatk-haplotype/samtools).

I used the MSR bam files as input into a couple variant callers and get results close to MSR workflow. MSR workflow keeps primer sequences and aligns against target amplicon sequences and I believe soft-clips the primer sequences prior to variant calling.

** - I used scripts discussed in this paper to trim primers identified in the manifest file from the fastq files. Trimming primer sequences helps somewhat, but as implemented it is very slow.

Comparison image

Also discussed here.

ohofmann commented 7 years ago

I'm interested in this as well as we will have to run some amplicon samples soon, though I was looking at AstraZeneca's VarDict in Amplicon mode (https://github.com/AstraZeneca-NGS/VarDict/wiki/Amplicon-Mode-in-VarDict). Did you limit variant calls to the regions of the amplicons or call across the whole genome? The F-measure for bcbio/FreeBayes with trimmed reads is pretty much identical to the MSR workflow, but the drop in precision makes me wonder if this was for global calls or just calls in the Amplicon sequences.

rjsicko commented 7 years ago

I provided the target regions bed to bcbio so calls are limited to those regions. For the comparison numbers (rtg vcfeval) I used bcbios "callable bed" which is the intersection of the targets bed and calculated callable regions based on coverage. Yes trimming gets it close for this sample, but I think the extra FN is causing the precision drop. Trimming as I'm currently doing it (scripts from paper above) is also slow and outside of bcbio.

I too was interested in Vardict in amplicon mode but creation of the necessary bed file from the manifest (provides primer sequences not locations) was an issue. I did try VarDict within bcbio and using the MSR bam file but results were not good (all default parameters)

chapmanb commented 7 years ago

Bob; Thanks for the discussion on improving amplicon and panel sequencing. Are you able to share any of this NA12878 data for validation work? There might be some tweaks to bcbio alignment or variant calling we could do to improve sensitivity and specificity.

The data point that stands out to me is freebayes performing well on the MiSeq aligned data. This indicates that bwa-mem alignment heuristics might not be well tuned to this type of data. Have you tried using novoalign instead of bwa for the aligner to see how that impacts the results? Another idea is to force use of bwa-mem (tools_on: [bwa-mem]) or bwa aln (tools_off: [bwa-mem]):

http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#tweaking-defaults

This would help determine if the algorithm or alignment approach helps, or if this is all down to the trimming and other custom work that is done in the MiSeq pipeline. Thanks for helping look at this.

rjsicko commented 7 years ago

Hi Brad, Sure I can share NA12878 data. I emailed a google drive link with the files. Let me know if that works.

I attempted to use novoalign but it generated an error (I think the unlicensed version only supports single core) and I never pushed forward with it. The RTG alignment (RTG trimmed) did miss a few variants though.

I'll add I did find Katana which can soft-clip primer sequences from a bam file, but as with VarDict in amplicon mode, primer locations must be provided.

chapmanb commented 7 years ago

Bob; Thanks much for sharing the NA12878. As a starting point I've been working on replicating your analysis and I'm not getting similar numbers. I used the BED file in the drive folder (SCIDv2-2_targets_sorted_no_chr.bed) and I get a lot of off-target reads (65%) and smaller numbers than what you list above:

http://i.imgur.com/NNGoif8.png

Is this the right regions file or are there other tweaks in your bcbio run that I missed? Thanks again for helping share data for improving this.

rjsicko commented 7 years ago

The targets file looks right. The fastq I uploaded are from the 'align_prep' folder... I thought they were full fastq files just compressed and indexed by bcbio. In case they aren't what I thought they were, I've uploaded the original fastq files from the 'input' folder. I've also created checksum files for both trimmed and untrimmed fastq.

Looking at multiqc I got around ~70% on target.

Here is the yaml for the run

details:
  - files: [../input/12_S12_L001_R1_001.fastq.trim.fastq, ../input/12_S12_L001_R2_001.fastq.trim.fastq]
    description: NA12878-1
    metadata:
      sex: female
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      variantcaller: [freebayes, gatk-haplotype, samtools]
      mark_duplicates: false
      recalibrate: false
      realign: false
      coverage_interval: amplicon
      validate: giab-NA12878/truth_small_variants.vcf.gz
      validate_regions: giab-NA12878/truth_regions.bed
      variant_regions: ../input/SCIDv2-2_targets_sorted_no_chr.bed
      coverage: ../input/170213_SCID_Exons_w_20bp_Padding.sorted.b37.bed
      ensemble:
        numpass: 2
chapmanb commented 7 years ago

Bob; Thanks for the additional details. Apologies, the poor results I saw above were my silly error in forgetting to add mark_duplicates: false to the configuration. I've been working on trying to replicate the good results you got from pre-aligned BAMs with FreeBayes but have not been having much luck. I tried bcbio-based TruSeq trimming with cutadapt (https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/trim.py#L20) and a few different aligners but get pretty consistent results across all of them:

amplicon validation

This convinced me that the aligner or trimming is not going to do a ton, and I was still worried about the amount of offtarget so started to look at the differences between the MiSeq amplicon only alignments and bcbio whole genome alignments. Here's a BED file of all the off-target reads from bcbio which are not in amplicon regions:

https://gist.github.com/chapmanb/922f6f8ead66f6a65a04090a5d822f55

I looked in depth at one well-covered off-target region using BLAT and it appears as if we're getting better mapping to the non-target regions due to repeats. Here is one example where we hit with good identity to three regions, one of which is a target region while the other two are offtarget:

chr1:59096277-59096496

IDENTITY CHRO STRAND  START    END     SPAN
-------------------------------------------
100.0%    1   +   59096277  59096496    220
96.4%     1   +   33476194  33476413    220
100.0%    12  +   98610327  98610617    291

1       33476232        33476484        chr1:33476232:33476484:AK2.Cds.chr1.33476433.33476434:UserDefined       252     +

So MiSeqs approach of aligning to only amplicon regions forces these to map to the target region, leading to better coverage and the extra calls your seeing.

This is not a great approach, though, since you're forcing repetitive reads to map in amplicon regions even when they have better mapping elsewhere. This could easily lead to incorrect variant calls in the targets and is masking amplification of repetitive regions.

So this helps explain the differences you're seeing. Practically we could potentially improve detection if we do trimming, based on your analysis above, but we might need a different set of adapters to trim. What adapters did you use for your bcbio trimmed run above? This would get us close to the MiSeq results without forcing potentially off-target alignments and hopefully improve accuracy through bcbio. Does that sound reasonable? Thanks for the example cases and discussion.

rjsicko commented 7 years ago

Brad, thanks for taking a close look at this. Regarding the pre-aligned bam files, sorry I should have mentioned the freebayes variant calls were generated outside of bcbio. I tried providing the MSR bam files as input to bcbio but couldn't resolve the errors (maybe to do with the reference MSR used). For freebayes with pre-aligned bam I used defaults freebayes -f ../../../../reference/ucsc.hg19.fasta NA12878.bam

I have two thoughts on off-target reads. TruSeq uses Upstream Locus-Specific Oligo (ULSO) and Downstream Locus-Specific Oligo (DLSO) then a ligation reaction followed by amplification of the ligation product to target specific regions. Since the ULSO and DLSO should be unique to the target region, some of the off-target matches should be ruled out since the chemistry should not amplify off-target regions (idealized scenario, I know). The point I'm trying to make is maybe the imperfect match that is in the targeted region is correct even though alignment is better elsewhere (because of real variants in the targeted region). This seems to be the case with this sample since the extra calls are validated against GiaB (TP) and our FP rate is similar. That said, MSR does factor in off-target regions. The last 59 lines of the TSCA manifest file (on drive) shows regions that are known off-target possibilities.

For trimming, it is actually trimming the DLSO and ULSO from the reads, not universal adapters. I used scripts from this paper. The TSCA manifest file is used by the scripts (and probably would need to be by bcbio if this type of trimming was implemented). The paper deals with allele drop-out from TSCA by trimming DLSO and ULSO from the reads, alignment, variant calling, using the generated variants to go back to fastq and remove reads that had a variant in the oligo-binding region, another alignment round, another variant calling round (figure 2). I ran NA12878 through the process, but the few variants that were eliminated in the 'reads removed' vcf were TP by GiaB; it was the initial trimming that made the biggest difference. I think MSR's approach is leave the DLSO/ULSO and soft-clip them from the bam. Which maybe could be implemented in bcbio after alignment with Katana ?

thanks again for taking a look at this

chapmanb commented 7 years ago

Bob; Thanks for the details about the TruSeq approaches. I haven't dealt with this type of data so am learning as I go and this is really helpful.

What we'd ideally like to do here is tweak bcbio defaults or suggested amplicon settings to improve your sensitivity/specificity without implementing custom trimming and analysis specifically for TruSeq. All of the trimming and Katana options are bigger projects down the second path.

I tried a more general approach by allowing more multi-mappers so avoid the repeat issues but that also did not adjust sensitivity so maybe I'm going down the wrong path.

Would you be able to pass along the MiSeq VCF calls that are giving you the 124 TPs we're aiming at? This could help us focus in on the key differences between bcbio calls and MiSeq and help isolate a possible approach to retain these. Thanks again for the discussion and help debugging.

rjsicko commented 7 years ago

Brad, I've uploaded the VCF for this sample to drive: NA12878-1-miseq_b37_run1.vcf.gz I agree, tweaking bcbio settings would be ideal

chapmanb commented 7 years ago

Bob; Thanks for sending along this file. I ran some comparisons of this and a bcbio bwa/freebayes output against the GiaB 3.2.2 truth set. I'm getting different numbers than you but the MiSeq output slightly outperforms the bcbio calls with 120 TPs versus 118 TPs. When you look at the true positive call differences between the two, bcbio calls 6 correct positions that MiSeq misses (primarily indels) and MiSeq calls 8 positions that bcbio misses. This results in the 2 variant difference between the methods.

If I look at the calls bcbio misses, these are primarily (6 out of 8 cases) due to heterozygote/homozygote differences. MiSeq and the GiaB reference call these as homozygotes and bcbio/FreeBayes call them as homozygotes. Here is the MiSeq:

1       237038161       rs1770449       T       C       17254   PASS    .   GT:AD:DP:GQ:PL:VF:GQX   1/1:0,429:429:99:17254,1288,0:1:99

and FreeBayes calls:

1       237038161       .       T       C       5947.21 PASS   .  GT:GQ:DP:AD:RO:QR:AO:QA:GL      0/1:160:1597:1133,463:1133:36846:463:17283:-1074.19,0,-2834.96

The allele depths for these are pretty different, since MiSeq finds 0 reference alleles and 429 alternative alleles, which FreeBayes has 1133 reference bases and 463 alts. Hence the difference in calls.

If you look at the GiaB reference it has two different stories which could match either way. The ADALL parameter looks like homozygote (ADALL is "Net allele depths across all datasets") which AD indicates it could be a heterozygote (AD is "Net allele depths across all unfiltered datasets with called genotype"):

1       237038161       rs1770449       T       C       50      PASS    .      GT:DP:ADALL:AD:GQ:IGT:IPS:PS    1|1:858:0,357:77,413:505:1/1:.:PATMAT

So there is a bit of ambiguity in the truth set depending on how you filter reads.

In terms of practical ways forward, it looks like the MiSeq pipeline is doing some sort of read filtering that removes the reference calls in these cases, which is what gives it an improved sensitivity over bcbio. All of the reads at this position appear to have equivalent high mapping and reasonable base quality scores so I'm not sure what filters get applied to account for the difference. Do you have any insight from looking at the MiSeq approaches?

Thanks again for the files and discussion.

rjsicko commented 7 years ago

Brad, this variant at 1:237038161 is a perfect example of how primer clipping helps!

The IGV shots below all show MSR vs bcbio untrimmed vs bcbio trimmed in the 3 panels

soft clipped bases not displayed. Note: discrepant variant frequency shown for MSR vs untrimmed in coverage track chr1_237038161 soft clipped bases not displayed. Note: you can see the WT allele in bcbio untrimmed is actually part of the primer/probe for the overlapping amplicon chr1_237038161_2 soft-clipped bases displayed. MSR soft-clipped the primer/probe sequence from the reads to correct the allele balance, the bcbio trimmed bam had the primer/probe sequences trimmed from fastq files prior to input into bcbio chr1_237038161_3

I'm looking into the other differences between MSR and bcbio trimmed/untrimmed, but I did want to mention my 124 TPs from MSR includes 5 filtered variants. The default MSR filters were applied, so I used --all-records in RTG vcfeval. I figure I can try to filter false positives later.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  12
9   396749  rs2297077   T   G   131.51  LowGQ   AC=2;AF=1.0;AN=2;DP=4;QD=32.88;TI=NM_203447,NM_001193536,NM_001190458;GI=DOCK8,DOCK8,DOCK8;FC=Silent,Silent,Silent  GT:AD:DP:GQ:PL:VF:GQX   1/1:0,4:4:12.04:164,12,0:1.000:12
9   396750  rs2297078   C   A   132.51  LowGQ   AC=2;AF=1.0;AN=2;DP=4;QD=33.13;TI=NM_203447,NM_001193536,NM_001190458;GI=DOCK8,DOCK8,DOCK8;FC=Silent,Silent,Silent  GT:AD:DP:GQ:PL:VF:GQX   1/1:0,4:4:12.04:165,12,0:1.000:12
9   432140  .   CTTT    CT,C    19157   R8  AC=1,0;AF=0.5,0.0;AN=2;DP=771;QD=24.85;TI=NM_203447,NM_001193536,NM_001190458;GI=DOCK8,DOCK8,DOCK8;FC=Noncoding,Noncoding,Noncoding GT:AD:DP:GQ:PL:VF:GQX   0/1:464,157,150:1715:99:19157,0,15919,10085,16026,35578:0.398:99
11  118210424   .   TAC TACAC,TACACAC,T 9268    R8  AC=0,1,0;AF=0.0,0.5,0.0;AN=2;DP=546;QD=16.97;TI=NM_000732,NM_001040651;GI=CD3D,CD3D;FC=Noncoding,Noncoding  GT:AD:DP:GQ:PL:VF:GQX   0/2:274,159,9,104:695:99:9268,3590,14546,0,9427,19335,6102,7406,4843,19662:0.498:99
17  26861605    .   CA  CAA,C   8149    R8  AC=1,0;AF=0.5,0.0;AN=2;DP=1194;QD=6.82;TI=NM_003593;GI=FOXN1;FC=Noncoding   GT:AD:DP:GQ:PL:VF:GQX   0/1:854,73,267:1884:99:8149,0,15916,4951,6719,24238:0.285:99

thanks again for your time on this

mjafin commented 7 years ago

This is what VarDict has built in to deal with. I would push for Illumina to provide the coordinates for the primers as clipping the reads introduces other artefacts.

rjsicko commented 7 years ago

Miika, thanks for reminding me to get back to VarDict. I bailed on my idea to BLAST ULSO/DLSO from the TruSeq manifest file to generate a primer bed; I was concerned handling multiple hits and making the correct hit in the target region was used would turn into a big problem. I tried VarDict on a MSR aligned BAM and found it called fewer variants (rtg vcfeval below) than other callers so I wasn't convinced amplicon mode would improve the sensitivity.

Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
  240.000                102            102         21         29     0.8293       0.7786     0.8031
     None                112            111        695         19     0.1377       0.8550     0.2372
chapmanb commented 7 years ago

Bob; Thanks for the working through the trimming examples. It does seem like we need to do some form of removal for ULSO/DLSO linkers to avoid these remaining issues since there is no other clean way to exclude those regions downstream.

I attempted a couple of different trimming approaches using a brute force approach with all of the ULSO/DLSOs in a single file. cutadapt was too slow to manage this. skewer did work after some recompiling of the tool (since it defaults to only allowing 96 inputs) but just feeding the full list of *LSOs as trimming targets for all inputs results in too much trimming and a loss in coverage and sensitivity.

It seems like for TruSeq we really need a custom approach that does:

Does anyone know of any existing tools that do any of this we could use as part of bcbio? I'm currently stuck trying to think of a general way to tackle this that would also benefit other non-TruSeq projects, but open to suggestions.

Thanks again for all the discussion and help with this.

rdocking commented 7 years ago

We've used Cutadapt in the past for this kind of issue with primer-trimming. Unfortunately, this ends up being a bit slow when there are many different potential adapter/primer sequences to look for.

tommyau commented 7 years ago

BAMClipper is developed to clip ULSO/DLSO sequence from original SAM alignments. It needs a BAM file (e.g. BWA-MEM alignments of original reads) and a BEDPE file (e.g. converted from Illumina Manifest file) and returns a primer clipped BAM file. It was tested to work with our Illumina TruSight Myeloid Panel and thus should also work with other TruSeq Cancer or Custom Amplicon panels. Since only BEDPE is needed, it also works with any other panel having BEDPE, e.g. Qiagen GeneRead v2 panel was also tested to work.

Additional information is also described in Scientific Reports 7:1567 published this Monday. For example, it is much faster than Cutadapt (Supplementary Figure S3), particularly for larger number of amplicons.

rjsicko commented 7 years ago

I can confirm BAMClipper worked on my TSCA panel. Clipped BAM was the same as MSR

rjsicko commented 7 years ago

Hi Brad, I'm just coming back to this as I've gotten another batch of data that I'd like to run through bcbio.

Is there an easy way to incorporate BAMClipper into the bcbio pipeline? Alternatively, I think I can have a bcbio run with just alignment, then a script to soft-clip bcbio's BAMs with BAMClipper and finally a bcbio variant calling run with the BAMClipper output.

Thanks again, Bob

chapmanb commented 7 years ago

Tommy and Bob; Thanks so much for this and sorry for not responding earlier. I'd definitely like to include this in bcbio but haven't yet had time to put this in place. I'll try to make time soon to work on it. Thanks for BAMClipper and for encouraging us to support this in bcbio.

rjsicko commented 6 years ago

I just noticed that Tommy created a pipemode for BAMClipper. Leaving this here as that might make incorporating into bcbio easier.

chapmanb commented 6 years ago

Bob -- thanks again for continuing to ping about this. Apologies that it keeps slipping down the list of things to do with other on going work. It's definitely something we want but we haven't had any TruSeq projects internally to prioritize adding support yet. I still hope to tackle it and appreciate Tommy continuing to work on it.

rjsicko commented 6 years ago

Brad, no problem. I completely understand the priority being on internal projects utilizing bcbio.

zzupmc commented 4 years ago

Hi, trimprimers from fgbio (http://fulcrumgenomics.github.io/fgbio/tools/latest/TrimPrimers.html) seems do the same work with bamclipper.