Closed rjsicko closed 4 years ago
I'm interested in this as well as we will have to run some amplicon samples soon, though I was looking at AstraZeneca's VarDict in Amplicon mode (https://github.com/AstraZeneca-NGS/VarDict/wiki/Amplicon-Mode-in-VarDict). Did you limit variant calls to the regions of the amplicons or call across the whole genome? The F-measure for bcbio/FreeBayes with trimmed reads is pretty much identical to the MSR workflow, but the drop in precision makes me wonder if this was for global calls or just calls in the Amplicon sequences.
I provided the target regions bed to bcbio so calls are limited to those regions. For the comparison numbers (rtg vcfeval) I used bcbios "callable bed" which is the intersection of the targets bed and calculated callable regions based on coverage. Yes trimming gets it close for this sample, but I think the extra FN is causing the precision drop. Trimming as I'm currently doing it (scripts from paper above) is also slow and outside of bcbio.
I too was interested in Vardict in amplicon mode but creation of the necessary bed file from the manifest (provides primer sequences not locations) was an issue. I did try VarDict within bcbio and using the MSR bam file but results were not good (all default parameters)
Bob; Thanks for the discussion on improving amplicon and panel sequencing. Are you able to share any of this NA12878 data for validation work? There might be some tweaks to bcbio alignment or variant calling we could do to improve sensitivity and specificity.
The data point that stands out to me is freebayes performing well on the MiSeq aligned data. This indicates that bwa-mem alignment heuristics might not be well tuned to this type of data. Have you tried using novoalign instead of bwa for the aligner
to see how that impacts the results? Another idea is to force use of bwa-mem (tools_on: [bwa-mem]
) or bwa aln (tools_off: [bwa-mem]
):
http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#tweaking-defaults
This would help determine if the algorithm or alignment approach helps, or if this is all down to the trimming and other custom work that is done in the MiSeq pipeline. Thanks for helping look at this.
Hi Brad, Sure I can share NA12878 data. I emailed a google drive link with the files. Let me know if that works.
I attempted to use novoalign but it generated an error (I think the unlicensed version only supports single core) and I never pushed forward with it. The RTG alignment (RTG trimmed) did miss a few variants though.
I'll add I did find Katana which can soft-clip primer sequences from a bam file, but as with VarDict in amplicon mode, primer locations must be provided.
Bob;
Thanks much for sharing the NA12878. As a starting point I've been working on replicating your analysis and I'm not getting similar numbers. I used the BED file in the drive folder (SCIDv2-2_targets_sorted_no_chr.bed
) and I get a lot of off-target reads (65%) and smaller numbers than what you list above:
http://i.imgur.com/NNGoif8.png
Is this the right regions file or are there other tweaks in your bcbio run that I missed? Thanks again for helping share data for improving this.
The targets file looks right. The fastq I uploaded are from the 'align_prep' folder... I thought they were full fastq files just compressed and indexed by bcbio. In case they aren't what I thought they were, I've uploaded the original fastq files from the 'input' folder. I've also created checksum files for both trimmed and untrimmed fastq.
Looking at multiqc I got around ~70% on target.
Here is the yaml for the run
details:
- files: [../input/12_S12_L001_R1_001.fastq.trim.fastq, ../input/12_S12_L001_R2_001.fastq.trim.fastq]
description: NA12878-1
metadata:
sex: female
analysis: variant2
genome_build: GRCh37
algorithm:
aligner: bwa
variantcaller: [freebayes, gatk-haplotype, samtools]
mark_duplicates: false
recalibrate: false
realign: false
coverage_interval: amplicon
validate: giab-NA12878/truth_small_variants.vcf.gz
validate_regions: giab-NA12878/truth_regions.bed
variant_regions: ../input/SCIDv2-2_targets_sorted_no_chr.bed
coverage: ../input/170213_SCID_Exons_w_20bp_Padding.sorted.b37.bed
ensemble:
numpass: 2
Bob;
Thanks for the additional details. Apologies, the poor results I saw above were my silly error in forgetting to add mark_duplicates: false
to the configuration. I've been working on trying to replicate the good results you got from pre-aligned BAMs with FreeBayes but have not been having much luck. I tried bcbio-based TruSeq trimming with cutadapt (https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/trim.py#L20) and a few different aligners but get pretty consistent results across all of them:
This convinced me that the aligner or trimming is not going to do a ton, and I was still worried about the amount of offtarget so started to look at the differences between the MiSeq amplicon only alignments and bcbio whole genome alignments. Here's a BED file of all the off-target reads from bcbio which are not in amplicon regions:
https://gist.github.com/chapmanb/922f6f8ead66f6a65a04090a5d822f55
I looked in depth at one well-covered off-target region using BLAT and it appears as if we're getting better mapping to the non-target regions due to repeats. Here is one example where we hit with good identity to three regions, one of which is a target region while the other two are offtarget:
chr1:59096277-59096496
IDENTITY CHRO STRAND START END SPAN
-------------------------------------------
100.0% 1 + 59096277 59096496 220
96.4% 1 + 33476194 33476413 220
100.0% 12 + 98610327 98610617 291
1 33476232 33476484 chr1:33476232:33476484:AK2.Cds.chr1.33476433.33476434:UserDefined 252 +
So MiSeqs approach of aligning to only amplicon regions forces these to map to the target region, leading to better coverage and the extra calls your seeing.
This is not a great approach, though, since you're forcing repetitive reads to map in amplicon regions even when they have better mapping elsewhere. This could easily lead to incorrect variant calls in the targets and is masking amplification of repetitive regions.
So this helps explain the differences you're seeing. Practically we could potentially improve detection if we do trimming, based on your analysis above, but we might need a different set of adapters to trim. What adapters did you use for your bcbio trimmed run above? This would get us close to the MiSeq results without forcing potentially off-target alignments and hopefully improve accuracy through bcbio. Does that sound reasonable? Thanks for the example cases and discussion.
Brad, thanks for taking a close look at this. Regarding the pre-aligned bam files, sorry I should have mentioned the freebayes variant calls were generated outside of bcbio. I tried providing the MSR bam files as input to bcbio but couldn't resolve the errors (maybe to do with the reference MSR used). For freebayes with pre-aligned bam I used defaults freebayes -f ../../../../reference/ucsc.hg19.fasta NA12878.bam
I have two thoughts on off-target reads. TruSeq uses Upstream Locus-Specific Oligo (ULSO) and Downstream Locus-Specific Oligo (DLSO) then a ligation reaction followed by amplification of the ligation product to target specific regions. Since the ULSO and DLSO should be unique to the target region, some of the off-target matches should be ruled out since the chemistry should not amplify off-target regions (idealized scenario, I know). The point I'm trying to make is maybe the imperfect match that is in the targeted region is correct even though alignment is better elsewhere (because of real variants in the targeted region). This seems to be the case with this sample since the extra calls are validated against GiaB (TP) and our FP rate is similar. That said, MSR does factor in off-target regions. The last 59 lines of the TSCA manifest file (on drive) shows regions that are known off-target possibilities.
For trimming, it is actually trimming the DLSO and ULSO from the reads, not universal adapters. I used scripts from this paper. The TSCA manifest file is used by the scripts (and probably would need to be by bcbio if this type of trimming was implemented). The paper deals with allele drop-out from TSCA by trimming DLSO and ULSO from the reads, alignment, variant calling, using the generated variants to go back to fastq and remove reads that had a variant in the oligo-binding region, another alignment round, another variant calling round (figure 2). I ran NA12878 through the process, but the few variants that were eliminated in the 'reads removed' vcf were TP by GiaB; it was the initial trimming that made the biggest difference. I think MSR's approach is leave the DLSO/ULSO and soft-clip them from the bam. Which maybe could be implemented in bcbio after alignment with Katana ?
thanks again for taking a look at this
Bob; Thanks for the details about the TruSeq approaches. I haven't dealt with this type of data so am learning as I go and this is really helpful.
What we'd ideally like to do here is tweak bcbio defaults or suggested amplicon settings to improve your sensitivity/specificity without implementing custom trimming and analysis specifically for TruSeq. All of the trimming and Katana options are bigger projects down the second path.
I tried a more general approach by allowing more multi-mappers so avoid the repeat issues but that also did not adjust sensitivity so maybe I'm going down the wrong path.
Would you be able to pass along the MiSeq VCF calls that are giving you the 124 TPs we're aiming at? This could help us focus in on the key differences between bcbio calls and MiSeq and help isolate a possible approach to retain these. Thanks again for the discussion and help debugging.
Brad, I've uploaded the VCF for this sample to drive: NA12878-1-miseq_b37_run1.vcf.gz
I agree, tweaking bcbio settings would be ideal
Bob; Thanks for sending along this file. I ran some comparisons of this and a bcbio bwa/freebayes output against the GiaB 3.2.2 truth set. I'm getting different numbers than you but the MiSeq output slightly outperforms the bcbio calls with 120 TPs versus 118 TPs. When you look at the true positive call differences between the two, bcbio calls 6 correct positions that MiSeq misses (primarily indels) and MiSeq calls 8 positions that bcbio misses. This results in the 2 variant difference between the methods.
If I look at the calls bcbio misses, these are primarily (6 out of 8 cases) due to heterozygote/homozygote differences. MiSeq and the GiaB reference call these as homozygotes and bcbio/FreeBayes call them as homozygotes. Here is the MiSeq:
1 237038161 rs1770449 T C 17254 PASS . GT:AD:DP:GQ:PL:VF:GQX 1/1:0,429:429:99:17254,1288,0:1:99
and FreeBayes calls:
1 237038161 . T C 5947.21 PASS . GT:GQ:DP:AD:RO:QR:AO:QA:GL 0/1:160:1597:1133,463:1133:36846:463:17283:-1074.19,0,-2834.96
The allele depths for these are pretty different, since MiSeq finds 0 reference alleles and 429 alternative alleles, which FreeBayes has 1133 reference bases and 463 alts. Hence the difference in calls.
If you look at the GiaB reference it has two different stories which could match either way. The ADALL
parameter looks like homozygote (ADALL is "Net allele depths across all datasets") which AD
indicates it could be a heterozygote (AD is "Net allele depths across all unfiltered datasets with called genotype"):
1 237038161 rs1770449 T C 50 PASS . GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|1:858:0,357:77,413:505:1/1:.:PATMAT
So there is a bit of ambiguity in the truth set depending on how you filter reads.
In terms of practical ways forward, it looks like the MiSeq pipeline is doing some sort of read filtering that removes the reference calls in these cases, which is what gives it an improved sensitivity over bcbio. All of the reads at this position appear to have equivalent high mapping and reasonable base quality scores so I'm not sure what filters get applied to account for the difference. Do you have any insight from looking at the MiSeq approaches?
Thanks again for the files and discussion.
Brad, this variant at 1:237038161 is a perfect example of how primer clipping helps!
The IGV shots below all show MSR vs bcbio untrimmed vs bcbio trimmed in the 3 panels
soft clipped bases not displayed. Note: discrepant variant frequency shown for MSR vs untrimmed in coverage track
soft clipped bases not displayed. Note: you can see the WT allele in bcbio untrimmed is actually part of the primer/probe for the overlapping amplicon
soft-clipped bases displayed. MSR soft-clipped the primer/probe sequence from the reads to correct the allele balance, the bcbio trimmed bam had the primer/probe sequences trimmed from fastq files prior to input into bcbio
I'm looking into the other differences between MSR and bcbio trimmed/untrimmed, but I did want to mention my 124 TPs from MSR includes 5 filtered variants. The default MSR filters were applied, so I used --all-records in RTG vcfeval. I figure I can try to filter false positives later.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 12
9 396749 rs2297077 T G 131.51 LowGQ AC=2;AF=1.0;AN=2;DP=4;QD=32.88;TI=NM_203447,NM_001193536,NM_001190458;GI=DOCK8,DOCK8,DOCK8;FC=Silent,Silent,Silent GT:AD:DP:GQ:PL:VF:GQX 1/1:0,4:4:12.04:164,12,0:1.000:12
9 396750 rs2297078 C A 132.51 LowGQ AC=2;AF=1.0;AN=2;DP=4;QD=33.13;TI=NM_203447,NM_001193536,NM_001190458;GI=DOCK8,DOCK8,DOCK8;FC=Silent,Silent,Silent GT:AD:DP:GQ:PL:VF:GQX 1/1:0,4:4:12.04:165,12,0:1.000:12
9 432140 . CTTT CT,C 19157 R8 AC=1,0;AF=0.5,0.0;AN=2;DP=771;QD=24.85;TI=NM_203447,NM_001193536,NM_001190458;GI=DOCK8,DOCK8,DOCK8;FC=Noncoding,Noncoding,Noncoding GT:AD:DP:GQ:PL:VF:GQX 0/1:464,157,150:1715:99:19157,0,15919,10085,16026,35578:0.398:99
11 118210424 . TAC TACAC,TACACAC,T 9268 R8 AC=0,1,0;AF=0.0,0.5,0.0;AN=2;DP=546;QD=16.97;TI=NM_000732,NM_001040651;GI=CD3D,CD3D;FC=Noncoding,Noncoding GT:AD:DP:GQ:PL:VF:GQX 0/2:274,159,9,104:695:99:9268,3590,14546,0,9427,19335,6102,7406,4843,19662:0.498:99
17 26861605 . CA CAA,C 8149 R8 AC=1,0;AF=0.5,0.0;AN=2;DP=1194;QD=6.82;TI=NM_003593;GI=FOXN1;FC=Noncoding GT:AD:DP:GQ:PL:VF:GQX 0/1:854,73,267:1884:99:8149,0,15916,4951,6719,24238:0.285:99
thanks again for your time on this
This is what VarDict has built in to deal with. I would push for Illumina to provide the coordinates for the primers as clipping the reads introduces other artefacts.
Miika, thanks for reminding me to get back to VarDict. I bailed on my idea to BLAST ULSO/DLSO from the TruSeq manifest file to generate a primer bed; I was concerned handling multiple hits and making the correct hit in the target region was used would turn into a big problem. I tried VarDict on a MSR aligned BAM and found it called fewer variants (rtg vcfeval below) than other callers so I wasn't convinced amplicon mode would improve the sensitivity.
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
240.000 102 102 21 29 0.8293 0.7786 0.8031
None 112 111 695 19 0.1377 0.8550 0.2372
Bob; Thanks for the working through the trimming examples. It does seem like we need to do some form of removal for ULSO/DLSO linkers to avoid these remaining issues since there is no other clean way to exclude those regions downstream.
I attempted a couple of different trimming approaches using a brute force approach with all of the ULSO/DLSOs in a single file. cutadapt was too slow to manage this. skewer did work after some recompiling of the tool (since it defaults to only allowing 96 inputs) but just feeding the full list of *LSOs as trimming targets for all inputs results in too much trimming and a loss in coverage and sensitivity.
It seems like for TruSeq we really need a custom approach that does:
Does anyone know of any existing tools that do any of this we could use as part of bcbio? I'm currently stuck trying to think of a general way to tackle this that would also benefit other non-TruSeq projects, but open to suggestions.
Thanks again for all the discussion and help with this.
We've used Cutadapt in the past for this kind of issue with primer-trimming. Unfortunately, this ends up being a bit slow when there are many different potential adapter/primer sequences to look for.
BAMClipper is developed to clip ULSO/DLSO sequence from original SAM alignments. It needs a BAM file (e.g. BWA-MEM alignments of original reads) and a BEDPE file (e.g. converted from Illumina Manifest file) and returns a primer clipped BAM file. It was tested to work with our Illumina TruSight Myeloid Panel and thus should also work with other TruSeq Cancer or Custom Amplicon panels. Since only BEDPE is needed, it also works with any other panel having BEDPE, e.g. Qiagen GeneRead v2 panel was also tested to work.
Additional information is also described in Scientific Reports 7:1567 published this Monday. For example, it is much faster than Cutadapt (Supplementary Figure S3), particularly for larger number of amplicons.
I can confirm BAMClipper worked on my TSCA panel. Clipped BAM was the same as MSR
Hi Brad, I'm just coming back to this as I've gotten another batch of data that I'd like to run through bcbio.
Is there an easy way to incorporate BAMClipper into the bcbio pipeline? Alternatively, I think I can have a bcbio run with just alignment, then a script to soft-clip bcbio's BAMs with BAMClipper and finally a bcbio variant calling run with the BAMClipper output.
Thanks again, Bob
Tommy and Bob; Thanks so much for this and sorry for not responding earlier. I'd definitely like to include this in bcbio but haven't yet had time to put this in place. I'll try to make time soon to work on it. Thanks for BAMClipper and for encouraging us to support this in bcbio.
I just noticed that Tommy created a pipemode for BAMClipper. Leaving this here as that might make incorporating into bcbio easier.
Bob -- thanks again for continuing to ping about this. Apologies that it keeps slipping down the list of things to do with other on going work. It's definitely something we want but we haven't had any TruSeq projects internally to prioritize adding support yet. I still hope to tackle it and appreciate Tommy continuing to work on it.
Brad, no problem. I completely understand the priority being on internal projects utilizing bcbio.
Hi, trimprimers from fgbio (http://fulcrumgenomics.github.io/fgbio/tools/latest/TrimPrimers.html) seems do the same work with bamclipper.
I want to use bcbio to process TruSeq Custom Amplicon (TSCA) data. However, I notice the performance of bcbio's variant2 workflow is not great (comparison to MiSeq reporter workflow below).
I compared bcbio calls for NA12878 to the MSR workflow and found it performs better than bcbio variant2 workflow (untrimmed or primer trimmed reads**, bwa aligned, ensemble calling with freebayes/gatk-haplotype/samtools).
I used the MSR bam files as input into a couple variant callers and get results close to MSR workflow. MSR workflow keeps primer sequences and aligns against target amplicon sequences and I believe soft-clips the primer sequences prior to variant calling.
** - I used scripts discussed in this paper to trim primers identified in the manifest file from the fastq files. Trimming primer sequences helps somewhat, but as implemented it is very slow.
Comparison![image](https://cloud.githubusercontent.com/assets/7218576/23414392/b8eb6296-fda9-11e6-8d44-ce6333840fa7.png)
Also discussed here.