arecto0 opened this issue 1 month ago (status: Open)
Hi @arecto0, thanks for your interest in the workflow!
I am running the wf-amplicon on PCR amplicons covering a complete virus genome (4 overlapping long amplicons per sample, pooled in one barcode) and map the reads to a reference genome of this virus to determine the viral genomic sequence in my sample
Could you please clarify: are you running the workflow without a reference and then manually mapping the reads against your reference, or do you provide the reference when running the workflow as well?
Also please note that generally wf-amplicon has not been designed for tiled amplicon schemes.
I run the workflow with a reference (so in variant calling mode). To easily check whether the generated consensus is in accordance with the reads, I run wf-amplicon again using the consensus from the first run as the reference. I would expect the consensus to reflect the majority of variants present in the reads, but this is not always the case.
Would you recommend using another Epi2ME workflow for analysis of tiled amplicon sequence data?
I run the wf-amplicon again using the consensus that was generated in the first run as reference. I would expect the consensus to reflect the majority of variants present in the reads, but this is not always the case.
Sorry, I still don't fully understand. Are you speaking about the first or second consensus? Also, are you viewing the BAMs created by the workflow in IGV or aligning the raw reads against the consensus yourself (this makes a difference because my suspicion is that your issue is related to how reads are filtered during pre-processing)?
In general, when running variant calling mode with several amplicons per sample, we recommend setting --drop_frac_longest_reads 0 --take_longest_remaining_reads false. You could give this a try and see if the problem persists.
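A minimal sketch of such a run (the input paths and reference name are placeholders; only the two filtering flags come from the recommendation above):

```shell
# Sketch only: fastq_pass/ and reference.fasta are placeholder paths.
# The two flags disable the longest-read selection, which is the
# recommended setting when a sample contains several amplicons.
nextflow run epi2me-labs/wf-amplicon \
    --fastq fastq_pass/ \
    --reference reference.fasta \
    --drop_frac_longest_reads 0 \
    --take_longest_remaining_reads false
```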
Would you recommend using another Epi2ME workflow for analysis of tiled amplicon sequence data?
Unfortunately there is no dedicated workflow for tiled amplicons at the moment.
I view the BAMs created by the workflow, so I do NOT align the raw reads against the consensus myself. The consensus generated by the first wf-amplicon run (using a more general reference) and by the second run (using the consensus of the first run as reference) do not show any differences.
I have now tried the wf-amplicon without a reference, and then using the de novo generated consensus as reference in a new wf-amplicon run in variant calling mode, and this seems to work better.
OK, got it now. You are seeing variants in the BAM output by the wf that are not reflected in the VCF / consensus. Could you please share the reference used in the first wf-amplicon run alongside the output files (VCF, consensus FASTA, BAM) for one of the samples where you see this issue? Thanks.
I have now tried the wf-amplicon without a reference

Please note that de novo mode has not been tested with tiled amplicons and might give sub-optimal results.
medaka.consensus.fasta.txt
reference_sanitized_seqIDs.fasta.txt
I cannot seem to upload the VCF file and the bam file here...
BAM file does not upload...
Could not upload these so I am trying like this…
Just had a look, and it appears the variants listed in the VCF are indeed present in the consensus. If you can't share the BAM (it might be too big), could you instead send a screenshot of the IGV view (of the first BAM, with the reference) showing a variant that is missing from the consensus?
This is the problem I referred to: when I run the wf-amplicon again, using the consensus that was generated in the first run, and then look at the BAM file, I see that in some positions the consensus does not reflect what is in the reads.
Hmm, hard to tell what's going on there without the input data or generated BAM. How big is your BAM? The size limit for files on GitHub is 25 MB. You could try sub-sampling the BAM with samtools view to a smaller size and uploading that instead.
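For example (a sketch, assuming samtools is installed; file names are placeholders, and -s 42.1 keeps roughly 10% of reads using seed 42):

```shell
# Subsample ~10% of the alignments (seed 42, fraction .1) and re-index.
samtools view -b -s 42.1 aligned.sorted.bam > aligned.sub.bam
samtools index aligned.sub.bam
```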
In the VCF there actually are variants in this region, but they didn't have enough depth to make it past the low-depth filter (the threshold is 20 by default):
HRSVB_MZ515836_CG_ 5300 . C T 43.644 LOW_DEPTH AR=0,1;DP=19;DPS=3,16;DPSP=19;SC=683,3648,710,3783;SR=0,0,3,15 GT:GQ 1:44
HRSVB_MZ515836_CG_ 5322 . T G 55.715 LOW_DEPTH AR=0,0;DP=19;DPS=3,16;DPSP=19;SC=597,3201,624,3345;SR=0,0,3,16 GT:GQ 1:56
HRSVB_MZ515836_CG_ 5325 . C T 40.594 LOW_DEPTH AR=0,0;DP=19;DPS=3,16;DPSP=19;SC=592,3181,619,3325;SR=0,0,3,16 GT:GQ 1:41
HRSVB_MZ515836_CG_ 5327 . C T 44.233 LOW_DEPTH AR=0,0;DP=19;DPS=3,16;DPSP=19;SC=607,3261,634,3405;SR=0,0,3,16 GT:GQ 1:44
HRSVB_MZ515836_CG_ 5335 . C T 43.265 LOW_DEPTH AR=0,0;DP=19;DPS=3,16;DPSP=19;SC=602,3266,629,3410;SR=0,0,3,16 GT:GQ 1:43
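Records like these can be pulled out of the VCF with a one-liner; a sketch (calls.vcf is a placeholder file name), relying on the fact that column 7 of a VCF record is the FILTER field:

```shell
# List position and filter status of all records that failed a filter.
awk -F'\t' '!/^#/ && $7 != "PASS" && $7 != "." { print $2, $7 }' calls.vcf
```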
aligned.sorted.zip

It worked by zipping; should have tried that earlier...
Thanks; is this the BAM from the original run (with the original reference)? If not, could you please share that as well?
This is the bam I just uploaded. I will also upload the bam from the second run, with the consensus as reference. aligned.sorted.zip
Ah OK, gotcha. I'm fairly confident now that the pre-Medaka downsampling is the cause of the issue.
Some background info: Medaka's performance degrades at coverages above 400x. We thus downsample the BAMs before variant calling. The workflow does this by simply selecting the 150 longest reads per strand for each reference sequence (this number can be changed with --medaka_target_depth_per_strand). Since the workflow has not been made with tiling schemes in mind, it is a fair assumption that most reads (or at least the longest ones, in the case of RBK data) cover most of the respective amplicon. However, since you are running tiled amplicons and each read only covers a fraction of the reference, not enough reads mapping to a particular region are selected. So even though the right variants are called, they are flagged by the depth filter and not included in the consensus.
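The arithmetic behind this can be sketched roughly (assuming the default of 150 reads per strand and, optimistically, an even spread of reads over the 4 tiled amplicons; real data will be more skewed):

```shell
# Back-of-envelope: reads kept after downsampling vs. depth per region.
awk 'BEGIN {
    selected = 150 * 2   # 150 longest reads per strand are kept
    amplicons = 4        # each read covers only one of the 4 tiles
    printf "approx. depth per region: %d\n", selected / amplicons
}'
```

75x on average leaves little headroom: any skew towards one amplicon easily pushes other regions below the depth filter's default threshold of 20.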
There are two things you can do to get around this:
1. Set --medaka_target_depth_per_strand to a very high value that makes sure no reads are dropped during that filtering step. This is the easier approach, but I would certainly not recommend it, as it (i) suffers from all the usual tiled amplicon problems (like potentially very uneven coverage) and (ii) might drastically exceed the recommended coverage for Medaka in some regions.

We will update the docs to make it more explicit that tiled amplicons are not supported.
Thank you for the suggestions! I have tried both options and both seem to work fine. Option 1 is the easiest since it immediately results in a complete genome, so I prefer to work with this. I have now set the Medaka target depth per strand at 400, and this seems to do the trick. But I assume this will depend on the sequencing depth (mean coverage) of the sequencing run, which was relatively low in this run. So for runs with a higher mean coverage I would have to set a higher target depth, is this correct?
Ask away!
I am running the wf-amplicon on PCR amplicons covering a complete virus genome (4 overlapping long amplicons per sample, pooled in one barcode) and map the reads to a reference genome of this virus to determine the viral genomic sequence in my sample. But when I compare the consensus sequence to the mapped reads in IGV, I notice that at a number of positions the base called in the consensus does not reflect the base present at that position in the vast majority of the reads, even when this base is present in 99% of the reads. What is the explanation for this, and can I prevent it?