epi2me-labs / wf-amplicon

Consensus sequence generated by wf-amplicon in variant calling mode does not reflect majority of reads as seen in igv #23

Open arecto0 opened 1 month ago

arecto0 commented 1 month ago

I am running wf-amplicon on PCR amplicons covering a complete virus genome (4 overlapping long amplicons per sample, pooled in one barcode) and mapping the reads to a reference genome of this virus to determine the viral genomic sequence in my sample. But when I compare the consensus sequence to the mapped reads in IGV, I notice that at a number of positions the base called in the consensus does not reflect the base present at that position in the vast majority of the reads, even when that base is present in 99% of the reads. What is the explanation for this, and can I prevent it?

julibeg commented 1 month ago

Hi @arecto0, thanks for your interest in the workflow!

> I am running wf-amplicon on PCR amplicons covering a complete virus genome (4 overlapping long amplicons per sample, pooled in one barcode) and mapping the reads to a reference genome of this virus to determine the viral genomic sequence in my sample.

Could you please clarify: are you running the workflow without a reference and then manually mapping the reads against your reference, or do you provide the reference when running the workflow as well?

Also, please note that wf-amplicon has generally not been designed for tiled amplicon schemes.

arecto0 commented 1 month ago

I run the workflow with a reference (so in variant calling mode). To easily check whether the generated consensus is in accordance with the reads, I run wf-amplicon again using the consensus generated in the first run as the reference. I would expect the consensus to reflect the majority of variants present in the reads, but this is not always the case.

Would you recommend using another EPI2ME workflow for analysis of tiled amplicon sequence data?

julibeg commented 1 month ago

> I run wf-amplicon again using the consensus generated in the first run as the reference. I would expect the consensus to reflect the majority of variants present in the reads, but this is not always the case.

Sorry, I still don't fully understand. Are you talking about the first or the second consensus? Also, are you viewing the BAMs created by the workflow in IGV, or aligning the raw reads against the consensus yourself? (This makes a difference because my suspicion is that your issue is related to how reads are filtered during pre-processing.)

In general, when running variant calling mode with several amplicons per sample, we recommend setting --drop_frac_longest_reads 0 --take_longest_remaining_reads false. You could give this a try and see if the problem persists.
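For reference, a full invocation with these settings might look something like the sketch below (paths are placeholders, not taken from this thread):

```bash
# Hypothetical invocation; adjust paths and any other options to your setup.
nextflow run epi2me-labs/wf-amplicon \
    --fastq /path/to/fastq_pass \
    --reference /path/to/reference.fasta \
    --drop_frac_longest_reads 0 \
    --take_longest_remaining_reads false \
    --out_dir output
```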

> Would you recommend using another EPI2ME workflow for analysis of tiled amplicon sequence data?

Unfortunately there is no dedicated workflow for tiled amplicons at the moment.

arecto0 commented 1 month ago

I view the BAMs created by the workflow, so I do NOT align the raw reads against the consensus myself. The consensus generated by the first wf-amplicon run (using a more general reference) and the one from the second run (using the consensus of the first run as reference) do not show any differences.

arecto0 commented 1 month ago

I have now tried running wf-amplicon without a reference and then used the de novo consensus as the reference in a new wf-amplicon run in variant calling mode, and this seems to work better.
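A rough sketch of those two steps (command lines and output paths below are illustrative placeholders only; the actual location of the de novo consensus depends on the workflow output layout):

```bash
# Hypothetical two-step sketch; file and directory names are placeholders.
# Step 1: de novo mode (no reference supplied).
nextflow run epi2me-labs/wf-amplicon \
    --fastq /path/to/fastq_pass \
    --out_dir denovo_run

# Step 2: variant calling mode, using the consensus produced in step 1 as reference.
nextflow run epi2me-labs/wf-amplicon \
    --fastq /path/to/fastq_pass \
    --reference denovo_run/path/to/consensus.fasta \
    --out_dir variant_calling_run
```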

julibeg commented 1 month ago

OK, got it now. You are seeing variants in the BAM output by the workflow that are not reflected in the VCF / consensus. Could you please share the reference used in the first wf-amplicon run alongside the output files (VCF, consensus FASTA, BAM) for one of the samples where you see this issue? Thanks.

> I have now tried running wf-amplicon without a reference [...]

Please note that de novo mode has not been tested with tiled amplicons and might give sub-optimal results.

arecto0 commented 1 month ago

medaka.consensus.fasta.txt
reference_sanitized_seqIDs.fasta.txt

I cannot seem to upload the VCF file and the BAM file here...

arecto0 commented 1 month ago

medaka.annotated.vcf.gz

BAM file does not upload...

arecto0 commented 1 month ago

Could not upload these, so I am trying like this…

julibeg commented 1 month ago

Just had a look and it appears the variants listed in the VCF are indeed present in the consensus. If you can't share the BAM (it might be too big), could you instead send a screenshot of the IGV view (of the first BAM, with the original reference) showing a variant that you are missing from the consensus?
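(For anyone wanting to reproduce this kind of check: one rough way is to re-apply the passing variants from the VCF to the reference and compare the result with the workflow's consensus FASTA. A sketch, assuming bcftools and samtools are installed; the workflow may additionally mask low-depth regions, so exact agreement is not guaranteed.)

```bash
# Hypothetical sanity check, using the file names shared in this thread
# (after renaming the .txt uploads back to .fasta).
samtools faidx reference_sanitized_seqIDs.fasta
bcftools index medaka.annotated.vcf.gz
bcftools consensus -f reference_sanitized_seqIDs.fasta -i 'FILTER="PASS"' \
    medaka.annotated.vcf.gz > rebuilt_consensus.fasta
# Compare rebuilt_consensus.fasta with medaka.consensus.fasta using your
# preferred alignment or diff tool.
```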

arecto0 commented 1 month ago

[screenshot: igv_wfamp2]

This is the problem I referred to: when I run wf-amplicon again, using the consensus that was generated in the first run as the reference, and then look at the BAM file, I see that at some positions the consensus does not reflect what is in the reads.

julibeg commented 1 month ago

Hmm, hard to tell what's going on there without the input data or the generated BAM. How big is your BAM? The size limit for files on GitHub is 25 MB. You could try subsampling the BAM with samtools view to get it below that size and upload it instead.
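Something along these lines should work (a sketch; the BAM file name is a placeholder and the fraction will need tuning to hit the size limit):

```bash
# Keep roughly 10% of the reads; adjust -s (seed.fraction) as needed.
samtools view -b -s 0.1 aligned.sorted.bam > aligned.subsampled.bam
samtools index aligned.subsampled.bam
```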

julibeg commented 1 month ago

In the VCF there are actually variants in this region, but they didn't have enough depth to make it past the low-depth filter (the threshold is 20 by default).
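(As an aside, such filtered records can be listed directly from the workflow's VCF, e.g. with bcftools, assuming it is installed:)

```bash
# List all variant records flagged by the depth filter.
bcftools view -H -i 'FILTER="LOW_DEPTH"' medaka.annotated.vcf.gz
```

The records in this region are: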

```
HRSVB_MZ515836_CG_  5300    .   C   T   43.644  LOW_DEPTH   AR=0,1;DP=19;DPS=3,16;DPSP=19;SC=683,3648,710,3783;SR=0,0,3,15  GT:GQ   1:44
HRSVB_MZ515836_CG_  5322    .   T   G   55.715  LOW_DEPTH   AR=0,0;DP=19;DPS=3,16;DPSP=19;SC=597,3201,624,3345;SR=0,0,3,16  GT:GQ   1:56
HRSVB_MZ515836_CG_  5325    .   C   T   40.594  LOW_DEPTH   AR=0,0;DP=19;DPS=3,16;DPSP=19;SC=592,3181,619,3325;SR=0,0,3,16  GT:GQ   1:41
HRSVB_MZ515836_CG_  5327    .   C   T   44.233  LOW_DEPTH   AR=0,0;DP=19;DPS=3,16;DPSP=19;SC=607,3261,634,3405;SR=0,0,3,16  GT:GQ   1:44
HRSVB_MZ515836_CG_  5335    .   C   T   43.265  LOW_DEPTH   AR=0,0;DP=19;DPS=3,16;DPSP=19;SC=602,3266,629,3410;SR=0,0,3,16  GT:GQ   1:43
```

arecto0 commented 1 month ago

aligned.sorted.zip

It worked by zipping; I should have tried that earlier...

julibeg commented 1 month ago

Thanks; is this the BAM from the original run (with the original reference)? If not, could you please share that as well?

arecto0 commented 1 month ago

This is the BAM I just uploaded. I will also upload the BAM from the second run, with the consensus as reference. aligned.sorted.zip

julibeg commented 1 month ago

Ah OK, gotcha. I'm fairly confident now that the pre-Medaka downsampling is the cause of the issue.

Some background info: Medaka's performance degrades at coverage above roughly 400x, so we downsample the BAMs before variant calling. The workflow does this by simply selecting the 150 longest reads per strand for each reference sequence (this number can be changed with --medaka_target_depth_per_strand). Since the workflow has not been made with tiling schemes in mind, it is a fair assumption that most reads (or at least the longest ones, in the case of RBK data) cover most of the respective amplicon. However, since you are running tiled amplicons and each read only covers a fraction of the reference, not enough reads mapping to a particular region are selected. So even though the right variants are called, they are flagged by the depth filter and not included in the consensus.
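To see this concretely, you could report the per-base depth of the shared BAM around the affected positions and compare it with the DP values in the VCF records above and the default threshold of 20 (a quick check outside the workflow; the BAM file name is assumed from your zip, the contig name and positions are taken from the VCF excerpt):

```bash
# Per-base depth around the variants at ~5300-5335 on the affected contig.
samtools depth -a -r 'HRSVB_MZ515836_CG_:5290-5340' aligned.sorted.bam
```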

There are two things you can do to get around this:

  1. Set --medaka_target_depth_per_strand to a very high value that makes sure no reads are dropped during that filtering step. This is the easier approach, but I would certainly not recommend it, as it (i) suffers from all the usual tiled amplicon problems (like potentially very uneven coverage etc.) and (ii) might drastically exceed the recommended coverage for Medaka in some regions.
  2. Make a reference that contains the expected sequence of each of your amplicons (i.e. in your case this would be a FASTA file with 4 sequences; see the sketch after this list). wf-amplicon will do the variant calling (and pre-Medaka filtering) for each reference sequence individually, so the filtering shouldn't be an issue. You would still need to think about how to deal with conflicting variant calls from the different amplicons in the overlapping regions, though (if they arise).
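A minimal sketch of option 2 (file names below are placeholders, not taken from this thread):

```bash
# Hypothetical example: build a reference with one expected sequence per
# amplicon and run the workflow against it.
cat amplicon1.fasta amplicon2.fasta amplicon3.fasta amplicon4.fasta \
    > amplicons_ref.fasta

nextflow run epi2me-labs/wf-amplicon \
    --fastq /path/to/fastq_pass \
    --reference amplicons_ref.fasta \
    --out_dir per_amplicon_run
```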

We will update the docs to make it more explicit that tiled amplicons are not supported.

arecto0 commented 1 month ago

Thank you for the suggestions! I have tried both options and both seem to work fine. Option 1 is the easiest since it immediately results in a complete genome, so I prefer to work with that. I have now set the Medaka target depth per strand to 400, and this seems to do the trick. But I assume this will depend on the sequencing depth (mean coverage) of the run, which was relatively low in this case. So for runs with a higher mean coverage I would have to set a higher target depth, is that correct?