Is it possible to download a fasta of references which align to your generated consensus?

epi2me-labs / wf-amplicon

Other

16 stars 5 forks source link

Is it possible to download a fasta of references which align to your generated consensus? #13

Closed Marc-Higgins closed 4 months ago

Marc-Higgins commented 4 months ago

Hi folks,

I am using wf-amplicon to extract consensus sequences which may align to one of many references. The report is very useful in that I can see which reference sequences are present in my sample, and at what proportion. Is there a way to extract this information programmatically from the outputted files?

Ideally, I'd like to be able to pull the consensus sequence(s) generated for each present sample, and also the proportion of reads mapping to each amplicon in my sample.

Thanks! Marc.

julibeg commented 4 months ago

Hi Marc!

The consensus sequences and BAM files for each sample are copied to the output directory of the workflow. Please let us know if you have any other questions.

Marc-Higgins commented 3 months ago

Hi @julibeg - sorry for the delay here.

In my consensus file e.g.:

output/barcode<xx>/consensus/medaka.consensus.fasta

file, the reported consensus sequences are simply just the inputted references combined into one file.

For example I inputted 12 reference sequences, and in a particular barcode the wf-amplicon noted presence of 2 distinct amplicons. However in the medaka.consensus.fasta file for this barcode - all 12 reference sequences are present. Is this an expected feature? Ideally, I would like to be able to view the generated consensus sequences (not just the references), with variants or indels as determined by the workflow. Thank you.

julibeg commented 3 months ago

The workflow per default assumes that all reference sequences are present in each barcode / sample. You can change this behaviour by providing a ref column in the sample sheet. Please see the second question in the FAQs for details.

I realise that this is counter-intuitive and we might change this in the future to only emit a consensus for reference sequences that had more than a certain number of reads mapped to them.

Marc-Higgins commented 3 months ago

Thanks for the clarification @julibeg.

Why are the entries in the fasta file within the "consensus" folder without any variants, even when variants are reported in the emitted VCF? Thanks again for your assistance.

julibeg commented 3 months ago

Why are the entries in the fasta file within the "consensus" folder without any variants, even when variants are reported in the emitted VCF?

Could it be that these variants didn't pass the depth filter threshold? They would show up under the "Low depth" tab in the Variants section of the report (and in the VCF they would have LOW_DEPTH in the VCFs FILTER column.

If these variants have PASS in the FILTER column and are still missing from the consensus, then there might be a bug in the workflow. In this case could you please provide me with

the reference you used
the params.json file found in output directory of this run
the outputs (consensus, VCF, BAM) for a barcode/sample for which you observed this

Thanks!

Marc-Higgins commented 3 months ago

@julibeg - they are low LOW_DEPTH denoted in the filter column. This is clarified - thank you!