[Enhancement]: output consensus fasta and mapped bam files

JWDebler commented 1 year ago

What happened?

I would like to get the consensus sequence as a fasta file and might as well output the bam file since it is generated already. The html report could also be enhanced by showing the reference sequence and highlighting any variants.

Cheers

Operating System

Windows 10

Workflow Execution

Command line

Workflow Execution - EPI2ME Labs Versions

No response

Workflow Execution - CLI Execution Profile

Docker

Workflow Version

epi2me-labs/wf-amplicon v0.1.2-g3c36817

Relevant log output

nothing

julibeg commented 1 year ago

Hi there!

I would like to get the consensus sequence as a fasta file and might as well output the bam file since it is generated already.

This has been added in v0.1.3.

The html report could also be enhanced by showing the reference sequence and highlighting any variants.

Thanks for the feedback! Could you please give a few more details on what this would ideally look like / what you would use it for? Specifically, would you need an interactive, zoomable "genome-browser-style" view (potentially also showing the pileup) or would a more static plot simply showing where the variants occur be enough?

JWDebler commented 1 year ago

Hi Juli, well, if I could wish for it I'd like to be able to not just supply a reference sequence, but also a reference annotation (in GFF) and the output report would display the mapping overlayed with the annotation and then a track showing potential variations. Yes, a genome browser style visualisation (with annotation) would be great! Something like this:

Cheers

julibeg commented 1 year ago

Hi @JWDebler! We looked into the genome browser request and explored a couple options, but since users can have many samples as well as multiple references of arbitrary length / complexity of annotation, including something like this in the report runs the risk of dramatically blowing up the size of the HTML and making it unresponsive. We therefore won't pursue this further for now, sorry. If users need to perform such downstream analysis, we recommend using dedicated softwared like IGV or similar.

warthmann commented 1 year ago

the workflow could, however, provide the option to output joint bam and vcf files, with all samples and/or all targets in one bam/vcf. this would make loading it into IGV more straight forward. Also, I note that somewhere along the way the workflow renames chromosome names. In the example data provided, the reference sequence name contains ":" which become "_" in the #CHROM column of the vcf produced by Medaka. This has the consequence that IGV will not recognise them as same (the reference sequence name and the chromosome where the variants have been found) and will display "no variants found".
Another issue I encounter is that medaka apparently isn't using the sample sheet, but labels the sample in the vcf generically with "SAMPLE". This makes it difficult to compare samples in IGV as the all have the same name.

julibeg commented 1 year ago

Thanks for the feedback, @warthmann! We will address these points in the near future.

julibeg commented 1 year ago

v0.3.0 introduced the following changes to make the output play nicer with IGV:

the workflow could, however, provide the option to output joint bam and vcf files

The option --combine_results has been added to do this.

Also, I note that somewhere along the way the workflow renames chromosome names.

Mosdepth and Medaka cannot deal with : or * in the sequence names, which is why we sanitise them. The workflow now also releases the reference file with the sanitized seq. names so that this can be used when loading data into IGV.

medaka apparently isn't using the sample sheet

This has been fixed.

warthmann commented 1 year ago

I tested v0.3.0 and the new features worked nicely with my files! thanks!

epi2me-labs / wf-amplicon