epi2me-labs / wf-amplicon


General questions regarding workflow report metrics (outputs) and variant calling #18

Closed SAN-AU closed 1 week ago

SAN-AU commented 1 week ago

Ask away!

Using the desktop application, I've completed several analyses of multi-gene amplicon samples mapped to reference sequences in variant calling mode. This has worked well, but I have a couple of general questions about the metrics in the workflow report and the variant calling process. The answers would give me better context for interpreting the outputs (as a newcomer).

1. Workflow report: What does the 'Mean Acc.' value represent? Assuming it stands for mean accuracy, is this related to the accuracy of the read alignment to the reference?

2. Variant calling: I would like clarity/confirmation of the process (as I understand it), if you wouldn't mind.

i. From the specified number of reads selected for initial alignment, only 150 of these reads (by default) are then used for variant calling, but this translates to 300x read depth taking into account forward and reverse strands?

ii. And for variants to 'pass' and be incorporated into the consensus, a read depth of only 20 (by default), at that position, is required?

iii. Is there no minimum quality score that a variant must reach to 'pass'? Pardon my ignorance, but would it be worthwhile to include an option for filtering out low-quality variants to improve robustness?

Thank you! Appreciate the guidance.

julibeg commented 1 week ago

Hi @SAN-AU,

  1. "Mean Acc." is the mean alignment accuracy of reads from the sample in question mapping against the respective amplicon.
  2. i. Yes, we aim for a coverage of 300X before running Medaka.
     ii. Not all reads cover the whole amplicon (especially not when an RBK kit was used) and thus coverage is not guaranteed to be uniformly 300X everywhere. The LOW_DEPTH filter is mainly there to indicate that a variant might have been called at very low coverage (e.g. at the end of the amplicon).
     iii. Medaka uses a neural network to decide whether something is a variant or not and does not output "suboptimal" bases. Therefore, filtering on quality won't improve F1 in the vast majority of cases, but rather will just reduce recall. However, if users want to filter their VCFs after running the workflow, they can of course do so.
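For anyone who does want to filter their VCF after the workflow has finished, here is a minimal sketch in plain Python. It is not part of the workflow. It relies only on the standard VCF column layout (QUAL is the 6th column, FILTER the 7th). The function name `filter_vcf_lines`, the `min_qual` threshold of 20, and the choice to drop LOW_DEPTH records are illustrative assumptions rather than recommended settings. As noted above, a quality cutoff will mostly reduce recall.

```python
from typing import FrozenSet, Iterable, Iterator


def filter_vcf_lines(
    lines: Iterable[str],
    min_qual: float = 20.0,  # hypothetical threshold, not a workflow default
    drop_filters: FrozenSet[str] = frozenset({"LOW_DEPTH"}),
) -> Iterator[str]:
    """Yield header lines unchanged and only the records that pass both checks."""
    for line in lines:
        # Header and meta lines start with '#'; always keep them.
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip malformed records (VCF has 8 mandatory columns)
        qual, filt = fields[5], fields[6]
        # A missing QUAL is written as '.'; such records are kept here.
        if qual != "." and float(qual) < min_qual:
            continue
        # FILTER may hold several semicolon-separated labels.
        if drop_filters.intersection(filt.split(";")):
            continue
        yield line
```

To apply it to a file, pass an open handle as `lines` and write the yielded lines to a new file. Compressed `.vcf.gz` outputs would need to be opened with `gzip.open(path, "rt")` first.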

I hope this was helpful. Please let us know if you have any other questions!

SAN-AU commented 1 week ago

Thank you for clarifying those aspects - appreciate it!

julibeg commented 1 week ago

Happy to help!