epi2me-labs / wf-basecalling

Average Qscore #28

Closed valery-shap closed 3 months ago

valery-shap commented 6 months ago

Hello,

Thank you for a useful workflow! I have one question. Previously, with guppy, a sequencing report was generated, and other tools used it to extract all the statistics. I found a way to generate this file using Dorado and some additional steps, but it would be more convenient to use the wf-basecalling workflow. Could you please advise how statistics (for example, the average quality of reads) can be extracted? I can only see a "Read quality" graph in the file "wf_basecalling_report.html", with an option to download the graph, not a table.

Many thanks, Valery

cjw85 commented 5 months ago

There is a file produced by the workflow that contains per-read statistics; it is used to generate the report file. The workflow doesn't appear to publish this as a final output, though. We can change that.

valery-shap commented 5 months ago

Thank you very much for the reply. That would be great!

The alternative is a Dorado-only route, which (if I understood correctly) is: convert the fast5 files to pod5 files, run Dorado with bam output, run "dorado summary" (it works only with bam files) to get a guppy-like "sequencing_summary" report, convert the bam files to fastq files using samtools, and finally extract the qscores from the sequencing_summary file.

Does the workflow use the same route to get statistics information? So, is the per-read qscore value in the report generated by the workflow the same as the value in the "mean_qscore_template" column of the output table of dorado summary?
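For anyone comparing the two sources, here is a minimal sketch of reading per-read values from a sequencing summary table. It assumes the file is tab-separated with a header row and has a "read_id" column next to "mean_qscore_template"; the "sequence_length_template" column and the two example rows are made up for illustration only.

```python
import csv
import io


def read_mean_qscores(handle, column="mean_qscore_template"):
    """Return {read_id: mean qscore} from a tab-separated summary table.

    Assumption: the table has a header row with "read_id" and `column`.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["read_id"]: float(row[column]) for row in reader}


# Made-up example rows, not real output from any tool.
example = io.StringIO(
    "read_id\tsequence_length_template\tmean_qscore_template\n"
    "read-a\t1200\t14.2\n"
    "read-b\t800\t9.7\n"
)
scores = read_mean_qscores(example)
for read_id, qscore in scores.items():
    print(read_id, qscore)
```

With a real file, the same function could be called on `open("sequencing_summary.txt", newline="")`, where the file name is again only a placeholder.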

Also, I realized that estimating the average Qscore with, for example, seqkit after basecalling is not right, because, as stated in https://github.com/shenwei356/seqkit/issues/328: "you can’t just do simple arithmetic mean of all the qscores, because it won’t be a representation of the mean error rate then." See also https://community.nanoporetech.com/posts/what-is-the-base-value-for
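To make the difference concrete, here is a small sketch of the usual error-rate-based calculation: each Phred score is converted to an error probability, the probabilities are averaged, and the mean is converted back to a Phred score. The function name and the example values are only illustrative, and this is not necessarily the exact code any of the tools mentioned here use.

```python
import math


def mean_qscore(quals):
    """Mean quality computed through the mean error probability."""
    if not quals:
        raise ValueError("need at least one quality value")
    # Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)
    mean_error = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_error)


quals = [10, 20, 30, 40]               # illustrative per-base scores
arithmetic = sum(quals) / len(quals)   # simple arithmetic mean: 25.0
error_based = mean_qscore(quals)       # roughly 15.6
print(f"arithmetic mean: {arithmetic:.2f}")
print(f"error-rate based mean: {error_based:.2f}")
```

The low-quality bases dominate the mean error rate, so the error-based value is always at or below the arithmetic mean of the scores.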

Sorry for the long explanation, but it seems important to clarify the basic definitions and to be sure that we are discussing the same values.

Many thanks, Valery

cjw85 commented 4 months ago

Does the workflow use the same route to get statistics information?

No. The workflow has functionality that pre-existed the dorado summary command.

So, is the per-read qscore value in the report generated by the workflow the same as the value in the "mean_qscore_template" column of the output table of dorado summary?

These numbers are similar. dorado trims an arbitrary 60 bases from the front of reads when calculating a mean quality score. The program the workflow uses to create this statistic does not apply such trimming.
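As a rough illustration of why the two values can differ, the sketch below computes an error-rate-based mean quality with and without dropping the first 60 bases. It is not the code of dorado or of the workflow; the function, the synthetic read, and the choice to leave reads of 60 bases or fewer untrimmed are all assumptions made for the example.

```python
import math


def mean_qscore(quals, trim_front=0):
    """Error-rate-based mean quality, optionally skipping leading bases."""
    # Assumption: reads no longer than trim_front are left untrimmed.
    if len(quals) > trim_front:
        quals = quals[trim_front:]
    mean_error = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_error)


# Synthetic read: 60 low-quality bases followed by 940 bases at Q20.
read_quals = [5] * 60 + [20] * 940

untrimmed = mean_qscore(read_quals)               # no trimming
trimmed = mean_qscore(read_quals, trim_front=60)  # first 60 bases dropped
print(f"untrimmed: {untrimmed:.2f}, trimmed: {trimmed:.2f}")
```

When the start of a read is of lower quality than the rest, the trimmed value comes out higher; for reads with uniform quality the two values agree.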

valery-shap commented 3 months ago

Hi,

Thank you for your explanation. It would be very useful if a table with Q scores were added.

Best regards, Valery