BCCDC-PHL / alignment-variants

Pipeline to perform alignment & variant calling on whole-genome sequence data

Improve Qualimap genome report parsing #32

Closed dfornika closed 1 month ago

dfornika commented 1 month ago

We use a Python script to convert the Qualimap "genome results" report from its original format to CSV, which makes it easier to collect and compare metrics across multiple samples.

The original report format looks like this:

BamQC report
-----------------------------------

>>>>>>> Input

     bam file = SAMPLE-1_short.bam
     outfile = SAMPLE-1_short_bamqc/genome_results.txt

>>>>>>> Reference

     number of bases = 6,008,759 bp
     number of contigs = 11

>>>>>>> Globals

     number of windows = 410

     number of reads = 20,118,038
     number of mapped reads = 20,096,173 (99.89%)
     number of supplementary alignments = 335,071 (1.67%)
     number of secondary alignments = 0

     number of mapped paired reads (first in pair) = 10,056,841

...etc...

...but our current parsing script misses a few metrics that may be relevant, such as the number of mapped bases. We should update the script to collect more of the relevant metrics.
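As a rough sketch of what more general metric parsing could look like: the values in the excerpt above come in a few shapes (plain counts, counts with a `bp` suffix, counts followed by a percentage, and free text such as file names). The function below handles those shapes generically instead of listing metric names one by one. The name `parse_metric_line` and the returned `(key, number, percent)` tuple are assumptions made for illustration, not part of the existing script.

```python
import re

# Matches values such as "410", "6,008,759 bp" and "20,096,173 (99.89%)".
_VALUE_RE = re.compile(
    r"^(?P<num>\d[\d,]*(?:\.\d+)?)"          # count, with optional thousands separators
    r"(?:\s*bp)?"                            # optional "bp" unit suffix
    r"(?:\s*\((?P<pct>\d+(?:\.\d+)?)%\))?$"  # optional "(99.89%)" percentage
)


def parse_metric_line(line):
    """Split a 'key = value' report line into (key, number, percent).

    Returns None for lines that are not metrics (headings, rules, blanks).
    Non-numeric values (e.g. file names) are returned unchanged as strings,
    with percent set to None.
    """
    if "=" not in line:
        return None
    key, _, raw = line.partition("=")
    key, raw = key.strip(), raw.strip()
    match = _VALUE_RE.match(raw)
    if match is None:
        return key, raw, None
    num_text = match.group("num").replace(",", "")
    number = float(num_text) if "." in num_text else int(num_text)
    pct = match.group("pct")
    return key, number, (float(pct) if pct is not None else None)
```

Because nothing here depends on a fixed list of metric names, any additional `key = value` line in the report would be picked up without further changes, provided its value follows one of the shapes above.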

dfornika commented 1 month ago

It may also make sense to parse the report into JSON format, since it's already divided into sections (Input, Reference, Globals, etc.).
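A minimal sketch of that idea, assuming section headings always start with `>>>>>>>` and metrics are always written as `key = value` (both true of the excerpt above, but not checked against a full report). The function name `parse_genome_results` is hypothetical, and values are kept as raw strings here rather than converted to numbers.

```python
import json


def parse_genome_results(text):
    """Group 'key = value' lines under their '>>>>>>> Section' heading."""
    report = {}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">>>>>>>"):
            # Start a new section, e.g. ">>>>>>> Reference" -> "Reference".
            section = stripped.lstrip(">").strip()
            report[section] = {}
        elif "=" in stripped and section is not None:
            # Split on the first "=" only; values are stored as raw strings.
            key, _, value = stripped.partition("=")
            report[section][key.strip()] = value.strip()
    return report


# Excerpt copied from the report shown earlier in this issue.
example = """\
BamQC report
-----------------------------------

>>>>>>> Input

     bam file = SAMPLE-1_short.bam
     outfile = SAMPLE-1_short_bamqc/genome_results.txt

>>>>>>> Reference

     number of bases = 6,008,759 bp
     number of contigs = 11

>>>>>>> Globals

     number of windows = 410
     number of reads = 20,118,038
"""

print(json.dumps(parse_genome_results(example), indent=2))
```

Lines that appear before the first section heading (the report title and the dashed rule) are ignored, so they never end up under a section.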

If we do that, it may make sense to rewrite the CSV-generating script so that it parses the JSON output instead.
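One possible shape for the CSV step, sketched under the assumption that each sample's JSON is a two-level object (section name, then metric name, then value). That layout, the function names `flatten_report` and `reports_to_csv`, and the `section_metric` column naming are all illustrative choices, not something the pipeline defines.

```python
import csv
import io


def flatten_report(report):
    """Turn {section: {metric: value}} into one flat row keyed by 'section_metric'."""
    row = {}
    for section, metrics in report.items():
        for key, value in metrics.items():
            column = f"{section}_{key}".lower().replace(" ", "_")
            row[column] = value
    return row


def reports_to_csv(reports):
    """Write one CSV row per report; metrics missing from a report are left empty."""
    rows = [flatten_report(report) for report in reports]
    # Collect column names in first-seen order across all samples.
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Prefixing each column with its section name keeps metrics distinct if the same metric name ever appears in more than one section.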