broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0

Make polishing stats machine-readable #148

Open kim-fehl opened 2 years ago

kim-fehl commented 2 years ago

I polish several similar assemblies and would like to aggregate their polishing stats easily. At the moment the stats are available only in human-readable form in a well-described log file, which is complicated to parse. Would it be possible to produce a machine-readable TSV or JSON file (which could be parsed by a future version of MultiQC...)?

Here is the code I currently use for the ideal case where each assembly has a single contig (written for a Snakemake shell directive, hence the doubled braces):

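(
    # Header row: one column per extracted statistic
    printf "SampleID\tPercConfirmedBases\tCoverage\tCorrectedSNPs\tCorrectedAmbiguousBases"
    printf "\tCorrectedSmallInsertions\tCorrectedSmallDeletions\tFixedLocalBreaks\tFixedGaps\n"
    # One process substitution per column; parallel -k keeps the values in input order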
    paste  <( echo {SAMPLES_STR} | tr ' ' '\n' ) \
           <( parallel -k "grep -oP 'Confirmed.*\(\K([0-9.]+)' {{}}" ::: {input} ) \
           <( parallel -k "grep -oP 'Mean total coverage: \K([0-9.]+)' {{}} | sed 's/$/x/g'" ::: {input} ) \
           <( parallel -k "grep -oP '([0-9]+)(?= snps)' {{}}" ::: {input} ) \
           <( parallel -k "grep -oP '([0-9]+)(?= ambiguous bases)' {{}}" ::: {input} ) \
           <( parallel -k "grep -oP '([0-9]+)(?= small insertions)' {{}}" ::: {input} ) \
           <( parallel -k "grep -oP '([0-9]+)(?= small deletions)' {{}}" ::: {input} ) \
           <( parallel -k "grep '^fix break' {{}} | wc -l" ::: {input} ) \
           <( parallel -k "grep '^fix gap' {{}} | wc -l" ::: {input} )
) > {output}
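
For use outside Snakemake, the same extraction can be sketched as a plain bash function that makes a single awk pass over each log instead of one `grep` per column. This is only a sketch: `pilon_header` and `pilon_log_row` are hypothetical helper names, and the function assumes the same log phrases that the grep patterns above match and, as above, one contig per log.

```shell
#!/usr/bin/env bash
# Sketch only: pilon_header and pilon_log_row are hypothetical helpers, not part of Pilon.
# Assumes the log phrases matched by the grep patterns above and one contig per log.

pilon_header() {
    printf 'SampleID\tPercConfirmedBases\tCoverage\tCorrectedSNPs\tCorrectedAmbiguousBases'
    printf '\tCorrectedSmallInsertions\tCorrectedSmallDeletions\tFixedLocalBreaks\tFixedGaps\n'
}

# Usage: pilon_log_row SAMPLE_ID LOG_FILE   (prints one TSV row)
pilon_log_row() {
    local sample=$1 log=$2
    awk -v sample="$sample" '
        # Return the integer token directly before " <label>", or "" if absent.
        function count_before(line, label,    pos, n, parts) {
            pos = index(line, " " label)
            if (pos == 0) return ""
            n = split(substr(line, 1, pos - 1), parts, " ")
            return (n > 0 && parts[n] ~ /^[0-9]+$/) ? parts[n] : ""
        }
        # Percentage inside the last parenthesis of the "Confirmed" line.
        /Confirmed.*\(/ && conf == "" {
            s = $0
            sub(/.*\(/, "", s)
            if (match(s, /^[0-9.]+/)) conf = substr(s, 1, RLENGTH)
        }
        {
            # Mean coverage, with an "x" suffix as in the snippet above.
            pos = index($0, "Mean total coverage: ")
            if (pos > 0 && cov == "") {
                s = substr($0, pos + length("Mean total coverage: "))
                if (match(s, /^[0-9.]+/)) cov = substr(s, 1, RLENGTH) "x"
            }
            # Correction counts; the first occurrence of each phrase wins.
            v = count_before($0, "snps");             if (v != "" && snps == "") snps = v
            v = count_before($0, "ambiguous bases");  if (v != "" && amb  == "") amb  = v
            v = count_before($0, "small insertions"); if (v != "" && ins  == "") ins  = v
            v = count_before($0, "small deletions");  if (v != "" && dels == "") dels = v
        }
        # Count fix lines, like grep piped into wc -l.
        /^fix break/ { breaks++ }
        /^fix gap/   { gaps++ }
        END {
            printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%d\t%d\n",
                   sample, conf, cov, snps, amb, ins, dels, breaks, gaps
        }
    ' "$log"
}
```

Calling `pilon_header` once and then `pilon_log_row "$sample" "$log"` for each sample/log pair, with the combined output redirected to a file, yields the same table layout as the snippet above. A statistic that is missing from a log is left as an empty field rather than shifting the columns.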