ga4gh / quality-control-wgs

Home for the GA4GH Quality Control of Whole Genome Sequencing metrics and reference implementations
https://www.ga4gh.org/product/wgs-quality-control-standards/
Apache License 2.0

improved `mean_insert_size` metric definition #14

Closed nicolas-bertin closed 8 months ago

nicolas-bertin commented 1 year ago

Proposal to improve the `mean_insert_size` metric definition from

https://github.com/ga4gh/quality-control-wgs/blob/e692682078a3f47b8160cc1ef74614227264b847/metrics_definitions/metrics_definitions.md?plain=1#L83-L88

to

https://github.com/ga4gh/quality-control-wgs/blob/bddde198a26f1311a0188f8792b71d6fe704949d/metrics_definitions/metrics_definitions.md?plain=1#L59-L65

see #7

mhebrard commented 1 year ago

https://github.com/ga4gh/quality-control-wgs/blob/bddde198a26f1311a0188f8792b71d6fe704949d/metrics_definitions/metrics_definitions.md?plain=1#L59-L65

justinjj24 commented 1 year ago

When computing the modified `mean_insert_size` metric with `samtools stats`, removing duplicate reads appears to make only a marginal difference on good-quality 30x data.

However, removing duplicate reads means running `samtools stats` twice (with and without duplicates) for the metrics `pct_reads_mapped` and `pct_reads_properly_paired`: heavy compute for marginal gain. There is also a concern about poor-quality data with a higher number of duplicate reads.
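To make the comparison above concrete, here is a minimal Python sketch of how the two `samtools stats` runs could be compared. It assumes the summary ("SN") section of the output has been captured as text, once from a plain `samtools stats` run and once with `-d` (`--remove-dups`). The helper name `parse_sn` and the sample numbers are hypothetical and for illustration only; they are not measurements from this thread, nor the project's reference implementation.

```python
def parse_sn(stats_text):
    """Collect the tab-separated SN (summary numbers) lines into a dict."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue
        fields = line.split("\t")
        key = fields[1].rstrip(":")
        try:
            summary[key] = float(fields[2])
        except ValueError:
            # Skip any non-numeric summary value.
            continue
    return summary


# Hypothetical excerpts of `samtools stats in.bam` and `samtools stats -d in.bam`.
# The values are made up for illustration.
WITH_DUPS = (
    "SN\traw total sequences:\t1000\n"
    "SN\treads mapped:\t990\n"
    "SN\tinsert size average:\t398.2\n"
)
WITHOUT_DUPS = (
    "SN\traw total sequences:\t920\n"
    "SN\treads mapped:\t912\n"
    "SN\tinsert size average:\t397.9\n"
)

mean_with = parse_sn(WITH_DUPS)["insert size average"]
mean_without = parse_sn(WITHOUT_DUPS)["insert size average"]
delta = mean_with - mean_without
print(f"mean insert size, all reads:       {mean_with}")
print(f"mean insert size, duplicates removed: {mean_without}")
print(f"absolute difference:               {delta:.2f}")
```

On a real sample, the size of this difference relative to the cost of the second `samtools stats` pass is the trade-off raised above; a sample with a high duplication rate may behave differently from the good-quality 30x case.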

mhebrard commented 1 year ago

Summary of suggestions:

mhebrard commented 1 year ago

https://github.com/ga4gh/quality-control-wgs/blob/bddde198a26f1311a0188f8792b71d6fe704949d/metrics_definitions/metrics_definitions.md?plain=1#L59-L65