DKFZ-ODCF / AlignmentAndQCWorkflows

The DKFZ alignment workflow plugin originally developed at the eilslabs
https://github.com/DKFZ-ODCF/AlignmentAndQCWorkflows/wiki

Safer online QC statistics on the SAM output stream for post-mortem analysis of alignment (or other processes) #37

Open vinjana opened 6 years ago

vinjana commented 6 years ago

Motivation: If an alignment job (or any other job) takes exceptionally long to run, requires an exceptionally large amount of memory, or shows similar anomalies, it is time-consuming to identify the characteristics of the data that may cause the anomaly. The online statistics could be shown in OTP to point researchers to problems with their sample.

Goal: Compute certain QC statistics "online" on the output stream of the alignment (SAM, or VCF or whatever other format for other jobs) and save them at regular intervals. "Online" here means that the statistics should be written for individual chunks of reads in the stream, e.g. every 10e6 reads, and/or aggregated over all reads seen up to that moment. The online statistics need to be saved to disk repeatedly during processing and must not be deleted at the end. Currently, all statistics files remain empty during processing, and the QC scripts just dump their results at the end.
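To make the goal concrete, here is a minimal Python sketch of such a pass-through filter. It is not taken from the workflow: the function names, the JSON-lines output format, and the choice of counters (unmapped reads, duplicates, MAPQ sum) are all assumptions made for illustration. The only things it relies on from the SAM format are that header lines start with `@`, that FLAG is the second column (bit 0x4 = unmapped, bit 0x400 = duplicate), and that MAPQ is the fifth column.

```python
import json
import os


def new_counts():
    # Hypothetical set of counters; the real list of statistics is up for discussion.
    return {"reads": 0, "unmapped": 0, "duplicates": 0, "mapq_sum": 0}


def checkpoint(stats_file, chunk, total):
    """Append one JSON line and force it to disk so it survives a killed job."""
    record = {"chunk": dict(chunk), "cumulative": dict(total)}
    stats_file.write(json.dumps(record) + "\n")
    stats_file.flush()
    os.fsync(stats_file.fileno())


def online_sam_stats(sam_in, sam_out, stats_path, chunk_size=10_000_000):
    """Copy SAM lines from sam_in to sam_out unchanged and write
    per-chunk and cumulative statistics every chunk_size reads."""
    total = new_counts()
    chunk = new_counts()
    with open(stats_path, "a") as stats_file:
        for line in sam_in:
            sam_out.write(line)          # the stream itself is never modified
            if line.startswith("@"):     # SAM header line, not a read
                continue
            fields = line.split("\t", 5)
            flag = int(fields[1])
            mapq = int(fields[4])
            for counts in (chunk, total):
                counts["reads"] += 1
                counts["unmapped"] += bool(flag & 0x4)
                counts["duplicates"] += bool(flag & 0x400)
                counts["mapq_sum"] += mapq
            if chunk["reads"] == chunk_size:
                checkpoint(stats_file, chunk, total)
                chunk = new_counts()
        if chunk["reads"] > 0:           # final, possibly partial chunk
            checkpoint(stats_file, chunk, total)
    return total
```

In a pipe this could be called as `online_sam_stats(sys.stdin, sys.stdout, "lane.stats.jsonl")` (the file name is made up). Because the statistics file is only ever appended to and synced after every chunk, whatever was written before a crash or timeout stays available for post-mortem analysis.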

For the alignment and merging steps, the following statistics should be culled from the per-lane SAM stream at regular intervals

Any statistic is of interest that may relate to an exceptionally long runtime or to failing jobs during alignment or any of its follow-up processing steps.