Motivation: If an alignment job (or any other job) takes exceptionally long to run, requires an exceptionally large amount of memory, or shows similar anomalies, it is time-consuming to identify the characteristics of the data itself that may cause the anomaly. In OTP, the online statistics could be displayed to alert researchers to problems with their samples.
Goal: Compute certain QC statistics "on line" on the output stream of the job (SAM for the alignment, VCF or whatever applies for other jobs) and save them at regular intervals. "On line" here means that the statistics should be written for individual chunks of reads in the stream, e.g. every 10e6 reads, and/or aggregated over the full sample seen up to that moment. The online statistics need to be saved to disk repeatedly during processing and must not be deleted at the end. Currently, all statistics files are empty, and the QC scripts just dump their results at the end of processing.
For the alignment and merging steps, the following statistics should be collected from the per-lane SAM stream at regular intervals:
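The chunk-wise saving described in the goal could look roughly like the following. This is only a sketch under assumptions: the names `write_checkpoint` and `online_stats`, the JSON layout, and the default chunk size are made up for illustration and are not part of OTP or any existing QC script. The idea is to keep one counter per chunk and one cumulative counter, and to rewrite the statistics file atomically after every chunk so that a killed job still leaves a valid, non-empty file behind.

```python
import json
import os
import tempfile
from collections import Counter


def write_checkpoint(path, payload):
    """Write JSON atomically so a killed job never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def online_stats(records, update, out_path, chunk_size=1_000_000):
    """Stream over `records` and checkpoint after every `chunk_size` records.

    `update(counter, record)` is a caller-supplied function (hypothetical
    interface) that increments whatever counters are of interest.
    """
    chunk, total = Counter(), Counter()
    pending = 0          # records seen since the last checkpoint
    chunks_written = 0

    def flush():
        nonlocal chunk, pending, chunks_written
        total.update(chunk)  # add the chunk counts to the running totals
        chunks_written += 1
        write_checkpoint(out_path, {
            "chunks": chunks_written,
            "last_chunk": dict(chunk),
            "cumulative": dict(total),
        })
        chunk, pending = Counter(), 0

    for record in records:
        update(chunk, record)
        pending += 1
        if pending == chunk_size:
            flush()
    if pending:
        flush()  # final, possibly shorter chunk
    return dict(total)
```

In a real pipeline `records` would be the SAM stream read from a pipe, and the file at `out_path` would simply be left in place after the job ends.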
Insert-Size Statistics (min, max, Q1, Q2, Q3) (overly long fragments)
Fraction of read pairs aligned to different chromosomes
Soft-Clipping Rate (shorter reads are ambiguous to align)
Length distribution of remaining soft-clipped reads (shorter reads are ambiguous to align)
Number+Proportion of FF, FR, RR read pairs
Number+Proportion of reads aligned with large gaps
Number+Proportion of reads aligned with Smith-Waterman secondary alignment step in BWA (slower; XT attribute)
Number+Proportion of unaligned reads
Distribution parameters of suboptimal hits in BWA (X1 attribute)
Distribution parameters of number of best hits (X0 attribute)
others? (please add!)
Any statistic is of interest that may relate to an exceptionally long runtime or to failing jobs during alignment or any of its follow-up processing steps.
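Several of the listed statistics can be derived from the FLAG, CIGAR, RNEXT and TLEN columns and the BWA-specific XT/X0 tags of each SAM line. The following is a hedged sketch of that idea, not an existing script: `LaneStats`, `update` and `quantile_from_hist` are hypothetical names, FR and RF pairs are not distinguished, only the first mate of each pair is counted for pair-level statistics, and insert sizes are kept as a histogram so that quartiles can be computed at any time without storing every value.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

SOFT_CLIP = re.compile(r"(\d+)S")  # soft-clip operations in a CIGAR string


@dataclass
class LaneStats:
    counts: Counter = field(default_factory=Counter)
    insert_sizes: Counter = field(default_factory=Counter)  # histogram of |TLEN|
    best_hits: Counter = field(default_factory=Counter)     # histogram of X0


def update(stats, line):
    """Update `stats` from one line of SAM text."""
    if line.startswith("@"):
        return  # header line
    f = line.rstrip("\n").split("\t")
    flag, rname, cigar, rnext, tlen = int(f[1]), f[2], f[5], f[6], int(f[8])
    if flag & 0x900:
        return  # skip secondary and supplementary records
    c = stats.counts
    c["reads"] += 1
    if flag & 0x4:
        c["unaligned"] += 1
        return
    clipped = sum(int(n) for n in SOFT_CLIP.findall(cigar))
    if clipped:
        c["soft_clipped_reads"] += 1
        c["soft_clipped_bases"] += clipped
    # Optional fields look like TAG:TYPE:VALUE, e.g. "XT:A:M" or "X0:i:3".
    tags = {t[:2]: t[5:] for t in f[11:]}
    if tags.get("XT") == "M":  # BWA: aligned via mate Smith-Waterman
        c["mate_sw"] += 1
    if "X0" in tags:           # BWA: number of best hits
        stats.best_hits[int(tags["X0"])] += 1
    # Pair-level statistics: count each pair once, via its first mate,
    # and only if the mate is mapped as well.
    is_first_mate = (flag & 0x1) and (flag & 0x40)
    if not is_first_mate or (flag & 0x8) or rnext == "*":
        return
    c["pairs"] += 1
    if rnext not in ("=", rname):
        c["pairs_diff_chrom"] += 1
        return
    read_rev, mate_rev = bool(flag & 0x10), bool(flag & 0x20)
    if read_rev != mate_rev:
        orientation = "FR"  # opposite strands (FR and RF not distinguished)
    else:
        orientation = "RR" if read_rev else "FF"
    c["orient_" + orientation] += 1
    if tlen != 0:
        stats.insert_sizes[abs(tlen)] += 1


def quantile_from_hist(hist, q):
    """Smallest value whose cumulative count reaches fraction `q` (nearest rank)."""
    total = sum(hist.values())
    if total == 0:
        return None
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= q * total:
            return value
```

Calling `quantile_from_hist(stats.insert_sizes, q)` with `q` set to 0.25, 0.5 and 0.75 gives Q1, Q2 and Q3, and `min(stats.insert_sizes)` and `max(stats.insert_sizes)` give the extremes once at least one pair has been seen. Proportions follow by dividing the individual counters by `counts["reads"]` or `counts["pairs"]`.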