cancerit / cgpPindel

Cancer Genome Project Insertion/Deletion detection pipeline based around Pindel
http://cancerit.github.io/cgpPindel/
GNU Affero General Public License v3.0

Checking for corrupt intermediates #79

Closed keiranmraine closed 5 years ago

keiranmraine commented 5 years ago

We have encountered some instances where a chunk of a chromosome is missing with some surrounding events having a change in contributing reads. We suspect this is due to on-write file corruption during the input generation step.

There are 2 things we can do to try and catch this:

  1. Check all generated files at the end of the input step for NUL characters. One approach is:
    • zgrep -alP '\x00' generated.txt.gz
    • outputs the filename if an issue is found
    • exit code = 1 when all is well (no NUL found), 0 when a NUL is present
    • can run this in parallel based on the number of cores used during input gen.
  2. Capture total bytes written to each file in read_to_disk:
    • On completion of writes, run gzip -dc generated.txt.gz | wc -c and compare the result with the captured total
    • can run this in parallel based on the number of cores used during input gen.
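
A rough sketch of how the two checks above could be run across all generated files in parallel. This is not cgpPindel code: scan_intermediates is a hypothetical helper, the *.txt.gz pattern is assumed from the paths shown in this issue, and NUL bytes are counted with tr instead of grep -P so that the sketch does not depend on PCRE support. It relies on find -print0 and xargs -0 -P, which GNU and BSD tools provide.

```shell
# Sketch only: scan_intermediates is a hypothetical helper, not part of cgpPindel.
# Usage: scan_intermediates <directory> [workers]
# Prints "<nul_count> <byte_count> <file>" for every *.txt.gz under <directory>,
# processing up to [workers] files at the same time (default 1).
scan_intermediates() {
  find "$1" -type f -name '*.txt.gz' -print0 |
    xargs -0 -n 1 -P "${2:-1}" sh -c '
      # Some xargs versions run the command once even with no input; skip that case.
      [ -n "${1:-}" ] || exit 0
      # tr -cd "\000" deletes everything except NUL bytes, so wc -c counts them.
      nuls=$(gzip -dc "$1" | tr -cd "\000" | wc -c | tr -d " ")
      # Total decompressed size, to compare with the bytes recorded at write time.
      bytes=$(gzip -dc "$1" | wc -c | tr -d " ")
      printf "%s %s %s\n" "$nuls" "$bytes" "$1"
    ' sh
}
```

Any line whose first column is not 0 points at a file containing NUL bytes; the second column is the value to compare against the total recorded in read_to_disk. This version decompresses each file twice, which keeps it simple at the cost of some CPU time.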

Can probably combine both into a single command with tee and named pipes to save reading and decompressing twice (although files are relatively small so likely to hit disk cache).
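
As an illustration of the tee plus named pipe idea, here is a minimal sketch that decompresses once and reports both numbers. count_bytes_and_nuls is a hypothetical name, and counting NULs with tr is a substitution for the zgrep check above, chosen because tr always reads the whole stream and so never closes the pipe early.

```shell
# Sketch only: count_bytes_and_nuls is a hypothetical helper, not part of cgpPindel.
# Usage: count_bytes_and_nuls <file.txt.gz>
# Prints "<byte_count> <nul_count>" after a single decompression pass.
count_bytes_and_nuls() {
  fifo_dir=$(mktemp -d)
  mkfifo "$fifo_dir/stream"
  # Background reader: keep only NUL bytes from the named pipe and count them.
  tr -cd '\000' < "$fifo_dir/stream" | wc -c | tr -d ' ' > "$fifo_dir/nul_count" &
  reader=$!
  # tee copies the decompressed stream into the named pipe while wc -c counts it.
  bytes=$(gzip -dc "$1" | tee "$fifo_dir/stream" | wc -c | tr -d ' ')
  # Make sure the reader has finished writing its count before reading it back.
  wait "$reader"
  nuls=$(cat "$fifo_dir/nul_count")
  rm -rf "$fifo_dir"
  printf '%s %s\n' "$bytes" "$nuls"
}
```

A non-zero second number means the file contains NUL bytes; the first number is the decompressed size to compare with the bytes recorded at write time.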

Timing info on a 300MB compressed file:

cgppipe@cgp-7-2-01: /usr/bin/time zgrep -alP '\x00' 2153_2750831/Pindel/tmpPindel/PD38334dn/1.txt.gz
Command exited with non-zero status 1 ### expected
14.37user 1.14system 0:11.29elapsed 137%CPU (0avgtext+0avgdata 6432maxresident)k
0inputs+8outputs (0major+4577minor)pagefaults 0swaps

cgppipe@cgp-7-2-01: /usr/bin/time bash -c 'gzip -dc 2153_2750831/Pindel/tmpPindel/PD38334dn/1.txt.gz | wc -c'
1305295654
10.55user 1.33system 0:11.26elapsed 105%CPU (0avgtext+0avgdata 5760maxresident)k
0inputs+0outputs (0major+1164minor)pagefaults 0swaps

keiranmraine commented 5 years ago

This would be a sneaky way to do it: a file containing a NUL character results in a changed byte count, because anything grep prints inside the process substitution goes down the same pipe that wc -c reads. Everything is hidden in a single command, with no need to handle a non-zero exit code or more complex test output:

$ bash -c 'gzip -cd 2153_2750831/Pindel/tmpPindel/PD38334dn/1.txt.gz | tee >(grep -alP "\x00" || true) | wc -c'
1305295654

The wrapping bash -c is not needed when using the PCAP command exec code, as it does this for us.
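
A possible way to use the combined command as a pass/fail check, assuming the number of bytes written was recorded when the file was generated. verify_intermediate is a hypothetical name, and the sketch assumes bash and a grep built with -P (PCRE) support are available, as in the command above.

```shell
# Sketch only: verify_intermediate is a hypothetical helper, not part of cgpPindel.
# Usage: verify_intermediate <file.txt.gz> <bytes recorded when the file was written>
# Assumes bash and a grep with -P (PCRE) support are installed.
verify_intermediate() {
  # Same pipeline as in the comment above, with the file passed in as $1.
  actual=$(bash -c 'gzip -cd "$1" | tee >(grep -alP "\x00" || true) | wc -c' _ "$1" | tr -d ' ')
  if [ "$actual" = "$2" ]; then
    return 0
  fi
  # A mismatch means either missing data or a NUL hit that changed the count.
  echo "byte count mismatch for $1: expected $2, got $actual" >&2
  return 1
}
```

A single comparison then covers both failure modes: a short file gives a smaller count, and a file containing NUL bytes gives a count altered by grep's output.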

The time output for this is comparable with just running gunzip -c file.txt.gz | wc -c:

$ /usr/bin/time bash -c 'gzip -cd 2153_2750831/Pindel/tmpPindel/PD38334dn/1.txt.gz | tee >(grep -alP "\x00" || true) | wc -c'
1305295654
10.81user 4.10system 0:11.35elapsed 131%CPU (0avgtext+0avgdata 5776maxresident)k
0inputs+0outputs (0major+1526minor)pagefaults 0swaps