Closed keiranmraine closed 5 years ago
This would be a sneaky way to do it: a file containing a NUL character changes the reported byte count, so the whole check is hidden in a single command, with no need to handle a non-zero exit status or more complex test output:
$ bash -c 'gzip -cd 2153_2750831/Pindel/tmpPindel/PD38334dn/1.txt.gz | tee >(grep -alP "\x00" || true) | wc -c'
1305295654
The wrapping bash -c is not needed when using the PCAP command exec code, as it does this for us.
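To see the trick in isolation, here is a minimal sketch on two tiny throwaway files. The file names, contents and the count_with_nul_check helper are made up for illustration, and it assumes bash plus a GNU grep built with -P support. When grep -l finds a match it prints its label into the same pipe that wc reads, so the count stops matching the real size. Only the mismatch should be relied on, not the size of the shift.

```shell
# Hypothetical helper wrapping the one-liner above; bash is needed for >( ).
count_with_nul_check() {
    bash -c 'gzip -cd "$1" | tee >(grep -alP "\x00" || true) | wc -c' _ "$1"
}

tmpdir=$(mktemp -d)

# Two 11-byte test inputs: one clean, one with a NUL in place of a tab.
printf 'chr1\t100\tA\n' | gzip -c > "$tmpdir/clean.txt.gz"
printf 'chr1\t100\0A\n' | gzip -c > "$tmpdir/corrupt.txt.gz"

clean_count=$(count_with_nul_check "$tmpdir/clean.txt.gz")
corrupt_count=$(count_with_nul_check "$tmpdir/corrupt.txt.gz")

# The clean file reports its true size; the corrupt one does not.
echo "clean: $clean_count corrupt: $corrupt_count"

rm -rf "$tmpdir"
```

The caller then only has to compare the reported count against the expected size, which is the same comparison it would make for a truncated file.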
time output for this, to make it comparable with just running gunzip -c file.txt.gz | wc -c:
$ /usr/bin/time bash -c 'gzip -cd 2153_2750831/Pindel/tmpPindel/PD38334dn/1.txt.gz | tee >(grep -alP "\x00" || true) | wc -c'
1305295654
10.81user 4.10system 0:11.35elapsed 131%CPU (0avgtext+0avgdata 5776maxresident)k
0inputs+0outputs (0major+1526minor)pagefaults 0swaps
We have encountered some instances where a chunk of a chromosome is missing with some surrounding events having a change in contributing reads. We suspect this is due to on-write file corruption during the input generation step.
There are 2 things we can do to try and catch this:
zgrep -alP '\x00' generated.txt.gz
gzip -dc generated.txt.gz | wc -c
We can probably combine both into a single command using tee and named pipes, to avoid reading and decompressing twice (although the files are relatively small, so the second read is likely to hit the disk cache).
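A rough sketch of what the named-pipe variant could look like, keeping the two results separate instead of folding them into one number. check_gz_once and the temporary file names are hypothetical, and it again assumes a GNU grep with -P support. grep -c is used rather than -l so that grep reads to the end of the stream and tee is never cut short by a closed pipe.

```shell
# Hypothetical helper: prints "<decompressed bytes> <lines containing NUL>".
check_gz_once() {
    gz_file=$1
    work_dir=$(mktemp -d) || return 2
    mkfifo "$work_dir/nul.fifo" || return 2

    # Reader: count lines containing a NUL byte, reading from the FIFO.
    grep -caP '\x00' < "$work_dir/nul.fifo" > "$work_dir/nul_lines" &
    reader_pid=$!

    # Writer: decompress once, copy the stream to the FIFO, count the bytes.
    byte_count=$(gzip -dc "$gz_file" | tee "$work_dir/nul.fifo" | wc -c | tr -d ' ')

    # grep exits 1 when nothing matched, which is the good case here.
    wait "$reader_pid" || true

    nul_lines=$(cat "$work_dir/nul_lines")
    rm -rf "$work_dir"
    echo "$byte_count $nul_lines"
}
```

Calling check_gz_once generated.txt.gz prints both values on one line, so a caller can fail on a non-zero second field and compare the first against the expected size.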
Timing info on 300MB compressed file: