Closed HenrikBengtsson closed 4 years ago
To speed this step up, it might be faster to produce md5 checksum files, e.g.
[cbctest2@n0 lg3-demo-2018-12-27-003]$ echo "$p"
output/LG3/trim/Z00599t10-trim/Z00599t10-trim_R1.fastq.gz
[cbctest2@n0 lg3-demo-2018-12-27-003]$ echo "${p/.gz/}.md5"
output/LG3/trim/Z00599t10-trim/Z00599t10-trim_R1.fastq.md5
[cbctest2@n0 lg3-demo-2018-12-27-003]$ zcat "$p" | md5sum --text > "${p/.gz/}.md5"
[cbctest2@n0 lg3-demo-2018-12-27-003]$ cat "${p/.gz/}.md5"
248ac10a2504fa8ba6a40697ff75a151 -
This will allow us to compare *.fastq.md5 files directly.
Now, using the develop branch:
[cbctest2@n17 lg3-demo-2018-12-27-002]$ LG3_TEST_TRUTH=../lg3-demo-2018-10-08/truth PATIENT=Patient157t10 lg3 test validate
Sourced: /home/henrik/repositories/UCSF-CostelloLab/LG3_Pipeline-develop/lg3.conf
*** Configuration
[OK] PROJECT=LG3
[OK] PATIENT=Patient157t10
[OK] CONV=patient_ID_conversions.tsv
[OK] LG3_TEST_TRUTH=../lg3-demo-2018-10-08/truth
*** Trimming of FASTQ Files
[OK] file tree ('output/LG3/trim/Z00*-trim')
[OK] file sizes ('output/LG3/trim/Z00*-trim/*')
[OK] file md5 checksums after gunzip ('output/LG3/trim/Z00*-trim/*.fastq.gz')
*** BWA Alignment of FASTQ Files
...
Using this new validation, I've confirmed that all of the following LG3 versions produced identical trimmed FASTQ files;
I've added the same MD5 checksum validation of the BWA aligned BAM files and the corresponding BAM index files. The result is the same; all of the above versions produce identical BAM and BAI files, e.g.
[cbctest2@n17 lg3-demo-2018-12-27-002]$ LG3_TEST_TRUTH=../lg3-demo-2020-05-16/truth PATIENT=Patient157t10 lg3 test validate
Sourced: /home/henrik/repositories/UCSF-CostelloLab/LG3_Pipeline-develop/lg3.conf
*** Configuration
[OK] PROJECT=LG3
[OK] PATIENT=Patient157t10
[OK] CONV=patient_ID_conversions.tsv
[OK] LG3_TEST_TRUTH=../lg3-demo-2020-05-16/truth
*** Trimming of FASTQ Files
[OK] file tree ('output/LG3/trim/Z00*-trim')
[OK] file sizes ('output/LG3/trim/Z00*-trim/*')
[OK] file md5 checksums (after gunzip) ('output/LG3/trim/Z00*-trim/*.fastq.gz')
*** BWA Alignment of FASTQ Files
[OK] file tree ('output/LG3/exomes')
[OK] file sizes ('output/LG3/exomes/Z00*/*')
[OK] file md5 checksums ('output/LG3/exomes/Z00*/*.bai')
[OK] file md5 checksums ('output/LG3/exomes/Z00*/*.bam')
...
The trim step produces
*.fastq.gz
files.lg3 test validate
compare the file sizes of these files. Unfortunately, we cannot compare their content usingmd5sum
because gzip embeds the file-time stamps in their headers. This means that two identical files will be different after gzip:ing only because of the file attributes.To compare, we need to gunzip, e.g.
By adding this validation, we can assert that the trim step produces identical results without any random component.