UCSF-Costello-Lab / LG3_Pipeline

The original LG3 pipeline
https://github.com/UCSF-Costello-Lab/LG3_Pipeline
0 stars 0 forks source link

lg3 test validate: Compare content of FASTQ and BWA BAM files #145

Closed HenrikBengtsson closed 4 years ago

HenrikBengtsson commented 4 years ago

The trim step produces *.fastq.gz files. lg3 test validate compare the file sizes of these files. Unfortunately, we cannot compare their content using md5sum because gzip embeds the file-time stamps in their headers. This means that two identical files will be different after gzip:ing only because of the file attributes.

To compare, we need to gunzip, e.g.

[cbctest2@n0 lg3-demo-2018-12-27-003]$ a=../lg3-demo-2018-12-27-002
[cbctest2@n0 lg3-demo-2018-12-27-003]$ b=../lg3-demo-2018-12-27-003
[cbctest2@n0 lg3-demo-2018-12-27-003]$ diff -b "$a/$f" "$b/$f"
Binary files ../lg3-demo-2018-12-27-002/output/LG3/trim/Z00599t10-trim/Z00599t10-trim_R1.fastq.gz and ../lg3-demo-2018-12-27-003/output/LG3/trim/Z00599t10-trim/Z00599t10-trim_R1.fastq.gz differ
[cbctest2@n0 lg3-demo-2018-12-27-003]$ diff <(zcat "$a/$f") <(zcat "$b/$f")
[cbctest2@n0 lg3-demo-2018-12-27-003]$ 

By adding this validation, we can assert that the trim step produces identical results without any random component.

HenrikBengtsson commented 4 years ago

To speed this step up, it might be faster to produce md5 checksum files, e.g.

[cbctest2@n0 lg3-demo-2018-12-27-003]$ echo "$p"
output/LG3/trim/Z00599t10-trim/Z00599t10-trim_R1.fastq.gz
[cbctest2@n0 lg3-demo-2018-12-27-003]$ echo "${p/.gz/}.md5"
output/LG3/trim/Z00599t10-trim/Z00599t10-trim_R1.fastq.md5
[cbctest2@n0 lg3-demo-2018-12-27-003]$ zcat "$p" | md5sum --text > "${p/.gz/}.md5"
[cbctest2@n0 lg3-demo-2018-12-27-003]$ cat "${p/.gz/}.md5"
248ac10a2504fa8ba6a40697ff75a151  -

This will allow us to compare *.fastq.md5 files directly.

HenrikBengtsson commented 4 years ago

Now, using the develop branch:

[cbctest2@n17 lg3-demo-2018-12-27-002]$ LG3_TEST_TRUTH=../lg3-demo-2018-10-08/truth PATIENT=Patient157t10 lg3 test validate
Sourced: /home/henrik/repositories/UCSF-CostelloLab/LG3_Pipeline-develop/lg3.conf
*** Configuration                                     
[OK] PROJECT=LG3                                        
[OK] PATIENT=Patient157t10            
[OK] CONV=patient_ID_conversions.tsv     
[OK] LG3_TEST_TRUTH=../lg3-demo-2018-10-08/truth                   

*** Trimming of FASTQ Files                                                                                                
[OK] file tree ('output/LG3/trim/Z00*-trim')                                     
[OK] file sizes ('output/LG3/trim/Z00*-trim/*')
[OK] file md5 checksums after gunzip ('output/LG3/trim/Z00*-trim/*.fastq.gz')

*** BWA Alignment of FASTQ Files    
...

Using this new validation, I've confirmed that all of the following LG3 versions produced identical trimmed FASTQ files;

HenrikBengtsson commented 4 years ago

I've added the same MD5 checksum validation of the BWA aligned BAM files and the corresponding BAM index files. The result is the same; all of the above versions produce identical BAM and BAI files, e.g.

[cbctest2@n17 lg3-demo-2018-12-27-002]$ LG3_TEST_TRUTH=../lg3-demo-2020-05-16/truth PATIENT=Patient157t10 lg3 test validate
Sourced: /home/henrik/repositories/UCSF-CostelloLab/LG3_Pipeline-develop/lg3.conf
*** Configuration                                                     
[OK] PROJECT=LG3                                                          
[OK] PATIENT=Patient157t10                                                                                                                                
[OK] CONV=patient_ID_conversions.tsv                                                                                                                                    
[OK] LG3_TEST_TRUTH=../lg3-demo-2020-05-16/truth                      

*** Trimming of FASTQ Files                                                                         
[OK] file tree ('output/LG3/trim/Z00*-trim')                               
[OK] file sizes ('output/LG3/trim/Z00*-trim/*')                                               
[OK] file md5 checksums (after gunzip) ('output/LG3/trim/Z00*-trim/*.fastq.gz')               

*** BWA Alignment of FASTQ Files                                                         
[OK] file tree ('output/LG3/exomes')                                                                    
[OK] file sizes ('output/LG3/exomes/Z00*/*')                                                                                
[OK] file md5 checksums ('output/LG3/exomes/Z00*/*.bai')                             
[OK] file md5 checksums ('output/LG3/exomes/Z00*/*.bam')   
...