bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 354 forks source link

QC Files in “final” directory are identical when using multiple normals #2208

Closed choosehappy closed 6 years ago

choosehappy commented 6 years ago

I have 4 sets of patients, each has a normal, primary, and metastasis

I use a template to call variants on these, and have a csv file which looks like this:

samplename,batch,phenotype,variant_regions
888_PD18790a,b1a,tumor,S04380110_Padded_hg38_trimmed.bed
888_PD18790b,"b1a,b1c",normal,S04380110_Padded_hg38_trimmed.bed
888_PD18790c,b1c,tumor,S04380110_Padded_hg38_trimmed.bed
888_PD18793a,b2a,tumor,S04380110_Padded_hg38_trimmed.bed
888_PD18793b,"b2a,b2c",normal,S04380110_Padded_hg38_trimmed.bed
888_PD18793c,b2c,tumor,S04380110_Padded_hg38_trimmed.bed
888_PD18810a,b3a,tumor,S04380110_Padded_hg38_trimmed.bed
888_PD18810b,"b3a,b3c",normal,S04380110_Padded_hg38_trimmed.bed
888_PD18810c,b3c,tumor,S04380110_Padded_hg38_trimmed.bed
888_PD18796a,b4a,tumor,S04380110_Padded_hg38_trimmed.bed
888_PD18796b,"b4a,b4c",normal,S04380110_Padded_hg38_trimmed.bed
888_PD18796c,b4c,tumor,S04380110_Padded_hg38_trimmed.bed

Which as I understand, if I label the primary as “b1a”, and the metastasis as “b1c”, then I can label the normal as “b1a,b1c” and have it use the sample twice.

This appears to work as expected downstream

My issue is, the QC directory has the same exact files for all of the normal samples, which is very unlikely given that the bam files produced are themselves different. I can use md5sum to show this:

root@sib-pc25:/export/big/ajanowcz/livermet/proj/final# md5sum `find . | grep kraken_summary | grep b |  sort ` 
ce65fe15e645df57e0c32b09813b0899  ./888_PD18790b/qc/kraken/kraken_summary
ce65fe15e645df57e0c32b09813b0899  ./888_PD18793b/qc/kraken/kraken_summary
ce65fe15e645df57e0c32b09813b0899  ./888_PD18796b/qc/kraken/kraken_summary
ce65fe15e645df57e0c32b09813b0899  ./888_PD18810b/qc/kraken/kraken_summary

versus the working directory:

root@sib-pc25:/export/big/ajanowcz/livermet/proj/work# md5sum `find . | grep kraken_summary | grep b |  sort ` 
ce65fe15e645df57e0c32b09813b0899  ./qc/888_PD18790b/kraken/kraken_summary
a61d59d98b7773d40da828435dc66b9d  ./qc/888_PD18793b/kraken/kraken_summary
e58075c2951d1724c38f8f236a5e820d  ./qc/888_PD18796b/kraken/kraken_summary
2953f382ccc13c34874939873f5f3535  ./qc/888_PD18810b/kraken/kraken_summary
root@sib-pc25:/export/big/ajanowcz/livermet/proj/work# 

It looks like bcbio is taking the first normal and copying it into all other directories.

root@sib-pc25:/export/big/ajanowcz/livermet/proj/final# md5sum `find . | grep zip | grep b |  sort ` 
70b34dfc006ffb3e2362c620d1eeea43  ./888_PD18790b/qc/fastqc/888_PD18790b.zip
70b34dfc006ffb3e2362c620d1eeea43  ./888_PD18793b/qc/fastqc/888_PD18790b.zip
70b34dfc006ffb3e2362c620d1eeea43  ./888_PD18796b/qc/fastqc/888_PD18790b.zip
70b34dfc006ffb3e2362c620d1eeea43  ./888_PD18810b/qc/fastqc/888_PD18790b.zip

versus:

root@sib-pc25:/export/big/ajanowcz/livermet/proj/final# cd ..
root@sib-pc25:/export/big/ajanowcz/livermet/proj# cd work/
root@sib-pc25:/export/big/ajanowcz/livermet/proj/work# md5sum `find . | grep zip | grep b |  sort ` 
70b34dfc006ffb3e2362c620d1eeea43  ./qc/888_PD18790b/fastqc/888_PD18790b.zip
0095eb1db884571d115d6f1dcf46cb98  ./qc/888_PD18793b/fastqc/888_PD18793b.zip
7652adc65800709760d97f0472ccc709  ./qc/888_PD18796b/fastqc/888_PD18796b.zip
1af3fa8a03718f08e422c20e7e6d0d65  ./qc/888_PD18810b/fastqc/888_PD18810b.zip
root@sib-pc25:/export/big/ajanowcz/livermet/proj/work# 

Note, the tumor samples have the correct values, its only the normals that appear to be affected

choosehappy commented 6 years ago

This appears to be limited to the QC directory, as the bam files in the "final" directory are all of different sizes:

root@sib-pc25:/export/big/ajanowcz/livermet/proj/final# ls -l `find . | grep bam | grep "b-" | grep -v bai | sort`
-rwxrwxrwx 1 root root    66920020 Jan  4 15:54 ./888_PD18790b/888_PD18790b-disc.bam
-rwxrwxrwx 1 root root 12916031830 Jan  4 15:54 ./888_PD18790b/888_PD18790b-ready.bam
-rwxrwxrwx 1 root root    10506773 Jan  4 15:54 ./888_PD18790b/888_PD18790b-sr.bam
-rwxrwxrwx 1 root root    58365782 Jan  4 16:22 ./888_PD18793b/888_PD18793b-disc.bam
-rwxrwxrwx 1 root root 11690415573 Jan  4 16:22 ./888_PD18793b/888_PD18793b-ready.bam
-rwxrwxrwx 1 root root     7943727 Jan  4 16:22 ./888_PD18793b/888_PD18793b-sr.bam
-rwxrwxrwx 1 root root    66824776 Jan  4 16:25 ./888_PD18796b/888_PD18796b-disc.bam
-rwxrwxrwx 1 root root 11986694572 Jan  4 16:25 ./888_PD18796b/888_PD18796b-ready.bam
-rwxrwxrwx 1 root root     8930630 Jan  4 16:25 ./888_PD18796b/888_PD18796b-sr.bam
-rwxrwxrwx 1 root root   132817279 Jan  4 16:05 ./888_PD18810b/888_PD18810b-disc.bam
-rwxrwxrwx 1 root root 23758126851 Jan  4 16:05 ./888_PD18810b/888_PD18810b-ready.bam
-rwxrwxrwx 1 root root    10376467 Jan  4 16:05 ./888_PD18810b/888_PD18810b-sr.bam
root@sib-pc25:/export/big/ajanowcz/livermet/proj/final# 

As well as other output files (e.g., optitype):

root@sib-pc25:/export/big/ajanowcz/livermet/proj/final# md5sum `find . | grep csv | grep opti | grep b`
68547cacea8baca2f77792b768fe4880  ./888_PD18796b/888_PD18796b-hla-optitype.csv
157466b65a85de73cf982953ea30fdb2  ./888_PD18810b/888_PD18810b-hla-optitype.csv
3c748aa56f665bdd30ebae2e605e87e5  ./888_PD18790b/888_PD18790b-hla-optitype.csv
f060b536986b2d2d2c4e8a4f5d71b5b6  ./888_PD18793b/888_PD18793b-hla-optitype.csv
root@sib-pc25:/export/big/ajanowcz/livermet/proj/final# 
chapmanb commented 6 years ago

Thank you for this detailed report and sorry about the issue. This was due to a recent QC change where we re-used part of the the existing QC dictionary but in shared cases this ended up duplicating it across samples for the outputs. I pushed a fix by explicitly ensuring the dictionary is not shared. If you upgrade to the latest development version (bcbio_nextgen.py upgrade -u development), remove the QC directories in your final output and re-run it should not put the right QC metrics there.

Thank you again for the report and please let us know if you run into any other issues.

choosehappy commented 6 years ago

yup, that did it, thanks!

root@51853b033dc8:/data/livermet/proj/final# md5sum `find . | grep kraken_summary | grep b |  sort `
ce65fe15e645df57e0c32b09813b0899  ./888_PD18790b/qc/kraken/kraken_summary
a61d59d98b7773d40da828435dc66b9d  ./888_PD18793b/qc/kraken/kraken_summary
e58075c2951d1724c38f8f236a5e820d  ./888_PD18796b/qc/kraken/kraken_summary
2953f382ccc13c34874939873f5f3535  ./888_PD18810b/qc/kraken/kraken_summary
root@51853b033dc8:/data/livermet/proj/final#