bcbio / bcbio-nextgen-vm

Run bcbio-nextgen genomic sequencing analyses using isolated containers and virtual machines
MIT License

Ensemble stage process failed #150

Closed (mortunco closed this issue 8 years ago)

mortunco commented 8 years ago

Dear Brad,

I tried running the final template on BAM files, and I think it did pretty well. However, during the ensemble phase I got an error and the pipe broke. I believe the problem is related to the additional GL00.. outputs, which I think are another way of describing chromosomes, and that the ensemble algorithm fails at that step. I have those GL00.. entries in the outputs of all my callers. Have you ever come across this kind of issue?

ubuntu@frontend001:/encrypted/project10/work/freebayes$ ls -l
total 3445728
drwxr-xr-x 3 ubuntu ubuntu      12288 Apr 18 17:43 1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:00 10
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:25 11
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:34 12
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:27 13
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:53 14
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 15
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:36 16
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:49 17
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:49 18
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:51 19
drwxr-xr-x 3 ubuntu ubuntu      12288 Apr 18 17:58 2
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 20:04 20
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:57 21
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 20:02 22
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 18:10 3
drwxr-xr-x 3 ubuntu ubuntu      12288 Apr 18 18:14 4
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 18:24 5
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 18:30 6
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 18:49 7
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 18:51 8
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 18:52 9
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000191.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:43 GL000192.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:43 GL000193.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:43 GL000194.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:44 GL000195.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000196.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000197.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000198.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:41 GL000199.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:41 GL000200.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000201.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000202.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000203.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000204.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:42 GL000205.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000206.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:36 GL000207.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:40 GL000208.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:39 GL000209.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:36 GL000210.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:40 GL000211.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:42 GL000212.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:39 GL000213.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:41 GL000214.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:40 GL000215.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:41 GL000216.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:40 GL000217.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:40 GL000218.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:41 GL000219.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 20:05 GL000220.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:39 GL000221.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:42 GL000222.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:41 GL000223.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:42 GL000224.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:45 GL000225.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:39 GL000226.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000227.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:39 GL000228.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:36 GL000229.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000230.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:36 GL000231.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000232.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000233.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000234.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000235.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000236.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000237.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000238.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:36 GL000239.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000240.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000241.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000242.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:38 GL000243.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000244.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000245.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000246.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000247.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000248.1
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:37 GL000249.1
-rw-r--r-- 1 ubuntu ubuntu  581575268 Apr 20 11:14 ICGC-effects-filter-germline.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu    1817514 Apr 20 11:14 ICGC-effects-filter-germline.vcf.gz.tbi
-rw-r--r-- 1 ubuntu ubuntu  582366667 Apr 20 10:58 ICGC-effects-filter.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu    1819300 Apr 20 10:58 ICGC-effects-filter.vcf.gz.tbi
-rw-r--r-- 1 ubuntu ubuntu   23025269 Apr 20 10:50 ICGC-effects-stats.genes.txt
-rw-r--r-- 1 ubuntu ubuntu     568236 Apr 20 10:50 ICGC-effects-stats.html
-rw-r--r-- 1 ubuntu ubuntu         15 Apr 20 10:56 ICGC-effects-stats.yaml
-rw-r--r-- 1 ubuntu ubuntu 1905041742 Apr 20 10:50 ICGC-effects.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu    1871340 Apr 20 10:51 ICGC-effects.vcf.gz.tbi
-rw-r--r-- 1 ubuntu ubuntu      13939 Apr 20 09:57 ICGC-files.list
-rw-r--r-- 1 ubuntu ubuntu  428258804 Apr 20 09:59 ICGC.vcf.gz
-rw-r--r-- 1 ubuntu ubuntu    1650055 Apr 20 09:59 ICGC.vcf.gz.tbi
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:35 MT
drwxr-xr-x 2 ubuntu ubuntu       4096 Apr 20 11:14 tx
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:48 X
drwxr-xr-x 3 ubuntu ubuntu       4096 Apr 18 19:43 Y
  1. If I start another run in this folder with the command bcbio -n 36 s3://tuncproject/icgcproject/input/deneme8/.yaml (the exact command that brought me to this point), will it start from the ensemble step, or will it start from the beginning? This is the current state of my work directory:
drwxr-xr-x  4 ubuntu ubuntu 4096 Apr 18 14:02 align
-rw-rw-r--  1 ubuntu ubuntu 1221 Apr 18 07:23 bcbio_sample-forvm.yaml
-rw-r--r--  1 ubuntu ubuntu  991 Apr 18 07:23 bcbio_system-forvm-merged.yaml
-rw-rw-r--  1 ubuntu ubuntu  991 Apr 18 07:23 bcbio_system-forvm.yaml
-rw-r--r--  1 ubuntu ubuntu  970 Apr 18 07:23 bcbio_system-merged.yaml
drwxr-xr-x  3 ubuntu ubuntu 4096 Apr 18 14:40 bedprep
drwxr-xr-x  2 ubuntu ubuntu 4096 Apr 20 09:59 checkpoints_parallel
drwxrwxr-x  2 ubuntu ubuntu 4096 Apr 20 20:05 config
drwxr-xr-x  3 ubuntu ubuntu 4096 Apr 20 12:19 ensemble
drwxr-xr-x 87 ubuntu ubuntu 4096 Apr 20 11:14 freebayes
drwxr-xr-x  3 ubuntu ubuntu 4096 Apr 20 13:01 gemini
drwxr-xr-x  5 ubuntu ubuntu 4096 Apr 18 07:40 inputs
drwxrwxr-x  2 ubuntu ubuntu 4096 Apr 18 07:23 log
drwxr-xr-x 87 ubuntu ubuntu 4096 Apr 20 11:39 mutect
drwxr-xr-x  4 ubuntu ubuntu 4096 Apr 18 09:53 prealign
drwxr-xr-x  2 ubuntu ubuntu 4096 Apr 18 07:42 provenance
drwxr-xr-x  3 ubuntu ubuntu 4096 Apr 18 14:40 regions
drwxr-xr-x  4 ubuntu ubuntu 4096 Apr 20 12:21 structural
drwxr-xr-x  2 ubuntu ubuntu 4096 Apr 20 13:01 tx
drwxr-xr-x 87 ubuntu ubuntu 4096 Apr 20 10:33 vardict
drwxr-xr-x 87 ubuntu ubuntu 4096 Apr 20 12:19 varscan
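
The log further down reports records where the "Number of columns ... does not match the number of samples (2 vs 4)". One way to check a VCF for this kind of mismatch independently of bcbio might be the following minimal sketch. It assumes Python 3 and a tab-delimited VCF with the standard nine fixed columns before the sample columns. The function names `find_sample_mismatches` and `check_vcf` are hypothetical helpers written for illustration; they are not part of bcbio, vt, or htslib.

```python
import gzip
from typing import Iterable, Iterator, Tuple

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT come before the sample columns.
FIXED_COLUMNS = 9


def find_sample_mismatches(
    lines: Iterable[str],
) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (chrom, pos, found, expected) for every record whose number of
    sample columns differs from the number declared on the #CHROM line."""
    expected = None
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and "##" meta-information lines.
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            expected = max(len(fields) - FIXED_COLUMNS, 0)
            continue
        if expected is None:
            raise ValueError("record found before the #CHROM header line")
        found = max(len(fields) - FIXED_COLUMNS, 0)
        if found != expected:
            yield fields[0], int(fields[1]), found, expected


def check_vcf(path: str) -> None:
    """Print every mismatching record in a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for chrom, pos, found, expected in find_sample_mismatches(handle):
            print(f"{chrom}:{pos} has {found} sample columns, "
                  f"header declares {expected}")
```

Calling `check_vcf("/mnt/work/gemini/ICGC-ensemble.vcf.gz")` on the file named in the failing command would list any records whose sample columns disagree with the header. It only counts columns, so it would not catch other kinds of malformed records.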

My error is related to the ensemble step. bcbio called the following command:

[2016-04-20T13:01Z] gunzip -c /mnt/work/gemini/ICGC-ensemble.vcf.gz | sed 's/ID=AD,Number=./ID=AD,Number=R/' | vt decompose -s - | awk '{ gsub("./-65", "./."); print $0 }'| bgzip -c > /mnt/work/gemini/tx/tmpwuH9Cf/ICGC-ensemble-decompose.vcf.gz

Then I got the following error:

[2016-04-20T12:19Z] Timing: ensemble calling
[2016-04-20T12:19Z] multiprocessing: combine_calls
[2016-04-20T12:19Z] Ensemble consensus calls for ICGC: mutect,vardict,varscan,freebayes
[2016-04-20T12:19Z] Ensemble intersection calling: ICGC
[2016-04-20T12:19Z] /usr/local/share/bcbio-nextgen/anaconda/bin/bcbio-variation-recall: line 6: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
[2016-04-20T12:19Z] 2016-04-20 12:19:10 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:19Z] 2016-04-20 12:19:44 frontend001 INFO [bcbio.run.itx] - [vcf.c:1764 _vcf_parse_format] Number of columns at GL000207.1:41 does not match the number of samples (2 vs 4).
[2016-04-20T12:19Z] 2016-04-20 12:19:44 frontend001 INFO [bcbio.run.itx] - [vcf.c:1764 _vcf_parse_format] Number of columns at GL000207.1:137 does not match the number of samples (2 vs 4).
[2016-04-20T12:19Z] 2016-04-20 12:19:44 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:19Z] 2016-04-20 12:19:44 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:19Z] 2016-04-20 12:19:44 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:20Z] 2016-04-20 12:20:36 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:20Z] 2016-04-20 12:20:36 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:20Z] 2016-04-20 12:20:36 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:13 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:13 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:14 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:47 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:47 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:48 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:48 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:50 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:50 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:50 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:50 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:50 frontend001 INFO [bcbio.run.itx] - bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
[2016-04-20T12:21Z] 2016-04-20 12:21:51 frontend001 INFO [bcbio.run.itx] - Note: -w option not given, printing list of sites...
[2016-04-20T12:21Z] Timing: validation summary
[2016-04-20T12:21Z] Timing: structural variation final
[2016-04-20T12:21Z] Timing: structural variation ensemble
[2016-04-20T12:21Z] Timing: structural variation validation
[2016-04-20T12:21Z] multiprocessing: validate_sv
[2016-04-20T12:21Z] Timing: heterogeneity
[2016-04-20T12:21Z] Timing: population database
[2016-04-20T12:21Z] multiprocessing: prep_gemini_db
[2016-04-20T12:21Z] Multi-allelic to single allele
[2016-04-20T12:21Z] decompose v0.5
[2016-04-20T12:21Z] 
[2016-04-20T12:21Z] options:     input VCF file        -
[2016-04-20T12:21Z]          [s] smart decomposition   true (experimental)
[2016-04-20T12:21Z]          [o] output VCF file       -
[2016-04-20T12:21Z] 
[2016-04-20T12:23Z] 
[2016-04-20T12:23Z] stats: no. variants                 : 4158207
[2016-04-20T12:23Z]        no. biallelic variants       : 4150166
[2016-04-20T12:23Z]        no. multiallelic variants    : 8041
[2016-04-20T12:23Z] 
[2016-04-20T12:23Z]        no. additional biallelics    : 8155
[2016-04-20T12:23Z]        total no. of biallelics      : 4166362
[2016-04-20T12:23Z] 
[2016-04-20T12:23Z] Time elapsed: 1m 16s
[2016-04-20T12:23Z] 
[2016-04-20T12:23Z] tabix index ICGC-freebayes-decompose.vcf.gz
[2016-04-20T12:23Z] snpEff effects : DO51159-Tumor
[2016-04-20T12:23Z] /usr/local/share/bcbio-nextgen/anaconda/bin/snpEff: line 6: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
[2016-04-20T12:41Z] tabix index ICGC-freebayes-decompose-effects.vcf.gz
[2016-04-20T12:41Z] Multi-allelic to single allele
[2016-04-20T12:41Z] decompose v0.5
[2016-04-20T12:41Z] 
[2016-04-20T12:41Z] options:     input VCF file        -
[2016-04-20T12:41Z]          [s] smart decomposition   true (experimental)
[2016-04-20T12:41Z]          [o] output VCF file       -
[2016-04-20T12:41Z] 
[2016-04-20T12:41Z] [W::vcf_parse] contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:41Z] [W::vcf_parse] contig '2' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '3' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '4' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '5' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '6' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '7' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '8' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '9' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '10' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '11' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '12' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:42Z] [W::vcf_parse] contig '13' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig '14' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig '15' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig '16' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig '17' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig '18' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig '19' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig '20' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig '21' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig '22' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'X' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'Y' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000207.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000226.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000229.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000231.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000210.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000239.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000235.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000201.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000247.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000245.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000203.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000246.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000249.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000196.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000248.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000244.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000238.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000202.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000234.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000232.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000206.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000240.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000236.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000241.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000243.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000242.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000230.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000237.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000233.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000204.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000198.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000208.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000191.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000227.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000228.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000214.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000221.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000209.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000218.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000220.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000213.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000211.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000199.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000217.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000216.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000215.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000205.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000219.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000224.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000223.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000195.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000212.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000222.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000200.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000193.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000194.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000225.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] [W::vcf_parse] contig 'GL000192.1' is not defined in the header. (Quick workaround: index the file with tabix.)
[2016-04-20T12:43Z] 
[2016-04-20T12:43Z] stats: no. variants                 : 4212926
[2016-04-20T12:43Z]        no. biallelic variants       : 4212926
[2016-04-20T12:43Z]        no. multiallelic variants    : 0
[2016-04-20T12:43Z] 
[2016-04-20T12:43Z]        no. additional biallelics    : 0
[2016-04-20T12:43Z]        total no. of biallelics      : 4212926
[2016-04-20T12:43Z] 
[2016-04-20T12:43Z] Time elapsed: 1m 24s
[2016-04-20T12:43Z] 
[2016-04-20T12:43Z] tabix index ICGC-vardict-decompose.vcf.gz
[2016-04-20T12:43Z] snpEff effects : DO51159-Tumor
[2016-04-20T12:43Z] /usr/local/share/bcbio-nextgen/anaconda/bin/snpEff: line 6: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
[2016-04-20T13:01Z] tabix index ICGC-vardict-decompose-effects.vcf.gz
[2016-04-20T13:01Z] Multi-allelic to single allele
[2016-04-20T13:01Z] decompose v0.5
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z] options:     input VCF file        -
[2016-04-20T13:01Z]          [s] smart decomposition   true (experimental)
[2016-04-20T13:01Z]          [o] output VCF file       -
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z] [vcf.c:1764 _vcf_parse_format] Number of columns at 1:65745 does not match the number of samples (2 vs 4).
[2016-04-20T13:01Z] sed: couldn't write 649 items to stdout: Broken pipe
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z] stats: no. variants                 : 0
[2016-04-20T13:01Z]        no. biallelic variants       : 0
[2016-04-20T13:01Z]        no. multiallelic variants    : 0
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z]        no. additional biallelics    : 0
[2016-04-20T13:01Z]        total no. of biallelics      : 0
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z] Time elapsed: 0.00s
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z] gzip: stdout: Broken pipe
[2016-04-20T13:01Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; gunzip -c /mnt/work/gemini/ICGC-ensemble.vcf.gz | sed 's/ID=AD,Number=./ID=AD,Number=R/' | vt decompose -s - | awk '{ gsub("./-65", "./."); print $0 }'| bgzip -c > /mnt/work/gemini/tx/tmpwuH9Cf/ICGC-ensemble-decompose.vcf.gz
decompose v0.5

options:     input VCF file        -
         [s] smart decomposition   true (experimental)
         [o] output VCF file       -

[vcf.c:1764 _vcf_parse_format] Number of columns at 1:65745 does not match the number of samples (2 vs 4).
sed: couldn't write 649 items to stdout: Broken pipe

stats: no. variants                 : 0
       no. biallelic variants       : 0
       no. multiallelic variants    : 0

       no. additional biallelics    : 0
       total no. of biallelics      : 0

Time elapsed: 0.00s

gzip: stdout: Broken pipe
' returned non-zero exit status 4
Uncaught exception occurred
Traceback (most recent call last):
  File "/home/ubuntu/install/bcbio-vm/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/ubuntu/install/bcbio-vm/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'docker attach --no-stdin 18e095a3bd08e340d41c74514ddafc6c95d0ed488e238c4d8ad84e728602db36
[2016-04-20T13:01Z]        no. multiallelic variants    : 0
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z]        no. additional biallelics    : 0
[2016-04-20T13:01Z]        total no. of biallelics      : 0
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z] Time elapsed: 0.00s
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z] 
[2016-04-20T13:01Z] gzip: stdout: Broken pipe
[2016-04-20T13:01Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; gunzip -c /mnt/work/gemini/ICGC-ensemble.vcf.gz | sed 's/ID=AD,Number=./ID=AD,Number=R/' | vt decompose -s - | awk '{ gsub("./-65", "./."); print $0 }'| bgzip -c > /mnt/work/gemini/tx/tmpwuH9Cf/ICGC-ensemble-decompose.vcf.gz
decompose v0.5

options:     input VCF file        -
         [s] smart decomposition   true (experimental)
         [o] output VCF file       -

[vcf.c:1764 _vcf_parse_format] Number of columns at 1:65745 does not match the number of samples (2 vs 4).
sed: couldn't write 649 items to stdout: Broken pipe

stats: no. variants                 : 0
       no. biallelic variants       : 0
       no. multiallelic variants    : 0

       no. additional biallelics    : 0
       total no. of biallelics      : 0

Time elapsed: 0.00s

gzip: stdout: Broken pipe
' returned non-zero exit status 4
Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 226, in <module>
    main(**kwargs)
  File "/usr/local/bin/bcbio_nextgen.py", line 43, in main
    run_main(**kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
    fc_dir, run_info_yaml)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 181, in variant2pipeline
    samples = population.prep_db_parallel(samples, run_parallel)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/population.py", line 281, in prep_db_parallel
    output = parallel_fn("prep_gemini_db", to_process)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 180, in __init__
    self.results = batch()
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
    return apply(f, *args, **kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 239, in prep_gemini_db
    return population.prep_gemini_db(*args)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/population.py", line 29, in prep_gemini_db
    gemini_vcf = multiallelic.to_single(multisample_vcf, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/multiallelic.py", line 37, in to_single
    ready_ma_file = _decompose(in_file, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/multiallelic.py", line 58, in _decompose
    do.run(cmd % (in_file, tx_out_file), "Multi-allelic to single allele")
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; gunzip -c /mnt/work/gemini/ICGC-ensemble.vcf.gz | sed 's/ID=AD,Number=./ID=AD,Number=R/' | vt decompose -s - | awk '{ gsub("./-65", "./."); print $0 }'| bgzip -c > /mnt/work/gemini/tx/tmpwuH9Cf/ICGC-ensemble-decompose.vcf.gz
decompose v0.5

options:     input VCF file        -
         [s] smart decomposition   true (experimental)
         [o] output VCF file       -

[vcf.c:1764 _vcf_parse_format] Number of columns at 1:65745 does not match the number of samples (2 vs 4).
sed: couldn't write 649 items to stdout: Broken pipe

stats: no. variants                 : 0
       no. biallelic variants       : 0
       no. multiallelic variants    : 0

       no. additional biallelics    : 0
       total no. of biallelics      : 0

Time elapsed: 0.00s

gzip: stdout: Broken pipe
' returned non-zero exit status 4
' returned non-zero exit status 1

If I get past this step, hopefully I won't bother you for a while. I owe you a lot!

Best regards,

Tunc.

chapmanb commented 8 years ago

Tunc; Sorry about the problem. It looks like the ensemble process finished okay and you're at the stage of creating GEMINI databases. If you don't need these, you can turn off gemini in the algorithm section of your samples (http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#post-processing):

algorithm:
   tools_off: [gemini]

Hopefully that'll finish the processing cleanly.

Regarding the underlying error, it looks like something is problematic about the created ensemble file. The informative error is:

Number of columns at 1:65745 does not match the number of samples (2 vs 4).

If you look at the ensemble file (/mnt/work/gemini/ICGC-ensemble.vcf.gz), is there anything wrong with this position (chromosome 1, position 65745)? If this is the same configuration as before, you should have 2 samples (tumor/normal) but instead have 4, so something is off there.
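One way to locate records like this is to compare the number of sample columns on each line against the #CHROM header. The following is a minimal sketch, assuming Python 3 and a tab-delimited VCF (plain or gzip/bgzip compressed); the function name find_column_mismatches is hypothetical and is not part of bcbio or vt.

```python
import gzip

# A VCF record has 9 fixed columns (CHROM .. FORMAT) before the sample columns.
N_FIXED = 9


def find_column_mismatches(vcf_path, limit=10):
    """Return (chrom, pos, n_found, n_expected) for records whose number of
    sample columns differs from the number of samples in the #CHROM header."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    mismatches = []
    n_expected = None
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("##") or not line.strip():
                continue  # skip meta-information and blank lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                n_expected = len(fields) - N_FIXED
                continue
            if n_expected is None:
                raise ValueError("record found before the #CHROM header line")
            n_found = len(fields) - N_FIXED
            if n_found != n_expected:
                mismatches.append((fields[0], int(fields[1]), n_found, n_expected))
                if len(mismatches) >= limit:
                    break
    return mismatches
```

Calling find_column_mismatches("/mnt/work/gemini/ICGC-ensemble.vcf.gz") would list the first few positions where the sample columns and the header disagree.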

Hope this helps with finishing and debugging if you want to look at it more. Thanks for all your patience getting this running.

mortunco commented 8 years ago

Dear Brad;

Thank you for the explanation. I am trying to figure out the sample problem.

My input folder is shown below, so there are two samples. Just a quick confirmation: I also have my BAM indexes in my input folder in S3. Do you think that might be related to the problem? I understand the indexes are always required, but could the failure in this last step originate from them?

ubuntu@frontend001:/encrypted/project10/work/inputs/tuncproject/icgcrun/input$ ls
normal.bam  tumor.bam  tx

I checked that specific location and couldn't tell whether anything is wrong with it. What should I be seeing as an error?


ubuntu@frontend001:/encrypted/project10/work/ensemble$ cat ICGC-ensemble.vcf | grep 65745
1   65745   .   A   G   61  PASS    ANN=G|upstream_gene_variant|MODIFIER|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding||c.-3346A>G|||||3346|,G|downstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|transcript|ENST00000492842|unprocessed_pseudogene||n.*1858A>G|||||1858|,G|intergenic_region|MODIFIER|OR4G11P-OR4F5|ENSG00000240361-ENSG00000186092|intergenic_region|ENSG00000240361-ENSG00000186092|||n.65745A>G||||||;CALLERS=vardict,varscan;LSEQ=GTTAACTTAAATACTTAGAG;MSI=3;MSILEN=1;RSEQ=AATTAATATGAATAATGTTA;SAMPLE=DO51159-Tumor;SHIFT3=3;SOMATIC;SOR=0;SSF=0.51648;STATUS=StrongSomatic;TYPE=SNV  GT:AD:ADJAF:AF:ALD:BIAS:DP:HIAF:MQ:NM:ODDRATIO:PMEAN:PSTD:QSTD:QUAL:RD:SBF:SN:VD    0/0:4,0:0:0:0,0:0,0:4:1:27:0:0:28.8:1:1:39.3:4,0:1:8:0  0/1:8,3:0:0.2727:3,0:0,0:11:0.2727:25:1:0:33.3:1:1:38.7:8,0:1:6:3

Lastly, as you can see, I specified only two samples for the run. Is there a mistake in how the files are specified?

details:
- algorithm:
    aligner: false
    ensemble:
      numpass: 2
    indelcaller: scalpel
    platform: illumina
    quality_format: illumina
    realign: false
    recalibrate: false
    remove_lcr: true
    variantcaller:
    - mutect
    - vardict
    - varscan
    - freebayes
  analysis: variant2
  description: DO51159-Normal
  files:
  - s3://tuncproject/icgcrun/input/normal.bam
  genome_build: GRCh37
  metadata:
    batch: ICGC
    phenotype: normal
- algorithm:
    aligner: false
    ensemble:
      numpass: 2
    indelcaller: scalpel
    platform: illumina
    quality_format: illumina
    realign: false
    recalibrate: false
    remove_lcr: true
    variantcaller:
    - mutect
    - vardict
    - varscan
    - freebayes
  analysis: variant2
  description: DO51159-Tumor
  files:
  - s3://tuncproject/icgcrun/input/tumor.bam
  genome_build: GRCh37
  metadata:
    batch: ICGC
    phenotype: tumor
fc_date: '2015-04-14'
fc_name: ICGC-trials
resources:
  gatk:
    jar: s3://tuncproject/gatktools/GenomeAnalysisTK.jar
  mutect:
    jar: s3://tuncproject/gatktools/mutect-1.1.7.jar
upload:
  bucket: tuncproject
  dir: ../final
  folder: icgcrun/input/final
  method: s3
  region: us-east-1

Finally, GEMINI seems very informative and benefits the final variant calls, so I would prefer to keep that information.

Thank you very much for your help,

Best,

Tunc.

chapmanb commented 8 years ago

Tunc; Thanks for the help debugging this. This line looks correct to me, so it might be that a previous line is signaling there should be 4 samples, while this line has the correct 2 samples. If you look at the same file, what does the sample header look like?

zgrep ^#CHROM /mnt/work/gemini/ICGC-ensemble.vcf.gz

Based on the error message, my guess is one of the calling methods has 4 samples in the #CHROM line and this is causing the confusion. I'm specifically worried that you have aligner: false and don't have bam_clean: picard set. The bam_clean step (or running an alignment) handles correctly setting the read group information in the BAM file to match the sample names you specified (DO51159-Tumor and DO51159-Normal). Did you manually correct the BAM files prior to feeding to bcbio? If not we might have ended up with two sample names: the original with whatever was in the BAM file read group, and the bcbio specified one (when it corrects callers that don't use read group information and output generic names like TUMOR/NORMAL).

This might also help identify if something looks wrong in one of the outputs:

ls */*-effects.vcf.gz | xargs zgrep ^#CHROM

All of the sample lines should have two sample names (DO51159-Tumor and DO51159-Normal).
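The same comparison can be scripted. This is a minimal sketch, assuming Python 3 and tab-delimited VCF headers; vcf_samples and unexpected_samples are hypothetical helper names, not bcbio functions.

```python
import gzip


def vcf_samples(vcf_path):
    """Return the sample names listed on the #CHROM line of a VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns after the 9 fixed fields are the sample names.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a header line
    return []


def unexpected_samples(vcf_paths, expected):
    """Map each VCF whose sample names differ from `expected` to the sorted
    list of names that are missing from or extra in that file."""
    expected = set(expected)
    report = {}
    for path in vcf_paths:
        names = set(vcf_samples(path))
        if names != expected:
            report[path] = sorted(names ^ expected)
    return report
```

For example, unexpected_samples(glob.glob("*/*-effects.vcf.gz"), ["DO51159-Tumor", "DO51159-Normal"]), with glob imported from the standard library, would return an empty dictionary only if every caller output carries exactly those two names.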

Hope this helps.

mortunco commented 8 years ago

This is the output of that specific command you suggested.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  140e5014-bdd6-4663-9404-234c7f9e927d    DO51159-Normal  DO51159-Tumor   a9ec7d9e-b179-4782-a589-43c7d1642be9

I checked what those entries are. The one that starts with "a9ec.." is one of the Broad output names, but I don't know why it is in the file. I couldn't find anything matching the first one.

For example: a9ec7d9e-b179-4782-a589-43c7d1642be9.broad-snowman.20151107.germline.indel.vcf.gz

I think you are TOTALLY right about your guess. It is the read group names. My TUMOR/NORMAL entries in the configuration file caused this problem.

O mai god @chapmanb, you were totally right about the problem. THANK YOU THANK YOU!!!

ubuntu@frontend001:/encrypted/project10/work$  ls */*-effects.vcf.gz | xargs zgrep ^#CHROM
freebayes/ICGC-effects.vcf.gz:#CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  a9ec7d9e-b179-4782-a589-43c7d1642be9    140e5014-bdd6-4663-9404-234c7f9e927d
gemini/ICGC-freebayes-decompose-effects.vcf.gz:#CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  a9ec7d9e-b179-4782-a589-43c7d1642be9    140e5014-bdd6-4663-9404-234c7f9e927d
gemini/ICGC-vardict-decompose-effects.vcf.gz:#CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  DO51159-Tumor   DO51159-Normal
mutect/ICGC-effects.vcf.gz:#CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  140e5014-bdd6-4663-9404-234c7f9e927d    DO51159-Normal  DO51159-Tumor   a9ec7d9e-b179-4782-a589-43c7d1642be9
vardict/ICGC-effects.vcf.gz:#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  DO51159-Tumor   DO51159-Normal
varscan/ICGC-effects.vcf.gz:#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  DO51159-Normal  DO51159-Tumor

So, should I remove DO51159-Normal/Tumor from my configuration file? Or could you please guide me on what I should edit next to fix this problem? Could there be an option that uses a default taken from the header of the BAM?

One last question,

To finish this run successfully, can I use bcbio_vm.py run -n ... with the updated configuration file and continue where I left off, or will it start calling all the mutations again? I know that bcbio continues from aligned files when they exist, but is it the same for variant calling?

Thank you very much for identifying the problem.

Best regards,

Tunc.

chapmanb commented 8 years ago

Tunc; Thanks for confirming this is the issue and sorry about the problem. I will add some code in bcbio to check for this case and warn you before you get to this late stage and have mismatched sample names.

Fixing will unfortunately require some re-running. You have two choices. The first is to use the ICGC names (140e5014-bdd6-4663-9404-234c7f9e927d and a9ec7d9e-b179-4782-a589-43c7d1642be9). You can edit your sample YAML file to use these in place of your custom names (DO51159-Tumor and DO51159-Normal), and you'll have to figure out which is tumor and which is normal by looking at the input BAM files (samtools view -H tumor.bam | grep RG). After doing that you'd need to do:

rm -rf prealign checkpoints_parallel gemini ensemble
rm -rf mutect vardict varscan

The second option is to have bcbio fix the header by adding bam_clean: picard or do the alignment by adding aligner: bwa. Then you'd need to do:

rm -rf prealign checkpoints_parallel gemini ensemble
rm -rf mutect freebayes

And in both cases re-run from there. Sorry about the problem and hope this helps get your analysis finished.

chapmanb commented 8 years ago

Tunc; I looked into the code and am not sure how your run circumvents our checks for mismatched sample names. We have a check for your problem and should have caught it up front, but did not somehow. Could you report the output of:

samtools view -H inputs/tuncproject/icgcrun/input/tumor.bam | grep ^@RG
samtools view -H inputs/tuncproject/icgcrun/input/normal.bam | grep ^@RG

and I'll try to dig more to see what we're missing that prevents detecting this error at the start of a run. Thanks again for the help debugging.

mortunco commented 8 years ago

Brad;

Here is the information that you asked for. I figured out which SM tag belongs to which sample. I think the problem is caused by the description field. Apparently, all the callers except freebayes took the description input as the sample name and tried to process it as a different sample. (I learned this because sometimes different libraries are used for sequencing, so they are labeled with different SM tags to distinguish them during alignment and mutation calling.) The value to look at is SM in the @RG lines.

If I did not specify the description input, what would happen? Would the callers use the original SM tag information in the BAM files?

Could you let me know about this once you are done with your fix? I don't want to check the @RG lines of all of my BAMs manually for 214 samples.

I am really happy to contribute to bcbio :)

[ec2-user@ip-172-31-57-166 input]$ samtools view -H tumor.bam | grep SM | grep SM
@RG ID:OICR:CPCG_0414_Pr_P_PE_607_WG.1  PL:ILLUMINA CN:OICR PI:607  DT:2014-03-27T00:00:00+06:00    LB:WGS:OICR:CPCG_0414_Pr_P_PE_607_WG    SM:a9ec7d9e-b179-4782-a589-43c7d1642be9 PU:OICR:140327_SN801_0162_BC3RMVACXX_8  PG:fastqtobam
@RG ID:OICR:CPCG_0414_Pr_P_PE_607_WG.2  PL:ILLUMINA CN:OICR PI:607  DT:2014-04-11T00:00:00+06:00    LB:WGS:OICR:CPCG_0414_Pr_P_PE_607_WG    SM:a9ec7d9e-b179-4782-a589-43c7d1642be9 PU:OICR:140411_SN203_0228_BC3R9VACXX_8  PG:fastqtobam
@RG ID:OICR:CPCG_0414_Pr_P_PE_620_WG.1  PL:ILLUMINA CN:OICR PI:620  DT:2014-04-03T00:00:00+06:00    LB:WGS:OICR:CPCG_0414_Pr_P_PE_620_WG    SM:a9ec7d9e-b179-4782-a589-43c7d1642be9 PU:OICR:140403_h803_0182_AC4CT6ACXX_1   PG:fastqtobam
@RG ID:OICR:CPCG_0414_Pr_P_PE_620_WG.2  PL:ILLUMINA CN:OICR PI:620  DT:2014-04-29T00:00:00+06:00    LB:WGS:OICR:CPCG_0414_Pr_P_PE_620_WG    SM:a9ec7d9e-b179-4782-a589-43c7d1642be9 PU:OICR:140429_SN804_0189_AC4DR3ACXX_1  PG:fastqtobam
@RG ID:OICR:CPCG_0414_Pr_P_PE_630_WG.1  PL:ILLUMINA CN:OICR PI:630  DT:2014-04-03T00:00:00+06:00    LB:WGS:OICR:CPCG_0414_Pr_P_PE_630_WG    SM:a9ec7d9e-b179-4782-a589-43c7d1642be9 PU:OICR:140403_h803_0182_AC4CT6ACXX_2   PG:fastqtobam
@RG ID:OICR:CPCG_0414_Pr_P_PE_630_WG.2  PL:ILLUMINA CN:OICR PI:630  DT:2014-04-29T00:00:00+06:00    LB:WGS:OICR:CPCG_0414_Pr_P_PE_630_WG    SM:a9ec7d9e-b179-4782-a589-43c7d1642be9 PU:OICR:140429_SN804_0189_AC4DR3ACXX_2  PG:fastqtobam
@PG CL:/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/bin/PCAP-core-1.1.1/bin/bwa mem -t 15 -p -T 0 -R @RG\tID:OICR:CPCG_0414_Pr_P_PE_607_WG.1\tCN:OICR\tDT:2014-03-27T00:00:00+06:00\tLB:WGS:OICR:CPCG_0414_Pr_P_PE_607_WG\tPG:fastqtobam\tPI:607\tPL:ILLUMINA\tPU:OICR:140327_SN801_0162_BC3RMVACXX_8\tSM:a9ec7d9e-b179-4782-a589-43c7d1642be9 /mnt/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/data/reference/bwa-0.6.2/genome.fa.gz -   ID:bwa_0    PN:bwa  VN:0.7.8-r455
@PG CL:/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/bin/PCAP-core-1.1.1/bin/bwa mem -t 15 -p -T 0 -R @RG\tID:OICR:CPCG_0414_Pr_P_PE_630_WG.2\tCN:OICR\tDT:2014-04-29T00:00:00+06:00\tLB:WGS:OICR:CPCG_0414_Pr_P_PE_630_WG\tPG:fastqtobam\tPI:630\tPL:ILLUMINA\tPU:OICR:140429_SN804_0189_AC4DR3ACXX_2\tSM:a9ec7d9e-b179-4782-a589-43c7d1642be9 /mnt/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/data/reference/bwa-0.6.2/genome.fa.gz -   ID:bwa_1    PN:bwa  PP:bamsort_0    VN:0.7.8-r455
@PG CL:/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/bin/PCAP-core-1.1.1/bin/bwa mem -t 15 -p -T 0 -R @RG\tID:OICR:CPCG_0414_Pr_P_PE_620_WG.2\tCN:OICR\tDT:2014-04-29T00:00:00+06:00\tLB:WGS:OICR:CPCG_0414_Pr_P_PE_620_WG\tPG:fastqtobam\tPI:620\tPL:ILLUMINA\tPU:OICR:140429_SN804_0189_AC4DR3ACXX_1\tSM:a9ec7d9e-b179-4782-a589-43c7d1642be9 /mnt/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/data/reference/bwa-0.6.2/genome.fa.gz -   ID:bwa_2    PN:bwa  PP:bamsort_1    VN:0.7.8-r455
@PG CL:/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/bin/PCAP-core-1.1.1/bin/bwa mem -t 15 -p -T 0 -R @RG\tID:OICR:CPCG_0414_Pr_P_PE_620_WG.1\tCN:OICR\tDT:2014-04-03T00:00:00+06:00\tLB:WGS:OICR:CPCG_0414_Pr_P_PE_620_WG\tPG:fastqtobam\tPI:620\tPL:ILLUMINA\tPU:OICR:140403_h803_0182_AC4CT6ACXX_1\tSM:a9ec7d9e-b179-4782-a589-43c7d1642be9 /mnt/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/data/reference/bwa-0.6.2/genome.fa.gz -    ID:bwa_3    PN:bwa  PP:bamsort_2    VN:0.7.8-r455
@PG CL:/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/bin/PCAP-core-1.1.1/bin/bwa mem -t 15 -p -T 0 -R @RG\tID:OICR:CPCG_0414_Pr_P_PE_607_WG.2\tCN:OICR\tDT:2014-04-11T00:00:00+06:00\tLB:WGS:OICR:CPCG_0414_Pr_P_PE_607_WG\tPG:fastqtobam\tPI:607\tPL:ILLUMINA\tPU:OICR:140411_SN203_0228_BC3R9VACXX_8\tSM:a9ec7d9e-b179-4782-a589-43c7d1642be9 /mnt/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/data/reference/bwa-0.6.2/genome.fa.gz -   ID:bwa_4    PN:bwa  PP:bamsort_3    VN:0.7.8-r455
@PG CL:/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/bin/PCAP-core-1.1.1/bin/bwa mem -t 15 -p -T 0 -R @RG\tID:OICR:CPCG_0414_Pr_P_PE_630_WG.1\tCN:OICR\tDT:2014-04-03T00:00:00+06:00\tLB:WGS:OICR:CPCG_0414_Pr_P_PE_630_WG\tPG:fastqtobam\tPI:630\tPL:ILLUMINA\tPU:OICR:140403_h803_0182_AC4CT6ACXX_2\tSM:a9ec7d9e-b179-4782-a589-43c7d1642be9 /mnt/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/data/reference/bwa-0.6.2/genome.fa.gz -    ID:bwa_5    PN:bwa  PP:bamsort_4    VN:0.7.8-r455
[ec2-user@ip-172-31-57-166 input]$ samtools view -H normal.bam | grep SM
@RG ID:OICR:CPCG_0414_Ly_R_PE_000_WG.1  PL:ILLUMINA CN:OICR PI:000  DT:2014-03-05T00:00:00+06:00    LB:WGS:OICR:CPCG_0414_Ly_R_PE_000_WG    SM:140e5014-bdd6-4663-9404-234c7f9e927d PU:OICR:140305_SN1017_0409_BC3VKFACXX_4 PG:fastqtobam
@RG ID:OICR:CPCG_0414_Ly_R_PE_000_WG.2  PL:ILLUMINA CN:OICR PI:000  DT:2014-03-05T00:00:00+06:00    LB:WGS:OICR:CPCG_0414_Ly_R_PE_000_WG    SM:140e5014-bdd6-4663-9404-234c7f9e927d PU:OICR:140305_SN1017_0409_BC3VKFACXX_5 PG:fastqtobam
@RG ID:OICR:CPCG_0414_Ly_R_PE_000_WG.3  PL:ILLUMINA CN:OICR PI:000  DT:2014-03-05T00:00:00+06:00    LB:WGS:OICR:CPCG_0414_Ly_R_PE_000_WG    SM:140e5014-bdd6-4663-9404-234c7f9e927d PU:OICR:140305_SN1017_0409_BC3VKFACXX_6 PG:fastqtobam
@PG CL:/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/bin/PCAP-core-1.1.1/bin/bwa mem -t 15 -p -T 0 -R @RG\tID:OICR:CPCG_0414_Ly_R_PE_000_WG.1\tCN:OICR\tDT:2014-03-05T00:00:00+06:00\tLB:WGS:OICR:CPCG_0414_Ly_R_PE_000_WG\tPG:fastqtobam\tPI:000\tPL:ILLUMINA\tPU:OICR:140305_SN1017_0409_BC3VKFACXX_4\tSM:140e5014-bdd6-4663-9404-234c7f9e927d /mnt/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_0    PN:bwa  VN:0.7.8-r455
@PG CL:/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/bin/PCAP-core-1.1.1/bin/bwa mem -t 15 -p -T 0 -R @RG\tID:OICR:CPCG_0414_Ly_R_PE_000_WG.2\tCN:OICR\tDT:2014-03-05T00:00:00+06:00\tLB:WGS:OICR:CPCG_0414_Ly_R_PE_000_WG\tPG:fastqtobam\tPI:000\tPL:ILLUMINA\tPU:OICR:140305_SN1017_0409_BC3VKFACXX_5\tSM:140e5014-bdd6-4663-9404-234c7f9e927d /mnt/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_1    PN:bwa  PP:bamsort_0    VN:0.7.8-r455
@PG CL:/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/bin/PCAP-core-1.1.1/bin/bwa mem -t 15 -p -T 0 -R @RG\tID:OICR:CPCG_0414_Ly_R_PE_000_WG.3\tCN:OICR\tDT:2014-03-05T00:00:00+06:00\tLB:WGS:OICR:CPCG_0414_Ly_R_PE_000_WG\tPG:fastqtobam\tPI:000\tPL:ILLUMINA\tPU:OICR:140305_SN1017_0409_BC3VKFACXX_6\tSM:140e5014-bdd6-4663-9404-234c7f9e927d /mnt/home/seqware/provisioned-bundles/Workflow_Bundle_BWA_2.6.1_SeqWare_1.1.0-alpha.5/Workflow_Bundle_BWA/2.6.1/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_2    PN:bwa  PP:bamsort_1    VN:0.7.8-r455
chapmanb commented 8 years ago

Tunc; Thanks for the read group details, this helped a lot in tracking down the underlying problem. Our check only issued a warning when there are multiple read groups (which you have), and as a result it did not raise an error even though the read group samples (SM) don't match the description. I fixed this so it'll error out on the mismatch up front instead of late in the run. We're building a new Docker container that should be available in an hour or so and will include this check going forward. Thanks for the help debugging the problem and hope the re-run and subsequent runs go smoother with this check in place.
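For checking many BAM files ahead of a run, the idea behind this check can be sketched as below. This is not the bcbio implementation, only an illustration assuming Python 3.7+ and samtools on the PATH; bam_header, read_group_samples and check_description are hypothetical names.

```python
import subprocess


def bam_header(bam_path):
    """Return the header of a BAM file as text, using `samtools view -H`."""
    return subprocess.check_output(["samtools", "view", "-H", bam_path], text=True)


def read_group_samples(header_text):
    """Collect the distinct SM values from the @RG lines of a SAM header."""
    samples = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue  # @PG lines may quote an @RG string, so match the line start only
        for field in line.split("\t")[1:]:
            if field.startswith("SM:") and field[3:] not in samples:
                samples.append(field[3:])
    return samples


def check_description(header_text, description):
    """Raise ValueError if the read groups name a sample other than `description`.

    Several read groups are accepted as long as they all share the same SM value.
    """
    samples = read_group_samples(header_text)
    if samples and samples != [description]:
        raise ValueError(
            "Read group samples %s do not match description %r"
            % (samples, description))
    return samples
```

Looping check_description(bam_header(path), description) over each sample in the YAML file would flag a mismatch like the one in this run before any caller starts.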