bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 354 forks source link

GATK4 GatherVcfs "first record in file ... is not after first record in previous file" #2420

Closed sxv closed 5 years ago

sxv commented 6 years ago

Hi,

I am excited to be using the new gatk4 workflow on a fresh bcbio development install. In this case I'm running a tumor/normal pair of targeted DNA-Seq from pre-aligned bam files. I've read mixed opinions about whether or not recalibration is recommended for gatk4, so advice on that would be helpful as well. Anyway, I'm getting the following error message:

[Wed Jun 20 10:54:08 PDT 2018] Executing on Linux 4.13.0-43-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Deflater: Intel; Inflater: Intel; Picard version: Version:4.0.3.0 INFO 2018-06-20 10:54:08 GatherVcfs Checking inputs. INFO 2018-06-20 10:54:08 GatherVcfs Checking file headers and first records to ensure compatibility. [Wed Jun 20 10:54:08 PDT 2018] picard.vcf.GatherVcfs done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=460259328 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp java.lang.IllegalArgumentException: First record in file /15/mutect2/chr1/prostate_lcm_15_19-chr1_109811749_142535913.vcf.gz is not after first record in previous file /15/mutect2/prostate_lcm_15_19-chrM_0_16571-fixheader.vcf.gz at picard.vcf.GatherVcfs.assertSameSamplesAndValidOrdering(GatherVcfs.java:128) at picard.vcf.GatherVcfs.doWork(GatherVcfs.java:74) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269) at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) ' returned non-zero exit status 3

Here are the first lines of the vcfs:

$ zcat /15/mutect2/chr1/prostate_lcm_15_19-chr1_109811749_142535913.vcf.gz | grep -v ^# | head -1

chr1 121484118 . G A . artifact_in_normal;clustered_events;mapping_quality ClippingRankSum=0;DP=176;ECNT=6;FS=18.372;MQ=30.56;MQ0=0;MQRankSum=0.498;NLOD=3.95;N_ART_LOD=1.28;POP_AF=5e-08;P_CONTAM=0;P_GERMLINE=-21.79;ReadPosRankSum=1.505;TLOD=19.33 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/1:105,13:0.178:118:52,7:53,6:26:172,0:29:39:0.111,0,0.11:0.0009419,0.689,0.31 0/0:29,2:0.111:31:15,1:14,1:22:213,251:29:47:.:.

$ zcat /15/mutect2/prostate_lcm_15_19-chrM_0_16571-fixheader.vcf.gz | grep -v ^# | head -1

chrM 12169 . G C . MinAF;clustered_events;mapping_quality ClippingRankSum=0;DP=160;ECNT=5;FS=15.473;MQ=57.08;MQ0=0;MQRankSum=-6.637;NLOD=2.41;N_ART_LOD=-0.9542;POP_AF=5e-08;P_CONTAM=0;P_GERMLINE=-37.9;ReadPosRankSum=-2.453;TLOD=14.65 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB 0/1:145,6:0.043:151:80,5:65,1:25:196,251:29:9:0|1:12169_G_C:0.04,0,0.04:0.001676,0.149,0.849 0/0:8,0:3.162e-05:8:6,0:2,0:0:248,0:0:0:0|1:12169_G_C:.:.

Is the error because chr1 shouldn't follow chrM, or something else going on?

Here's my config:

---
fc_date: "20180619"
fc_name: "prostate_lcm.15_19"
upload:
  dir: /prostate.lcm/
resources:
  tmp:
    dir: /tmp
details:
  - analysis: variant2
    genome_build: hg19
    metadata:
       batch: prostate.lcm.15_19
       phenotype: tumor
    description: 15.tumor
    files: [PC_AP-14_CAGATC_15_1_2_T34.bam]
    algorithm:
      mark_duplicates: true
      recalibrate: true
      bam_clean: fixrg
      platform: illumina
      tools_off: [gemini]
      tools_on: gatk4
      variantcaller: mutect2
      variant_regions: ../PC_baits.target.bed
  - analysis: variant2
    genome_build: hg19
    metadata:
       batch: prostate.lcm.15_19
       phenotype: normal
    description: 19.normal
    files: [N_AP-14_ATGTCA_19_1_ST1_ST1.bam]
    algorithm:
      mark_duplicates: true
      recalibrate: true
      bam_clean: fixrg
      platform: illumina
      tools_off: [gemini]
      tools_on: gatk4
      variantcaller: mutect2
      variant_regions: ../PC_baits.target.bed

Thanks!

chapmanb commented 6 years ago

Sujay; Sorry about the issue here I'm not entirely sure what is going wrong. A couple of questions:

I'm wondering if there is some kind of out of order contig thing happening in the VCF header which is confusing GATK here.

As another check, what is the last record of the chrM split VCF file:

zcat /15/mutect2/prostate_lcm_15_19-chrM_0_16571-fixheader.vcf.gz | grep -v ^# | tail -1

Sorry to not know exactly what is happening, but hope this helps some with debugging.

lbeltrame commented 5 years ago

Got a similar problem here (with hg19, distributed with bcbio, aligned against the same hg19). This however occurs with RNA-seq variant calling (same issue on chromosome M):

The output on the problematic file is as follows.

chrM    15957   .       T       <NON_REF>       .       .       END=16023       GT:DP:GQ:MIN_DP:PL      0:478:99:92:0,1800
lbeltrame commented 5 years ago

It looks like the contigs are indeed in the wrong order.

"Right" GATK file from another project:

##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>

From this one:.

##contig=<ID=chr1,length=249250621>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr11_gl000202_random,length=40103>
chapmanb commented 5 years ago

Luca; Thanks for the report and sorry this is causing issues. Are you able to track down at all what is happening? We haven't seen this issue and I'm not sure how best to reproduce. It looks like your problem file is not sorted in GATK order but in numerical order, which typically indicates a differently sorted reference file somewhere. If you're able to provide more details or ideas about what might be happening happy to dig more.

lbeltrame commented 5 years ago

At first I thought we were using pre-aligned files, but after a look at the configuration, it started from FASTQ files, so they were all produced during the run, using the same assembly:

grep build RNA-Seq-complete.yaml | sort -u
  genome_build: hg19

For now I fixed the VCFs manually with UpdateVCFSequenceDictionary from the GATK and then replaced them in the work directory by bcbio.

davidvanzessen commented 5 years ago

Hi, I'm stuck on the same error:

ERROR   2019-05-01 12:15:33     GatherVcfs      There was a problem with gathering the INPUT.java.lang.IllegalArgumentException: First record in file /home/bioinf/bcbio/working/freebayes/chr2/Batch1-chr2_0_13657608.vcf.gz is not after first record in previous file /home/bioinf/bcbio/working/freebayes/chr1_GL456221_random/Batch1-chr1_GL456221_random_0_205860.vcf.gz

I made gists of the vcf files (with only the first record): https://gist.github.com/davidvanzessen/644924290d9afd89b19bd5b3ffad0434

My bcbio config yml:

details:
- algorithm:
    aligner: bwa
    mark_duplicates: true
    remove_lcr: true
    variantcaller: [freebayes,vardict,varscan]
    ensemble:
    numpass: 2
    align_split_size: false
analysis: variant2
lane: 1
description: Galaxy_1
files: [/home/bioinf/bcbio/input/normal_0_R1.fastq.gz, /home/bioinf/bcbio/input/normal_0_R2.fastq.gz]
genome_build: mm10
metadata:
    phenotype: normal
    batch: Batch1
- algorithm:
    aligner: bwa
    mark_duplicates: true
    remove_lcr: true
    variantcaller: [freebayes,vardict,varscan]
    ensemble:
    numpass: 2
    align_split_size: false
analysis: variant2
lane: 2
description: Galaxy_2
files: [/home/bioinf/bcbio/input/tumor_0_R1.fastq.gz, /home/bioinf/bcbio/input/tumor_0_R2.fastq.gz]
genome_build: mm10
metadata:
    phenotype: tumor
    batch: Batch1  

Hope someone can help, let me know if there is any more information I can provide.

chapmanb commented 5 years ago

Thanks for the detailed report and apologies about the issue. I'm not exactly sure how these are getting out of order, but it looks from the error message like the merge has sorted one of the extra chromosomes (chr1_GL456221_random) before a standard chromosome (chr2) causing the complaint. We pushed some debugging output that might help us isolate the issue if you're able to update to the latest development (bcbio_nextgen.py upgrade -u development), re-run in place and report the full error message with debug output that comes back as a gist. Additionally, if you look at your mm10 reference fasta index (/home/bioinf/bcbio_test/genomes/Mmusculus/mm10/seq/mm10.fa.fai), could you confirm the ordering of contigs in there matches what's in the VCF inputs?

Sorry again about the issues and thanks for the help debugging this.

davidvanzessen commented 5 years ago

The ordering of the contigs in mm10.fa.fai is different:
https://gist.github.com/davidvanzessen/ba021e7b69b7c346d0d1e13cbd3750ca

I've upgraded to the development branch and I'm rerunning the analysis, I'll post back here when it's done.

Thank you for taking the time to help solve this :+1:

chapmanb commented 5 years ago

Thank you for helping debug this. The fai file is the issue, as the contigs in there are out of order with respect to what is in the bwa indices and subsequent VCF files. This mismatch ends up with us trying to order the VCFs incorrectly and the error.

If this mm10 file got created with bcbio, I think the issue is that it's not using the pre-created mm10 downloads and instead trying to create and sort them as part of the install. I pushed a fix to download and use the pre-existing ones. If you update with:

mv /home/bioinf/bcbio_test/genomes/Mmusculus/mm10/seq /home/bioinf/bcbio_test/genomes/Mmusculus/mm10/seq.problem
bcbio_nextgen.py upgrade --data

I hope it'll pull the correct ones, then you can re-run your analysis in place without needing to change anything and it should hopefully do the right thing with sorting. Please let us know if you still hit any issues.

davidvanzessen commented 5 years ago

Renaming the seq directory and running bcbio_nextgen.py upgrade --data gives the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/bioinf/bcbio_test/genomes/Mmusculus/mm10/seq/mm10-resources.yaml'  

I tried just creating the seq directory with just mm10-resources.yaml, but it then instantly jumps to Upgrade completed successfully. without downloading anything new.

chapmanb commented 5 years ago

Thank you for testing this and apologies about the upgrade error. I made a mistake and forgot to wire in the new recipe for the seq directory so it didn't actually get run, and then it errors later when expecting to find that directory. Could you try the upgrade one more time with a fresh cloudbiolinux download:

rm -rf tmpbcbio-install
bcbio_nextgen.py upgrade --data

Hope this works cleanly for you and thank you again for the help debugging.

davidvanzessen commented 5 years ago

It worked, thank you so much for the help :+1:

xwang445 commented 1 year ago

I just met the same error with this command:

vcf_file_string="" for vcf in $(ls -1 newcontiglines.vcfchr*.vcf) ; do vcf_file_string="$vcf_file_string -I $vcf" done

concatenate VCF files

module load gatk/4.3.0.0 gatk GatherVcfs \ -R genome.fa $vcf_file_string -O dbSNP.vcf

then I ls the vcf files by numeric order, it works fine.

vcf_file_string="" for vcf in $(ls -1v newcontiglines.vcfchr*.vcf) ; do vcf_file_string="$vcf_file_string -I $vcf" done

concatenate VCF files

module load gatk/4.3.0.0 gatk GatherVcfs \ -R genome.fa $vcf_file_string -O dbSNP.vcf