Variant calling on RNA-seq data using GATK4

yfukasawa commented 4 years ago

Hello there,

thank you for developing and maintaining a great pipeline! I'm now testing a variant call option on an RNA-seq dataset. Although I have 3 replicates, for this testing I just use 1 of them. Hope this didn't cause the issue. I've also tried WGS and WES, but didn't hit the below issue using 1.1.6a.

Version info

bcbio version (bcbio_nextgen.py --version): 1.1.6a and 1.2.0
GATK (4.1.3 for 1.1.6a and 4.1.6 for 1.2.0)
OS name and version: CentOS 7
Runs on a cluster using slurm. I get a single node using srun, and runs in a local mode.

To Reproduce

Exact bcbio command you have used: bcbio_nextgen.py ../config/rnaseq_var_test.yml -n 18
Your sample configuration file:

details:

analysis: RNA-seq genome_build: hg38 description: hs4k_p1 files:

/path/to/read_1.fastq.gz

/path/to/read_2.fastq.gz metadata: batch: batch1 condition: hs4k_1 algorithm: aligner: star spikein_fasta: /path/to/ercc/ERCC92.fa strandedness: firststrand variantcaller: gatk-haplotype expression_caller: salmon upload: dir: ../final`

Observed behavior Error message or bcbio output. [2020-04-02T07:36Z] 20/04/02 10:36:26 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 43.0 KB, free 23.7 GB) [2020-04-02T07:36Z] 20/04/02 10:36:26 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.0 KB, free 23.7 GB) [2020-04-02T07:36Z] 20/04/02 10:36:26 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:33204 (size: 11.0 KB, free: 23.7 GB) [2020-04-02T07:36Z] 20/04/02 10:36:26 INFO SparkContext: Created broadcast 1 from broadcast at AbstractBinarySamSource.java:82 [2020-04-02T07:36Z] 20/04/02 10:36:26 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 307.4 KB, free 23.7 GB) [2020-04-02T07:36Z] 20/04/02 10:36:26 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 25.4 KB, free 23.7 GB) [2020-04-02T07:36Z] 20/04/02 10:36:26 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:33204 (size: 25.4 KB, free: 23.7 GB) [2020-04-02T07:36Z] 20/04/02 10:36:26 INFO SparkContext: Created broadcast 2 from newAPIHadoopFile at PathSplitSource.java:96 [2020-04-02T07:36Z] 10:36:26.924 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled [2020-04-02T07:36Z] 20/04/02 10:36:27 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 15.3 MB, free 23.7 GB) [2020-04-02T07:36Z] 20/04/02 10:36:27 INFO SparkUI: Stopped Spark web UI at http://localhost:4040 [2020-04-02T07:36Z] 20/04/02 10:36:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! [2020-04-02T07:36Z] 20/04/02 10:36:27 INFO MemoryStore: MemoryStore cleared [2020-04-02T07:36Z] 20/04/02 10:36:27 INFO BlockManager: BlockManager stopped [2020-04-02T07:36Z] 20/04/02 10:36:27 INFO BlockManagerMaster: BlockManagerMaster stopped [2020-04-02T07:36Z] 20/04/02 10:36:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! [2020-04-02T07:36Z] 20/04/02 10:36:27 INFO SparkContext: Successfully stopped SparkContext [2020-04-02T07:36Z] 10:36:27.122 INFO HaplotypeCallerSpark - Shutting down engine [2020-04-02T07:36Z] [April 2, 2020 10:36:27 AM AST] org.broadinstitute.hellbender.tools.HaplotypeCallerSpark done. Elapsed time: 0.16 minutes. [2020-04-02T07:36Z] Runtime.totalMemory()=1168113664 [2020-04-02T07:36Z] Exception in thread "main" java.lang.StackOverflowError [2020-04-02T07:36Z] at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:225) [2020-04-02T07:36Z] at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:161) [2020-04-02T07:36Z] at com.esotericsoftware.kryo.util.MapReferenceResolver.addWrittenObject(MapReferenceResolver.java:41) [2020-04-02T07:36Z] at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:681) [2020-04-02T07:36Z] at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:570) [2020-04-02T07:36Z] at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) [2020-04-02T07:36Z] at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:505) [2020-04-02T07:36Z] at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575) [2020-04-02T07:36Z] at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) [2020-04-02T07:36Z] at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:505)

and so on. Kryo.java:575, ObjectField.java:80, and FieldSerializer.java:505 error repeated.

Expected behavior I tried to upgrade bcbio_nextgen (and also GATK4), but it returns the same error. When I changed the command to use a single core, it works. bcbio_nextgen.py ../config/rnaseq_var_test.yml -n 1

Is this related to GATK spark issue?

naumenko-sa commented 4 years ago

Thanks for reporting, @yfukasawa!

I'm facing a similar issue in one of the RNA-seq variant calling projects. Running with one thread allowed me to see the underlying issue.

So I think new 1.2.2 bcbio release, which is coming soon + gatk 4.1.6.0 will solve this.

Sergey

roryk commented 4 years ago

@yfukasawa did this still fail with GATK 4.1.6.0? I saw you installed it-- does it also run into this error? Or did you have a different problem with that.

yfukasawa commented 4 years ago

Thank you for swift replies, @naumenko-sa and @roryk. Unfortunately, yes, bcbio 1.2.0 seems to use GATK 4.1.6.0, and I faced the above error with GATK 4.1.6.0. It looks the error still exists in GATK4 4.1.6.0.

If I use a single core, haplotypecaller part was finished. BTW, I found another issue between bcbio 1.2.0 and GATK 4.1.6.0.

Edit: sorry, I noticed the below part is already taken care by 1.2.2 after I posted. Please ignore.

From 4.1.5.0, GenotypeGVCF seems to have changed the option a bit, at least "use-new-qual-calculator" does not exist anymore (https://gatk.broadinstitute.org/hc/en-us/articles/360040509751-GenotypeGVCFs). I found that option in GATK 4.1.4.0, but not in 4.1.5.0 and 4.1.6.0.

The error message I saw is below: [2020-04-02T15:57Z] [2020-04-02T15:57Z] A USER ERROR has occurred: use-new-qual-calculator is not a recognized option [2020-04-02T15:57Z]

Thanks!

naumenko-sa commented 4 years ago

Hi @yfukasawa !

bcbio wrapper of GATK has been recently fixed to meet the deprecation of use-new-qual-calculator https://github.com/bcbio/bcbio-nextgen/commit/188ace3ae6f1cd781464519af8db2eea735a57fb

So it is totally logical that it fails with bcbio 1.2.0 and gatk 4.1.6.0

could you please update bcbio to 1.2.3: bcbio_nextgen.py upgrade -u stable and try again? S

naumenko-sa commented 4 years ago

I'm facing the similar issue when validating RNA-seq variant calling with NA12878 and gatk 4.1.6.0 The current workaround is to use gatk3.8:

tools_off:
- gatk4

yfukasawa commented 4 years ago

Hi @naumenko-sa !

Thank you for following up this. Noted the workaround for gatk 4.1.6.0. I noticed if we specify the number of process = 1, not haplotypecallerspark but haplotypecaller is called. maybe this is why RNA-seq analysis works with n=1.

I have talked about this with my colleague, and he mentioned spark may require HDFS. our cluster doesn't use HDFS, so if this is a requirement, spark version might suffer from file system issue.

thanks!

naumenko-sa commented 4 years ago

Thanks @yfukasawa!

It is indeed a spark issue. The command which is failing is:

#!/bin/bash

export SPARK_USER=sn240 && unset JAVA_HOME && export PATH=/n/app/bcbio/dev/anaconda/bin:"$PATH" && \
gatk --java-options '-Xms750m -Xmx24571m -Djava.io.tmpdir=.' **HaplotypeCallerSpark** \
-R /n/app/bcbio/dev/genomes/Hsapiens/hg38/seq/hg38.fa \
--annotation MappingQualityRankSumTest \
--annotation MappingQualityZero \
--annotation QualByDepth \
--annotation ReadPosRankSumTest \
--annotation RMSMappingQuality \
--annotation BaseQualityRankSumTest \
--annotation FisherStrand \
--annotation MappingQuality \
--annotation DepthPerAlleleBySample \
--annotation Coverage \
-I /path/gatk4_hg38/work/align/NA12878/NA12878-dedup.splitN.bam \
-L /path/gatk4_hg38/work/variation/rnaseq/gatk-haplotype/regions/NA12878-chr1_0_64384459-gatk-haplotype-regions.bed \
--interval-set-rule INTERSECTION \
--spark-master local[2] \
--conf spark.local.dir=. \
--conf spark.driver.host=localhost \
--conf spark.network.timeout=800 \
--conf spark.executor.heartbeatInterval=100 \
--annotation ClippingRankSumTest \
--annotation DepthPerSampleHC \
--native-pair-hmm-threads 1 \
--emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 -ploidy 2 \
--output NA12878-chr1_0_64384459-gatk-haplotype.vcf

and among the log messages there are several warning:

23:38:27.925 INFO  HaplotypeCallerSpark - Use the non-spark HaplotypeCaller if you care about the results.

But you don't need a SPARK cluster to run HaplotypeCallerSpark: https://gatk.broadinstitute.org/hc/en-us/articles/360035890591-Spark

It worked for us before.

Maybe switching to SPARK version decreased our precision in RNA-seq variant calling? Testing some more...

naumenko-sa commented 4 years ago

Currently RNA-seq variant calling is not working with gatk4 and gatk3. It works with vardict.

details:
- algorithm:
    aligner: star
    strandedness: unstranded
    variantcaller: vardict
  analysis: RNA-seq
  description: NA12878
  files:
  - /path/gatk4_hg38/input/SRR307898_1.fq.gz
  - /path/gatk4_hg38/input/SRR307898_2.fq.gz
  genome_build: hg38
  metadata:
    batch: bNA12878
resources:
  default:
    cores: 2
    jvm_opts:
    - -Xms750m
    - -Xmx15000m
    memory: 15G
upload:
  dir: ../final

The issue tracked down to pybedtools and fixed there. Waiting for a release of pybedtools to push gatk variant calling for RNA-seq. https://github.com/bcbio/bcbio-nextgen/issues/3078

naumenko-sa commented 4 years ago

RNA-seq variant calling uses gVCF and with threads > 1 HaplotypeCallerSpark switched on an crashed: turned it off in https://github.com/bcbio/bcbio-nextgen/pull/3214

bcbio / bcbio-nextgen

Variant calling on RNA-seq data using GATK4 #3161