PrintReads - No parallelization

xexpanderx commented 6 years ago

I ran bcbio on a whole-genome sample and found that the GATK PrintReads step is a major bottleneck. At the moment I am only running whole-genome analysis on a single individual. Looking at the log, I can see that the "nct" option is not set:

[2018-02-12T23:45Z] compute02: export JAVA_HOME=/sw/compilers/oracle-jdk-1.8/1.8.0_152 && export PATH=/sw/compilers/oracle-jdk-1.8/1.8.0_152/bin:$PATH && /sw/pipelines/bcbio-nextgen/1.0.5/anaconda/bin/gatk -Xmx91728m -Djava.io.tmpdir=/gluster-storage-volume/projects/wp3/WGS/QE-1452/bcbio/bcbiotx/tmpmdp8Lc -T PrintReads -R /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta -I /gluster-storage-volume/projects/wp3/WGS/QE-1452/bcbio/align/55683/55683-sort.bam -BQSR /gluster-storage-volume/projects/wp3/WGS/QE-1452/bcbio/align/55683/55683-sort-recal.grp -o /gluster-storage-volume/projects/wp3/WGS/QE-1452/bcbio/bcbiotx/tmpmdp8Lc/55683-sort-recal.bam -jdk_deflater -jdk_inflater -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment

My bcbio-config looks like this:

---
# Configuration file specifying system details for running an analysis pipeline
# These pipeline apply generally across multiple projects. Adjust them in sample
# specific configuration files when needed.

# -- Base setup

# Define resources to be used for individual programs on multicore machines.
# These can be defined specifically for memory and processor availability.
# - memory: Specify usage for memory intensive programs. The indicated value specifies the wanted *per core* usage.
# - cores: Define cores that can be used for multicore programs.
# - jvm_opts: specify details
# - cmd: Command to launch the program, if not located on PATH.
# - dir: Directory containing program associated data. Especially useful for
#        java jars
resources:
  # default options, used if other items below are not present
  # avoids needing to configure/adjust for every program
  default:
    memory: 7G
    cores: 64
    jvm_opts: ["-Xmx7000m"]
  log:
    dir: /sw/pipelines/bcbio-nextgen/1.0.5/share/java/log
  tmp:
    dir: null
  ucsc_bigwig:
    memory: 7g
  bwa:
    cmd: bwa
    memory: 7g
    cores: 64
  samtools:
    memory: 7G
    cores: 64
  star:
    memory: 7g
    cores: 64
  snap:
    memory: 7G
    cores: 64
  kraken:
    memory: 7G
    cores: 64
  qsignature:
    memory: 7G
    cores: 64
  qualimap:
    memory: 7g
    cores: 64
  qsnp:
    jvm_opts: ["-Xmx7000m"]
  gatk:
    jvm_opts: ["-Xmx7000m"]
  gatk-haplotype:
    jvm_opts: ["-Xmx7000m"]
  gatk-vqsr:
    jvm_opts: ["-Xmx7000m"]
  picard:
    jvm_opts: ["-Xmx7000m"]
  snpeff:
    jvm_opts: ["-Xmx7000m"]
  bcbio_variation:
    jvm_opts: ["-Xmx7000m"]
#  rnaseqc:
#    dir: /sw/pipelines/bcbio/0.8.9/share/java/RNA-SeQC
#    jvm_opts: ["-Xmx7000m"]
#  mutect:
#    jvm_opts: ["-Xmx7000m"]
#    dir: /sw/pipelines/bcbio/0.8.9/share/java/mutect
  varscan:
    jvm_opts: ["-Xmx7000m"]
  vardict:
    jvm_opts: ["-Xmx7000m"]
  oncofuse:
    jvm_opts: ["-Xmx7000m"]
  express:
    memory: 7g
  dexseq:
    memory: 7g
  macs2:
    memory: 7g
  seqcluster:
    memory: 7g

# Location of galaxy configuration file, which has pointers to reference data
# https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#reference-genome-files
galaxy_config: universe_wsgi.ini

# -- Additional options for specific integration, not required for standalone usage.

# Galaxy integration. Required for retrieving information from Galaxy LIMS.
#galaxy_url: http://your/galaxy/url
#galaxy_api_key: your_galaxy_api_key

# Details for hooking automated processing to a sequencer machine.
# Not required if running standalone pipelines.
# analysis:
#   # Can specify a different remote host to initiate
#   # the copy from. This is useful for NFS shared filesystems
#   # where you want to manage the copy from the base machine.
#   copy_user:
#   copy_host:
#   store_dir: /store4/solexadata
#   base_dir: /array0/projects/Sequencing
#   worker_program: nextgen_analysis_server.py

Any way to parallelize PrintReads?

bcbio version used: 1.0.7a

chapmanb commented 6 years ago

Thanks for the detailed report and sorry about the issue. Are you in a position to update to the latest release (1.0.8)? It uses GATK4 by default, and the BQSR step, which is bottlenecking your run here, is parallelized over multiple cores using the new GATK4 Spark implementations. If you have a recent 1.0.7a version, you could also try adding tools_on: [gatk4] to your current configuration; it should then use GATK4 instead of your current GATK 3, which is not parallelized. Hope this fixes things for you, and please let us know if you run into other issues.

xexpanderx commented 6 years ago

Hi, do you mean like this (see the end):

---
# Configuration file specifying system details for running an analysis pipeline
# These pipeline apply generally across multiple projects. Adjust them in sample
# specific configuration files when needed.

# -- Base setup

# Define resources to be used for individual programs on multicore machines.
# These can be defined specifically for memory and processor availability.
# - memory: Specify usage for memory intensive programs. The indicated value specifies the wanted *per core* usage.
# - cores: Define cores that can be used for multicore programs.
# - jvm_opts: specify details
# - cmd: Command to launch the program, if not located on PATH.
# - dir: Directory containing program associated data. Especially useful for
#        java jars
resources:
  # default options, used if other items below are not present
  # avoids needing to configure/adjust for every program
  default:
    memory: 7G
    cores: 64
    jvm_opts: ["-Xmx7000m"]
  log:
    dir: /sw/pipelines/bcbio-nextgen/1.0.5/share/java/log
  tmp:
    dir: null
  ucsc_bigwig:
    memory: 7g
  bwa:
    cmd: bwa
    memory: 7g
    cores: 64
  samtools:
    memory: 7G
    cores: 64
  star:
    memory: 7g
    cores: 64
  snap:
    memory: 7G
    cores: 64
  kraken:
    memory: 7G
    cores: 64
  qsignature:
    memory: 7G
    cores: 64
  qualimap:
    memory: 7g
    cores: 64
  qsnp:
    jvm_opts: ["-Xmx7000m"]
  gatk:
    jvm_opts: ["-Xmx7000m"]
  gatk-haplotype:
    jvm_opts: ["-Xmx7000m"]
  gatk-vqsr:
    jvm_opts: ["-Xmx7000m"]
  picard:
    jvm_opts: ["-Xmx7000m"]
  snpeff:
    jvm_opts: ["-Xmx7000m"]
  bcbio_variation:
    jvm_opts: ["-Xmx7000m"]
#  rnaseqc:
#    dir: /sw/pipelines/bcbio/0.8.9/share/java/RNA-SeQC
#    jvm_opts: ["-Xmx7000m"]
#  mutect:
#    jvm_opts: ["-Xmx7000m"]
#    dir: /sw/pipelines/bcbio/0.8.9/share/java/mutect
  varscan:
    jvm_opts: ["-Xmx7000m"]
  vardict:
    jvm_opts: ["-Xmx7000m"]
  oncofuse:
    jvm_opts: ["-Xmx7000m"]
  express:
    memory: 7g
  dexseq:
    memory: 7g
  macs2:
    memory: 7g
  seqcluster:
    memory: 7g

# Location of galaxy configuration file, which has pointers to reference data
# https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#reference-genome-files
galaxy_config: universe_wsgi.ini

# -- Additional options for specific integration, not required for standalone usage.

# Galaxy integration. Required for retrieving information from Galaxy LIMS.
#galaxy_url: http://your/galaxy/url
#galaxy_api_key: your_galaxy_api_key

# Details for hooking automated processing to a sequencer machine.
# Not required if running standalone pipelines.
# analysis:
#   # Can specify a different remote host to initiate
#   # the copy from. This is useful for NFS shared filesystems
#   # where you want to manage the copy from the base machine.
#   copy_user:
#   copy_host:
#   store_dir: /store4/solexadata
#   base_dir: /array0/projects/Sequencing
#   worker_program: nextgen_analysis_server.py
tools_on: [gatk4]

chapmanb commented 6 years ago

Sorry for the confusion. This needs to go in your sample YAML file in the algorithm section:

algorithm:
   tools_on: [gatk4]

More details are in the documentation here:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#changing-bcbio-defaults

Hope this helps.
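
To make the placement concrete, here is a minimal Python sketch that checks a parsed sample configuration for tools_on in the right place. It assumes the usual sample YAML layout, with a top-level details list whose entries each carry an algorithm mapping. The helper name gatk4_enabled is hypothetical and is not part of bcbio.

```python
def gatk4_enabled(sample_config):
    """Return one boolean per sample: is gatk4 listed under algorithm -> tools_on?

    sample_config is the sample YAML already parsed into a dict
    (for example with yaml.safe_load). This helper only looks inside
    each sample's algorithm section and ignores a top-level tools_on key.
    """
    results = []
    for sample in sample_config.get("details", []):
        algorithm = sample.get("algorithm") or {}
        tools_on = algorithm.get("tools_on") or []
        results.append("gatk4" in tools_on)
    return results


# tools_on at the top level, outside any algorithm section: not counted.
misplaced = {"tools_on": ["gatk4"], "details": [{"algorithm": {"aligner": "bwa"}}]}
# tools_on inside the algorithm section of a sample: counted.
correct = {"details": [{"algorithm": {"aligner": "bwa", "tools_on": ["gatk4"]}}]}

print(gatk4_enabled(misplaced))  # [False]
print(gatk4_enabled(correct))    # [True]
```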

xexpanderx commented 6 years ago

Thank you! I will try that!

xexpanderx commented 6 years ago

Running with GATK4 I encountered this problem: https://github.com/chapmanb/bcbio-nextgen/issues/2037

So, basically, I have to use faToTwoBit to convert our reference file to a 2bit indexed genome. My question is: once I have this file, where do I put it so that GATK4 will find it? (I don't want to upgrade bcbio because this is a "production" system, and upgrading requires a whole lot of validation.)
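
For reference, the conversion step itself can be sketched in Python as below. This assumes the UCSC faToTwoBit binary is on PATH and uses its basic form, faToTwoBit in.fa out.2bit. The file names and the helper fa_to_twobit_command are placeholders for illustration only.

```python
import os
import shutil
import subprocess


def fa_to_twobit_command(fasta_path, twobit_path):
    """Build the UCSC faToTwoBit call: faToTwoBit in.fa out.2bit."""
    return ["faToTwoBit", fasta_path, twobit_path]


# Placeholder file names; substitute the real reference paths.
cmd = fa_to_twobit_command("hg19.with.mt.fasta", "hg19.with.mt.2bit")

if shutil.which("faToTwoBit") and os.path.exists(cmd[1]):
    # Runs only when the tool and the input FASTA are both present;
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(cmd, check=True)
else:
    print("Not running; the command would be:", " ".join(cmd))
```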

I guess I should add it to one of these files: alignseq.loc bowtie2_indices.loc bwa_index.loc gatk_sorted_picard_index.loc picard_index.loc sam_fa_indices.loc

gatk_sorted_picard_index.loc looks like this: hg19_consensus hg19_consensus Human (hg19_consensus) /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta

Could I change it to this (with new 2bit file hg19.with.mt.fasta_twobit):

hg19_consensus  hg19_consensus  Human (hg19_consensus)  /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta
hg19_consensus_twobit hg19_consensus_twobit Human (hg19_consensus_twobit)  /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta_twobit

Then I change "genome_build" in the sample config YAML file to: genome_build: hg19_consensus_twobit

Is this correct?

xexpanderx commented 6 years ago

Digging further and looking into tool_data_table_conf.xml, I find this section:

<!-- UCSC twoBit indexed files -->
<table name="twobit">
<columns>dbkey, value</columns>
<file path="tool-data/twobit.loc" />
</table>

Does this mean I should create a file "twobit.loc" defining our custom reference 2bit indexed genome there?

xexpanderx commented 6 years ago

Trying to use "twobit.loc" like this:

hg19_consensus /data/ref_genomes/hg19/seq/hg19.with.mt.2bit

GATK4 still fails with Reference=None. Apparently it looks for another loc file? I'm stuck and waiting for answers.

Thank you.

chapmanb commented 6 years ago

Sorry about the confusion. bcbio looks for the 2bit version in a ucsc directory relative to the fasta file so you'll have:

hg19/seq/hg19.with.mt.fa
hg19/ucsc/hg19.with.mt.2bit

Hope that gets it working for you.
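
As a rough illustration of that convention (not bcbio's actual code), the expected location can be derived from the FASTA path like this. The function name expected_twobit is hypothetical, and the sketch assumes the lookup works on the path exactly as configured, without resolving symlinks.

```python
import os


def expected_twobit(fasta_path):
    """Guess where the 2bit file is expected for a given reference FASTA.

    Assumption: <genome>/<subdir>/<name>.<ext> maps to <genome>/ucsc/<name>.2bit,
    following the layout described above. The path is used as given;
    symlinks are deliberately not resolved.
    """
    genome_dir = os.path.dirname(os.path.dirname(fasta_path))
    name = os.path.splitext(os.path.basename(fasta_path))[0]
    return os.path.join(genome_dir, "ucsc", name + ".2bit")


print(expected_twobit("hg19/seq/hg19.with.mt.fa"))
# hg19/ucsc/hg19.with.mt.2bit
```

Under this assumption, a FASTA configured in a directory other than seq still needs a ucsc directory as a sibling of its own parent directory.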

xexpanderx commented 6 years ago

Thank you, but no, it is still not working.

This is what I have:

alignseq.loc: hg19 hg19 hg19 /sw/pipelines/bcbio/0.9.9/genomes/human/hg19/seq/hg19.fa

bowtie2_indices.loc: hg19 hg19 Human (hg19) /sw/pipelines/bcbio-nextgen/1.0.5/genomes/Hsapiens/hg19/bowtie2/hg19

bwa_index.loc:

hg19_consensus  hg19_consensus  hg19_consensus  /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta

gatk_sorted_picard_index.loc: hg19_consensus hg19_consensus Human (hg19_consensus) /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta

picard_index.loc: hg19_consensus hg19_consensus Human (hg19_consensus) /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta

sam_fa_indices.loc:

index   hg19_consensus  /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta
index   hg19    /sw/pipelines/bcbio/0.9.9/genomes/human/hg19/seq/hg19.fa

twobit.loc: hg19_consensus /data/ref_genomes/hg19/ucsc/hg19.with.mt.2bit

ls -al /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta: /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta => /data/ref_genomes/hg19/seq/hg19.with.mt.fasta

And I also have: ls /data/ref_genomes/hg19/ucsc/:

hg19.with.mt.2bit
hg19.with.mt.2bit.chrom.sizes

Still getting --reference None.

xexpanderx commented 6 years ago

I found the error: /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta is a symlink to /data/ref_genomes/hg19/seq/hg19.with.mt.fasta, but bcbio does not follow the symlink, so it does not work out that the relative path to the 2bit indexed genome should be /data/ref_genomes/hg19/ucsc/hg19.with.mt.2bit.

So, I had to symlink /data/ref_genomes/hg19/ucsc/ to /data/ref_genomes/hg19/bwa/ for it to work.
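
The effect of this workaround can be reproduced in a scratch directory with the Python sketch below. The helper sibling_twobit is hypothetical and only mimics the assumed lookup (../ucsc/<name>.2bit relative to the FASTA path as written); it is not bcbio's code. It also assumes a filesystem that supports symlinks.

```python
import os
import shutil
import tempfile

# Recreate the layout from this thread inside a scratch directory.
root = tempfile.mkdtemp()
genome = os.path.join(root, "hg19")
for sub in ("seq", "ucsc", os.path.join("bwa", "BWA_0.7.10_refseq")):
    os.makedirs(os.path.join(genome, sub))

real_fasta = os.path.join(genome, "seq", "hg19.with.mt.fasta")
twobit = os.path.join(genome, "ucsc", "hg19.with.mt.2bit")
for path in (real_fasta, twobit):
    open(path, "w").close()

# The configured reference is a symlink into seq/, as in the loc files.
configured = os.path.join(genome, "bwa", "BWA_0.7.10_refseq", "hg19.with.mt.fasta")
os.symlink(real_fasta, configured)


def sibling_twobit(fasta_path):
    # Assumed lookup: ../ucsc/<name>.2bit relative to the path as written.
    base = os.path.dirname(os.path.dirname(fasta_path))
    name = os.path.splitext(os.path.basename(fasta_path))[0]
    return os.path.join(base, "ucsc", name + ".2bit")


before = os.path.exists(sibling_twobit(configured))

# The workaround: expose ucsc/ inside bwa/ via a directory symlink.
os.symlink(os.path.join(genome, "ucsc"), os.path.join(genome, "bwa", "ucsc"))
after = os.path.exists(sibling_twobit(configured))

print(before, after)  # False True
shutil.rmtree(root)
```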

Wouldn't it be easier to define the full path in a loc file for GATK4 instead?

chapmanb commented 6 years ago

Sorry about the continued issues and glad that you got it figured out. We're slowly moving away from *.loc files toward a consistent directory structure for new projects, so symlinking or adjusting to the standardized structure is exactly the right thing to do here. The consistent directory structure is easier to support and fits better with the in-progress Common Workflow Language (CWL) integration. Hope things work cleanly going forward.

bcbio / bcbio-nextgen

PrintReads - No parallelization #2265