Thanks for the detailed report and sorry about the issue. Are you in a position to update to the latest release (1.0.8)? It uses GATK4 by default, and the BQSR step that is bottlenecking your run is parallelized over multiple cores using the new GATK4 Spark implementations. If you have a recent 1.0.7a version, you could also try adding tools_on: [gatk4]
to your current configuration, which should make bcbio use GATK4 instead of your current GATK 3, which is not parallelized. Hope this fixes things for you and please let us know if you run into other issues.
Hi, do you mean like this (see the end):
---
# Configuration file specifying system details for running an analysis pipeline
# These pipeline apply generally across multiple projects. Adjust them in sample
# specific configuration files when needed.
# -- Base setup
# Define resources to be used for individual programs on multicore machines.
# These can be defined specifically for memory and processor availability.
# - memory: Specify usage for memory intensive programs. The indicated value specifies the wanted *per core* usage.
# - cores: Define cores that can be used for multicore programs.
# - jvm_opts: specify details
# - cmd: Command to launch the program, if not located on PATH.
# - dir: Directory containing program associated data. Especially useful for
#        java jars
resources:
  # default options, used if other items below are not present
  # avoids needing to configure/adjust for every program
  default:
    memory: 7G
    cores: 64
    jvm_opts: ["-Xmx7000m"]
  log:
    dir: /sw/pipelines/bcbio-nextgen/1.0.5/share/java/log
  tmp:
    dir: null
  ucsc_bigwig:
    memory: 7g
  bwa:
    cmd: bwa
    memory: 7g
    cores: 64
  samtools:
    memory: 7G
    cores: 64
  star:
    memory: 7g
    cores: 64
  snap:
    memory: 7G
    cores: 64
  kraken:
    memory: 7G
    cores: 64
  qsignature:
    memory: 7G
    cores: 64
  qualimap:
    memory: 7g
    cores: 64
  qsnp:
    jvm_opts: ["-Xmx7000m"]
  gatk:
    jvm_opts: ["-Xmx7000m"]
  gatk-haplotype:
    jvm_opts: ["-Xmx7000m"]
  gatk-vqsr:
    jvm_opts: ["-Xmx7000m"]
  picard:
    jvm_opts: ["-Xmx7000m"]
  snpeff:
    jvm_opts: ["-Xmx7000m"]
  bcbio_variation:
    jvm_opts: ["-Xmx7000m"]
  # rnaseqc:
  #   dir: /sw/pipelines/bcbio/0.8.9/share/java/RNA-SeQC
  #   jvm_opts: ["-Xmx7000m"]
  # mutect:
  #   jvm_opts: ["-Xmx7000m"]
  #   dir: /sw/pipelines/bcbio/0.8.9/share/java/mutect
  varscan:
    jvm_opts: ["-Xmx7000m"]
  vardict:
    jvm_opts: ["-Xmx7000m"]
  oncofuse:
    jvm_opts: ["-Xmx7000m"]
  express:
    memory: 7g
  dexseq:
    memory: 7g
  macs2:
    memory: 7g
  seqcluster:
    memory: 7g
# Location of galaxy configuration file, which has pointers to reference data
# https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#reference-genome-files
galaxy_config: universe_wsgi.ini
# -- Additional options for specific integration, not required for standalone usage.
# Galaxy integration. Required for retrieving information from Galaxy LIMS.
#galaxy_url: http://your/galaxy/url
#galaxy_api_key: your_galaxy_api_key
# Details for hooking automated processing to a sequencer machine.
# Not required if running standalone pipelines.
# analysis:
#   # Can specify a different remote host to initiate
#   # the copy from. This is useful for NFS shared filesystems
#   # where you want to manage the copy from the base machine.
#   copy_user:
#   copy_host:
#   store_dir: /store4/solexadata
#   base_dir: /array0/projects/Sequencing
#   worker_program: nextgen_analysis_server.py
tools_on: [gatk4]
Sorry for the confusion. This needs to go in your sample YAML file in the algorithm
section:
algorithm:
tools_on: [gatk4]
More details are in the documentation here:
https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#changing-bcbio-defaults
Hope this helps.
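To illustrate where the option needs to live, here is a rough Python sketch that treats the sample YAML as an already-parsed dict. It is not how bcbio itself reads the configuration; the helper `samples_with_tool`, the sample name `sample1` and the analysis type are made up for illustration. It only shows that `tools_on` is picked up per sample inside `algorithm`, while the same key at the top level of the system configuration is not.

```python
def samples_with_tool(config, tool):
    """Return descriptions of samples whose algorithm section enables `tool`.

    Hypothetical helper for illustration only, not part of bcbio.
    """
    enabled = []
    for sample in config.get("details", []):
        algorithm = sample.get("algorithm", {})
        if tool in algorithm.get("tools_on", []):
            enabled.append(sample.get("description"))
    return enabled


# Dict form of a sample YAML file; sample name and analysis type are made up.
sample_config = {
    "details": [
        {
            "description": "sample1",
            "analysis": "variant2",
            "genome_build": "hg19_consensus",
            "algorithm": {"tools_on": ["gatk4"]},
        }
    ]
}

# Dict form of the system configuration with tools_on appended at the end.
system_config = {
    "resources": {"default": {"memory": "7G", "cores": 64}},
    "tools_on": ["gatk4"],
}

print(samples_with_tool(sample_config, "gatk4"))  # ['sample1']
print(samples_with_tool(system_config, "gatk4"))  # []
```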
Thank you! I will try that!
Running with GATK4 I encountered this problem: https://github.com/chapmanb/bcbio-nextgen/issues/2037
So, basically, I have to use faToTwoBit to convert our reference file to a 2bit indexed genome. My question is: once I have this file, where do I put it so that GATK4 will find it? (I don't want to upgrade bcbio because this is a "production" system, and upgrading requires a whole lot of validation.)
I guess I add it to one of these files: alignseq.loc bowtie2_indices.loc bwa_index.loc gatk_sorted_picard_index.loc picard_index.loc sam_fa_indices.loc
gatk_sorted_picard_index.loc looks like this:
hg19_consensus hg19_consensus Human (hg19_consensus) /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta
Could I change it to this (with new 2bit file hg19.with.mt.fasta_twobit):
hg19_consensus hg19_consensus Human (hg19_consensus) /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta
hg19_consensus_twobit hg19_consensus_twobit Human (hg19_consensus_twobit) /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta_twobit
Then I change "genome_build" in the sample config YAML file to: genome_build: hg19_consensus_twobit
Is this correct?
Digging further and looking into tool_data_table_conf.xml, I find this section:
<!-- UCSC twoBit indexed files -->
<table name="twobit">
<columns>dbkey, value</columns>
<file path="tool-data/twobit.loc" />
</table>
Does this mean I should create a file "twobit.loc" defining our custom reference 2bit indexed genome there?
Trying to use "twobit.loc" like this:
hg19_consensus /data/ref_genomes/hg19/seq/hg19.with.mt.2bit
GATK4 still fails with Reference=None. Apparently it looks for another loc file? I'm stuck and waiting for answers.
Thank you.
Sorry about the confusion. bcbio looks for the 2bit version in a ucsc
directory relative to the fasta file so you'll have:
hg19/seq/hg19.with.mt.fa
hg19/ucsc/hg19.with.mt.2bit
Hope that gets it working for you.
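As a minimal sketch of that convention (not bcbio's actual lookup code), the expected 2bit location can be derived from the FASTA path like this. The function name `expected_twobit_path` is made up for illustration, and POSIX-style paths are assumed.

```python
import os


def expected_twobit_path(fasta_path):
    """Guess where the 2bit file is expected, given the FASTA path.

    Assumed layout: <genome>/<any dir>/<name>.fa -> <genome>/ucsc/<name>.2bit
    Hypothetical helper for illustration only, not part of bcbio.
    """
    fasta_dir = os.path.dirname(fasta_path)
    genome_dir = os.path.dirname(fasta_dir)
    # Drop only the final extension (.fa, .fasta), keep the rest of the name.
    name = os.path.splitext(os.path.basename(fasta_path))[0]
    return os.path.join(genome_dir, "ucsc", name + ".2bit")


print(expected_twobit_path("hg19/seq/hg19.with.mt.fa"))
# hg19/ucsc/hg19.with.mt.2bit
```

Running this on the FASTA path that appears in your loc files shows which directory would be searched under this assumption.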
Thank you, but no, it is still not working.
This is what I have:
alignseq.loc:
hg19 hg19 hg19 /sw/pipelines/bcbio/0.9.9/genomes/human/hg19/seq/hg19.fa
bowtie2_indices.loc:
hg19 hg19 Human (hg19) /sw/pipelines/bcbio-nextgen/1.0.5/genomes/Hsapiens/hg19/bowtie2/hg19
bwa_index.loc:
hg19_consensus hg19_consensus hg19_consensus /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta
gatk_sorted_picard_index.loc:
hg19_consensus hg19_consensus Human (hg19_consensus) /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta
picard_index.loc:
hg19_consensus hg19_consensus Human (hg19_consensus) /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta
sam_fa_indices.loc:
index hg19_consensus /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta
index hg19 /sw/pipelines/bcbio/0.9.9/genomes/human/hg19/seq/hg19.fa
twobit.loc:
hg19_consensus /data/ref_genomes/hg19/ucsc/hg19.with.mt.2bit
ls -al /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta:
/data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta => /data/ref_genomes/hg19/seq/hg19.with.mt.fasta
And I also have: ls /data/ref_genomes/hg19/ucsc/:
hg19.with.mt.2bit
hg19.with.mt.2bit.chrom.sizes
Still getting --reference None.
I found the error: /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta is a symlink to /data/ref_genomes/hg19/seq/hg19.with.mt.fasta, but bcbio does not understand that the 2bit indexed genome should then be looked up relative to the link target, at: /data/ref_genomes/hg19/ucsc/hg19.with.mt.2bit
So, I had to symlink /data/ref_genomes/hg19/ucsc/ to /data/ref_genomes/hg19/bwa/ for it to work.
Wouldn't it be easier to define the full path in a loc file for GATK4 instead?
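The effect can be reproduced outside bcbio with a small Python sketch. It rebuilds the seq/ and bwa/ layout in a temporary directory and compares the sibling ucsc directory computed from the path as written with the one computed after resolving the symlink. The helper name `sibling_ucsc_dir` is made up, the lookup rule is an assumption based on the behaviour described above, and a POSIX system with symlink support is assumed.

```python
import os
import tempfile


def sibling_ucsc_dir(fasta_path):
    """Return <dir containing fasta>/../ucsc, computed from the path string only.

    Hypothetical helper for illustration only, not part of bcbio.
    """
    return os.path.join(os.path.dirname(os.path.dirname(fasta_path)), "ucsc")


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)
    seq_dir = os.path.join(root, "hg19", "seq")
    bwa_dir = os.path.join(root, "hg19", "bwa", "BWA_0.7.10_refseq")
    os.makedirs(seq_dir)
    os.makedirs(bwa_dir)

    # Real FASTA lives in seq/, the loc files point at a symlink under bwa/.
    real_fasta = os.path.join(seq_dir, "hg19.with.mt.fasta")
    open(real_fasta, "w").close()
    link_fasta = os.path.join(bwa_dir, "hg19.with.mt.fasta")
    os.symlink(real_fasta, link_fasta)

    # Lookup based on the path as written in the loc file.
    as_given = os.path.relpath(sibling_ucsc_dir(link_fasta), root)
    # Lookup after following the symlink to the real file.
    resolved = os.path.relpath(sibling_ucsc_dir(os.path.realpath(link_fasta)), root)

print(as_given)  # hg19/bwa/ucsc
print(resolved)  # hg19/ucsc
```

The first result is why linking the ucsc directory into bwa/ made the run work: that is where a lookup based on the unresolved path ends up.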
Sorry about the continued issues and glad that you got it figured out. We're slowly moving away from *.loc files into using a consistent directory structure for new projects, so symlinking or adjusting to the standardized structure is exactly the right thing to do here. The consistent directory structure is easier to support and integrates better with in-progress common workflow language (CWL) integration. Glad you got it figured out and hope things work cleanly going forward.
I have run bcbio on a whole genome and realized that the GATK PrintReads step is a major bottleneck. Right now I am running whole-genome analysis on only one individual. Looking at the log, I can see that the "nct" option is not set:
[2018-02-12T23:45Z] compute02: export JAVA_HOME=/sw/compilers/oracle-jdk-1.8/1.8.0_152 && export PATH=/sw/compilers/oracle-jdk-1.8/1.8.0_152/bin:$PATH && /sw/pipelines/bcbio-nextgen/1.0.5/anaconda/bin/gatk -Xmx91728m -Djava.io.tmpdir=/gluster-storage-volume/projects/wp3/WGS/QE-1452/bcbio/bcbiotx/tmpmdp8Lc -T PrintReads -R /data/ref_genomes/hg19/bwa/BWA_0.7.10_refseq/hg19.with.mt.fasta -I /gluster-storage-volume/projects/wp3/WGS/QE-1452/bcbio/align/55683/55683-sort.bam -BQSR /gluster-storage-volume/projects/wp3/WGS/QE-1452/bcbio/align/55683/55683-sort-recal.grp -o /gluster-storage-volume/projects/wp3/WGS/QE-1452/bcbio/bcbiotx/tmpmdp8Lc/55683-sort-recal.bam -jdk_deflater -jdk_inflater -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment
My bcbio-config looks like this:
Any way to parallelize PrintReads?
bcbio version used: 1.0.7a