ENCODE-DCC / chip-seq-pipeline2

ENCODE ChIP-seq pipeline
MIT License

Pipeline stalls before IDR steps #276

Open gene-drive opened 2 years ago

gene-drive commented 2 years ago

Describe the bug

The pipeline stalls or ends prematurely before the IDR steps. I have successfully run the pipeline on datasets using the mm10 genome, but it stalls when I use custom genomes.

OS/Platform

Caper configuration file

Paste contents of ~/.caper/default.conf.

backend=slurm

# define one of the following (or both) according to your
# cluster's SLURM configuration.

# SLURM partition. Define only if required by a cluster. You must define it for Stanford Sherlock.
slurm-partition=batch
# SLURM account. Define only if required by a cluster. You must define it for Stanford SCG.
slurm-account=usr1lab

# This parameter is NOT for 'caper submit' BUT for 'caper run' and 'caper server' only.
# This resource parameter string will be passed to sbatch, qsub, bsub, ...
# You can customize it according to your cluster's configuration.

# Note that Cromwell's implicit type conversion (String to Integer)
# seems to be buggy for WomLong type memory variables (memory_mb and memory_gb).
# So be careful about using the + operator between WomLong and other types (String, even Int).
# For example, ${"--mem=" + memory_mb} will not work since memory_mb is WomLong.
# Use ${"if defined(memory_mb) then "--mem=" else ""}{memory_mb}${"if defined(memory_mb) then "mb " else " "}
# See https://github.com/broadinstitute/cromwell/issues/4659 for details

# Cromwell's built-in variables (attributes defined in WDL task's runtime)
# Use them within ${} notation.
# - cpu: number of cores for a job (default = 1)
# - memory_mb, memory_gb: total memory for a job in MB, GB
#   * these are converted from 'memory' string attribute (including size unit)
#     defined in WDL task's runtime
# - time: time limit for a job in hours
# - gpu: specified gpu name or number of gpus (it's declared as String)

slurm-resource-param=-n 1 --ntasks-per-node=1 --cpus-per-task=${cpu} ${if defined(memory_mb) then "--mem=" else ""}${memory_mb}${if defined(memory_mb) then "M" else ""} ${if defined(time) then "--time=" else ""}${time*60} ${if defined(gpu) then "--gres=gpu:" else ""}${gpu} 

# If needed uncomment and define any extra SLURM sbatch parameters here
# YOU CANNOT USE WDL SYNTAX AND CROMWELL BUILT-IN VARIABLES HERE
#slurm-extra-param=

# Hashing strategy for call-caching (3 choices)
# This parameter is for local (local/slurm/sge/pbs/lsf) backend only.
# This is important for call-caching,
# which means re-using outputs from previous/failed workflows.
# The cache will miss if a different strategy is used.
# "file" was the default for all old versions of Caper (<1.0).
# "path+modtime" is the new default for Caper>=1.0:
#   file: use md5sum hash (slow).
#   path: use path.
#   path+modtime: use path and modification time.
local-hash-strat=path+modtime

# Metadata DB for call-caching (reusing previous outputs):
# Cromwell supports restarting workflows based on a metadata DB.
# The DB is in-memory by default.
#db=in-memory

# If you use 'caper server' then you can use one unified '--file-db'
# for all submitted workflows. In that case, uncomment the following two lines
# and define file-db as an absolute path to store metadata for all workflows.
#db=file
#file-db=

# If you use 'caper run' and want to use call-caching:
# Make sure to define different 'caper run ... --db file --file-db DB_PATH'
# for each pipeline run.
# But if you want to restart a run, define the same '--db file --file-db DB_PATH'
# and Caper will collect/re-use previous outputs without running the same tasks again.
# Previous outputs will simply be hard/soft-linked.
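# A minimal sketch of the above (illustrative WDL/JSON/DB paths, not from this run):
#   caper run chip.wdl -i input.json --db file --file-db /path/to/run1_metadata.db
# Re-running the same command with the same --file-db lets Caper re-use that run's outputs.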

# Local directory for localized files and Cromwell's intermediate files
# If not defined, Caper will make .caper_tmp/ under local-out-dir or the CWD.
# /tmp is not recommended here since Caper stores all localized data files
# in this directory (e.g. input FASTQs defined as URLs in the input JSON).
local-loc-dir=

cromwell=/home/usr1/.caper/cromwell_jar/cromwell-65.jar
womtool=/home/usr1/.caper/womtool_jar/womtool-65.jar
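
For context, Cromwell substitutes each task's runtime attributes (cpu, memory_mb, time, gpu) into the slurm-resource-param string above when it submits a job. A hedged sketch of what that string renders to for a hypothetical task with cpu=4, memory_mb=16384, time=24 (hours) and no GPU; the values are illustrative, and a fully rendered sbatch command from this run appears in the log excerpt at the end of the thread:

# hypothetical rendering of slurm-resource-param for cpu=4, memory_mb=16384,
# time=24 (hours, multiplied by 60 in the template) and no gpu:
#   -n 1 --ntasks-per-node=1 --cpus-per-task=4 --mem=16384M --time=1440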

Input JSON file

Paste contents of your input JSON file.

{
    "chip.pipeline_type" : "tf",
    "chip.genome_tsv" : "/scratch/usr1/chrPicGB_IDR_v2/build_genome_output_10k/chrPic_GB_10k.tsv",
    "chip.fastqs_rep1_R1" : ["/scratch/usr1/chrPic1_IDR/ts_ChIP_R1.fastq.gz" ],
    "chip.fastqs_rep2_R1" : ["/scratch/usr1/chrPic1_IDR/ts_ChIP_R3.fastq.gz" ],
    "chip.ctl_fastqs_rep1_R1" : ["/scratch/usr1/chrPic1_IDR/input_R1.fastq.gz" ],
    "chip.ctl_fastqs_rep2_R1" : ["/scratch/usr1/chrPic1_IDR/nput_R3.fastq.gz" ],
    "chip.paired_end" : false,
    "chip.peak_caller" : "macs2", 
    "chip.title" : "chrPic_GB_10k_macs2",
    "chip.description" : "ChIP-Seq"
}

Troubleshooting result

Below are some of the output logs from a run that ended prematurely.

cromwell.out.log output.err.log output.out.log

leepc12 commented 2 years ago

The pipeline failed at several macs2 tasks. Looks like an out-of-memory or out-of-disk-space problem? BTW, how did you build your custom genome TSV file?

gene-drive commented 2 years ago

I notice that the original IDR script I submit as a job doesn't use much memory at all, but it does spawn several smaller jobs whose names start with cromwell_. Are these the ones that would need more memory? How would I set the memory requirement, and do you have a recommendation on how much memory to request?

To build the genome, I used build_genome_data.sh and made the following modifications:

elif [[ "${GENOME}" == "chrPic_GB_10k" ]]; then
  # Perl style regular expression to keep regular chromosomes only.
  # this reg-ex will be applied to peaks after blacklist filtering (b-filt) with "grep -P".
  # so that b-filt peak file (.bfilt.*Peak.gz) will only have chromosomes matching with this pattern
  # this reg-ex will work even without a blacklist.
  # you will still be able to find a .bfilt. peak file
  # use ".*", which means ALL CHARACTERS, if you want to keep all chromosomes
  # use "chr[\dXY]+" to allow chr[NUMBERS], chrX and chrY only
  # this is important to make your final output peak file (bigBed) work with genome browsers
  REGEX_BFILT_PEAK_CHR_NAME=".*"
  # REGEX_BFILT_PEAK_CHR_NAME="chr[\dXY]+"

  # mitochondrial chromosome name (e.g. chrM, MT)
  MITO_CHR_NAME="chrM"
  # URL for your reference FASTA (fasta, fasta.gz, fa, fa.gz, 2bit)
  REF_FA="https://www.dropbox.com/s/xxxxxxx/chrPic_GenBank.fa.gz"
  # 3-col blacklist BED file to filter out overlapping peaks from b-filt peak file (.bfilt.*Peak.gz file).
  # leave it empty if you don't have one
  BLACKLIST=
fi

Because the genome I'm using has thousands of scaffolds and the header names are long, I decided to use ".*" to allow all characters and keep all the chromosomes.

For reference, this is what the FASTA headers look like:

>JAAOEE010000028.1 Chrysemys picta bellii isolate R12L10 Contig25.1, whole genome shotgun sequence
>MU021221.1 Chrysemys picta bellii isolate R12L10 unplaced genomic scaffold Scaffold39, whole genome shotgun sequence

One more question about the line recommending "chr[\dXY]+": what does the comment "# this is important to make your final output peak file (bigBed) work with genome browsers" mean?

leepc12 commented 2 years ago

For heavy tasks (bowtie2, spp, macs2, ...) the pipeline allocates memory according to the size of the inputs, but for IDR it allocates a fixed 4 GB.

You need to edit this line if IDR fails due to OOM: https://github.com/ENCODE-DCC/chip-seq-pipeline2/blob/master/chip.wdl#L2883
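
The linked line number is tied to one specific chip.wdl revision, so it may drift between releases. A sketch of how to locate the setting in whatever copy you have checked out; the task name and attribute spelling here are assumptions, so adjust the patterns if nothing shows up:

# print memory-related lines near the idr task definition in the local chip.wdl
grep -n -A 40 -i 'task idr' chip.wdl | grep -i 'mem'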

cromwell_ is a prefix for all jobs. Caper + the chip-seq pipeline will automatically allocate a reasonable amount of memory for each job, so in most cases you don't need to modify the memory settings (the chip.*_mem_factor parameters).

The comment "# this is important to make your final output peak file (bigBed) work with genome browsers" means that the chromosome names left after filtering with your REGEX should match the genome browser's assembly, e.g. chr1 in the UCSC browser's hg19 assembly.
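
If you want to double-check what a stricter regex would keep before rebuilding the genome, one option is to test the candidate REGEX_BFILT_PEAK_CHR_NAME against the sequence names in the FASTA, since the filter is just a "grep -P" on chromosome names. A sketch, assuming the chrPic_GenBank.fa.gz from the genome-build step is available locally:

# list the sequence names in the custom FASTA and count how many would survive
# a "grep -P" filter with a candidate REGEX_BFILT_PEAK_CHR_NAME
zcat chrPic_GenBank.fa.gz | grep '^>' | cut -d' ' -f1 | sed 's/^>//' > seq_names.txt
wc -l seq_names.txt                             # total number of sequences
grep -c -P 'chr[\dXY]+' seq_names.txt || true   # strict regex: likely 0 matches for these headers
grep -c -P '.*' seq_names.txt                   # permissive regex: keeps every name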

It looks like macs2 failed due to OOM? I found the following in the error log: the pipeline tried to allocate 2048M for MACS2, which is too small for most cases, so I suspect the BAM file may be too small. Please find the alignment QC log files (*.qc) and check the number of reads (see the sketch after the log excerpt below).

for ITER in 1 2 3
do
    sbatch --export=ALL -J cromwell_8d6144a6_read_genome_tsv -D /lustre2/scratch/usr1/chrPicGB__IDR_v2/10k_macs2/chip/8d6144a6-70f6-4177-8499-98c083a9e0ad/call-read_genome_tsv -o /lustre2/scratch/usr1/chrPicGB__IDR_v2/10k_macs2/chip/8d6144a6-70f6-4177-8499-98c083a9e0ad/call-read_genome_tsv/execution/stdout -e /lustre2/scratch/usr1/chrPicGB__IDR_v2/10k_macs2/chip/8d6144a6-70f6-4177-8499-98c083a9e0ad/call-read_genome_tsv/execution/stderr \
        -p batch --account dmlab \
        -n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=2048M --time=240  \
         \
        /lustre2/scratch/usr1/chrPicGB__IDR_v2/10k_macs2/chip/8d6144a6-70f6-4177-8499-98c083a9e0ad/call-read_genome_tsv/execution/script.caper && exit 0
    sleep 30
done
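
As a follow-up to the QC suggestion above, a sketch of how to pull read counts out of the alignment QC logs; the run directory is taken from the log excerpt and the "*.qc" file layout is an assumption, so adjust the patterns to what you actually see:

# print read/mapped counts from QC files under the run's output directory
RUN_DIR=/lustre2/scratch/usr1/chrPicGB__IDR_v2/10k_macs2/chip/8d6144a6-70f6-4177-8499-98c083a9e0ad
find "$RUN_DIR" -name '*.qc' | while read -r qc; do
    echo "== $qc"
    grep -i -E 'read|mapped' "$qc" | head -n 5
done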