gene-drive opened this issue 2 years ago
The pipeline failed at several macs2 tasks. It looks like an out-of-memory or out-of-disk-space problem?
BTW how did you build your custom genome TSV file?
I noticed that the original IDR script I submit as a job doesn't use much memory at all, but it does spawn several smaller jobs whose names start with cromwell_. Are these the ones that would need more memory? How would I set the memory requirement, and do you have a recommendation for how much memory to request?
To build the genome, I used build_genome_data.sh and made the following modifications:
elif [[ "${GENOME}" == "chrPic_GB_10k" ]]; then
# Perl style regular expression to keep regular chromosomes only.
# this reg-ex will be applied to peaks after blacklist filtering (b-filt) with "grep -P".
# so that b-filt peak file (.bfilt.*Peak.gz) will only have chromosomes matching with this pattern
# this reg-ex will work even without a blacklist.
# you will still be able to find a .bfilt. peak file
# use ".*", which means ALL CHARACTERS, if you want to keep all chromosomes
# use "chr[\dXY]+" to allow chr[NUMBERS], chrX and chrY only
# this is important to make your final output peak file (bigBed) work with genome browsers
REGEX_BFILT_PEAK_CHR_NAME=".*"
# REGEX_BFILT_PEAK_CHR_NAME="chr[\dXY]+"
# mitochondrial chromosome name (e.g. chrM, MT)
MITO_CHR_NAME="chrM"
# URL for your reference FASTA (fasta, fasta.gz, fa, fa.gz, 2bit)
REF_FA="https://www.dropbox.com/s/xxxxxxx/chrPic_GenBank.fa.gz"
# 3-col blacklist BED file to filter out overlapping peaks from b-filt peak file (.bfilt.*Peak.gz file).
# leave it empty if you don't have one
BLACKLIST=
fi
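To make the effect of REGEX_BFILT_PEAK_CHR_NAME concrete, here is a small sketch of how a "grep -P" filter on the chromosome column behaves with the two regexes discussed above. The file name and peak rows are invented for illustration only; they are not produced by the pipeline.

```shell
# Illustration only: the peak rows below are made up.
printf 'chr1\t100\t200\nchrM\t10\t20\nJAAOEE010000028.1\t5\t50\n' > demo_peaks.bed

# With REGEX_BFILT_PEAK_CHR_NAME="chr[\dXY]+", only chr1 survives
# (chrM and the GenBank scaffold are dropped):
grep -P '^chr[\dXY]+\b' demo_peaks.bed

# With ".*", every chromosome (including unplaced scaffolds) is kept:
grep -cP '^.*' demo_peaks.bed
```

For a genome with thousands of long GenBank-style scaffold names, ".*" is the practical choice, since no "chr"-style pattern would match them.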
Because the genome I'm using has thousands of scaffolds and the header names are long, I decided to use ".*"
to allow all characters and keep all the chromosomes.
For reference, this is what the FASTA headers look like:
>JAAOEE010000028.1 Chrysemys picta bellii isolate R12L10 Contig25.1, whole genome shotgun sequence
>MU021221.1 Chrysemys picta bellii isolate R12L10 unplaced genomic scaffold Scaffold39, whole genome shotgun sequence
Another question about the line recommending "chr[\dXY]+": what does the comment "# this is important to make your final output peak file (bigBed) work with genome browsers" mean?
For heavy tasks (bowtie2, spp, macs2, ...), the pipeline allocates memory according to the size of the inputs, but for IDR it allocates a fixed amount of memory (4 GB).
You need to edit this line if IDR fails due to OOM: https://github.com/ENCODE-DCC/chip-seq-pipeline2/blob/master/chip.wdl#L2883
cromwell_ is a prefix for all jobs. Caper + chip-seq-pipeline will automatically allocate a good amount of memory for each job. In most cases you don't need to modify memory settings (the chip.mem_factor_* parameters).
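If a task does keep failing with OOM, those factors can be raised in the input JSON. A hypothetical fragment is below; the exact parameter names (and which tasks expose a mem_factor) should be checked against the input definitions in chip.wdl, since the names here are assumptions:

```json
{
    "chip.call_peak_mem_factor": 0.5,
    "chip.macs2_signal_track_mem_factor": 0.5
}
```

The factor scales the memory allocated per task relative to the input size, so doubling it roughly doubles the requested memory.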
"# this is important to make your final output peak file (bigBed) work with genome browsers" means that the chromosome names left after filtering with your REGEX should match the genome browser's assembly, for example chr1 in the UCSC browser's hg19 assembly.
It looks like macs2 failed due to OOM?
I found this in the error log. So the pipeline tried to allocate 2048M for MACS2, which is too small for most cases. I think the BAM file is too small? Please find the alignment QC log files (*.qc) and check the number of reads.
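A quick way to scan the read counts is to pull the total-reads line out of every *.qc log. This sketch assumes the logs contain samtools-flagstat-style "in total" lines; the file name and count below are invented for illustration:

```shell
# Hypothetical example: this file and its read count are made up.
printf '5000000 + 0 in total (QC-passed reads + QC-failed reads)\n' > rep1.samstats.qc

# Print the total-read line from every *.qc log in the working directory:
for f in *.qc; do
  printf '%s: %s\n' "$f" "$(grep -m1 'in total' "$f")"
done
```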
for ITER in 1 2 3
do
sbatch --export=ALL -J cromwell_8d6144a6_read_genome_tsv -D /lustre2/scratch/usr1/chrPicGB__IDR_v2/10k_macs2/chip/8d6144a6-70f6-4177-8499-98c083a9e0ad/call-read_genome_tsv -o /lustre2/scratch/usr1/chrPicGB__IDR_v2/10k_macs2/chip/8d6144a6-70f6-4177-8499-98c083a9e0ad/call-read_genome_tsv/execution/stdout -e /lustre2/scratch/usr1/chrPicGB__IDR_v2/10k_macs2/chip/8d6144a6-70f6-4177-8499-98c083a9e0ad/call-read_genome_tsv/execution/stderr \
-p batch --account dmlab \
-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=2048M --time=240 \
\
/lustre2/scratch/usr1/chrPicGB__IDR_v2/10k_macs2/chip/8d6144a6-70f6-4177-8499-98c083a9e0ad/call-read_genome_tsv/execution/script.caper && exit 0
sleep 30
done
Describe the bug
The pipeline stalls or ends prematurely before the IDR steps. I have successfully run the pipeline on datasets using the mm10 genome, but it stalls when I use custom genomes.
OS/Platform
Conda version (output of $ conda --version).

Caper configuration file
Paste contents of ~/.caper/default.conf.

Input JSON file
Paste contents of your input JSON file.
Troubleshooting result
Below are some of the output logs from a run that ended prematurely.
Attached logs: cromwell.out.log, output.err.log, output.out.log