Open samreenzafer opened 5 days ago
On an HPC it's normal for Nextflow to submit many smaller jobs when you use the lsf executor. Pending jobs can sometimes get stuck in the queue if Nextflow exits suddenly and doesn't have time to clean up.
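If a job is left behind like this, it can be found and cancelled by hand. The snippet below is a minimal sketch, assuming the default bjobs column layout (job ID in the first column, STAT in the fourth) and job names without spaces; the helper name pending_job_ids is hypothetical.

```shell
# Print the IDs of pending jobs from default `bjobs` output read on stdin.
# Assumes the header is the first line and STAT is the 4th column.
pending_job_ids() {
    awk 'NR > 1 && $4 == "PEND" { print $1 }'
}

# Example using the listing posted in this thread:
pending_job_ids <<'EOF'
JOBID USER JOB_NAME STAT QUEUE FROM_HOST EXEC_HOST SUBMIT_TIME START_TIME TIME_LEFT
131073368 zafers02 *3900ddf6662 PEND premium lc02c03.ch - Jun 27 14:13 - -
EOF

# On the cluster, after checking the list, something like this would cancel them:
#   bjobs -u "$USER" | pending_job_ids | xargs -r bkill
```

Check the printed IDs before piping them to bkill, since the filter also matches pending jobs that belong to other workflows of yours.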
When a process exits with code 137, it usually means the process was killed (128 + 9, i.e. SIGKILL) because it exceeded its requested resources, most often memory. The EXTRACT_DATABASE process was killed by your scheduler, which caused the workflow to exit with an error (exit code 1).
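To see where 137 comes from, here is a small sketch that reproduces it locally: a child shell kills itself with SIGKILL and the parent reads the exit status. Nothing here is specific to LSF or Nextflow.

```shell
# Run a child shell that sends SIGKILL (signal 9) to itself.
# The `|| status=$?` form keeps the script going even under `set -e`.
status=0
sh -c 'kill -KILL $$' || status=$?

# A process killed by signal N is reported by the shell as 128 + N.
signal=$((status - 128))
echo "exit status: $status, signal: $signal"
```

A scheduler that kills a job for exceeding its memory limit typically uses the same signal, which is why the task shows up in Nextflow with exit status 137.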
Here's a configuration profile I use for UK Biobank: https://github.com/PGScatalog/pgsc_calc/issues/328#issuecomment-2199838475
It works fine for ~150 scores. This configuration does a few things:
1) It automatically resubmits jobs up to 3 times if they fail because of resource problems, requesting more resources on each attempt
2) It defines precisely the resources needed for each process
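As a rough illustration of those two points (not a copy of the linked profile), the sketch below writes a small Nextflow config file. The file name retry_sketch.config, the selector EXAMPLE_PROCESS, the exit codes, and the base memory and time values are all assumptions to adapt to your cluster.

```shell
# Write an illustrative config fragment; the quoted 'EOF' stops the shell
# from expanding anything inside the heredoc.
cat > retry_sketch.config <<'EOF'
process {
    // Retry only on exit codes that usually mean "killed by the scheduler".
    // The list of codes is an assumption; adjust it for your cluster.
    errorStrategy = { task.exitStatus in [137, 140] ? 'retry' : 'finish' }
    maxRetries    = 3

    // Hypothetical process name and base values, scaled up on each attempt.
    withName: 'EXAMPLE_PROCESS' {
        memory = { 8.GB * task.attempt }
        time   = { 1.hour * task.attempt }
    }
}
EOF

echo "wrote retry_sketch.config"
```

A file like this can be passed to a run with -c retry_sketch.config. Because task.attempt is 1 on the first try, the base values apply first and only grow when a task is resubmitted.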
Thank you. I tried using your configuration profile, but changed a few lines, highlighted in bold below:
**executor {
name = 'lsf'
jobName = { "$task.hash" }
}**
process {
errorStrategy = 'retry'
maxRetries = 3
maxErrors = '-1'
**executor = 'lsf'
queue = 'premium'
clusterOptions = ' -P acc_CranioProject '**
withName: 'SAMPLESHEET_JSON' {
and so on...
Then I submitted the job with the new config file as follows. I asked for 4 cores and 64 GB for each, since the process with the largest requirements in your config file needed that much.
bsub -J CNICS.$pop -P acc_rareADRs -q premium -n 4 -W 4:00 -R rusage[mem=64000] -oo $job.$pop.o -eo $job.$pop.e -L /bin/bash sh $job $pop
I did not get the error 137, but got a "pgscatalog.core.lib.pgsexceptions.QueryError: Can't query PGS Catalog API" error during DOWNLOAD_SCORE. Nextflow attempted it three times, each with exit status 11. I then tried running the DOWNLOAD_SCORE command manually on the command line, and it did in fact work, as shown below. So I'm wondering whether there are other settings I should consider changing.
[zafers02@li03c02 test_nextflow_CNICSonly]$ singularity shell ../pgsc-calc/work/singularity/ghcr.io-pgscatalog-pygscatalog-pgscatalog-utils-1.1.2-singularity.img ^C
[zafers02@li03c02 test_nextflow_CNICSonly]$ cat work/7f/357285bcc8f87e5542676c30dce421/.command.sh
#!/bin/bash -euo pipefail
pgscatalog-download -i PGS000036 PGS003446 PGS002237 PGS002280 -b GRCh37 -o $PWD -v -c pgsc_calc/2.0.0-beta
cat <<-END_VERSIONS > versions.yml
DOWNLOAD_SCOREFILES:
pgscatalog.core: $(echo $(python -c 'import pgscatalog.core; print(pgscatalog.core.__version__)'))
END_VERSIONS
[zafers02@li03c02 test_nextflow_CNICSonly]$ singularity shell ../pgsc-calc/work/singularity/ghcr.io-pgscatalog-pygscatalog-pgscatalog-utils-1.1.2-singularity.img
Singularity> pgscatalog-download -i PGS000036 PGS003446 PGS002237 PGS002280 -b GRCh37 -o $PWD -v -c pgsc_calc/2.0.0-beta
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:56 DEBUG Verbose logging enabled
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:56 INFO Setting user agent to pgsc_calc/2.0.0-beta
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:56 INFO Downloading scoring files that have been harmonised to build=GenomeBuild.GRCh37
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO Submitting ScoringFile('PGS000036', target_build=GenomeBuild.GRCh37) download
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO Submitting ScoringFile('PGS002237', target_build=GenomeBuild.GRCh37) download
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO Submitting ScoringFile('PGS002280', target_build=GenomeBuild.GRCh37) download
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO Submitting ScoringFile('PGS003446', target_build=GenomeBuild.GRCh37) download
0%| | 0/4 [00:00<?, ?it/s]pgscatalog.core.cli.download_cli: 2024-07-01 12:45:59 INFO Download complete
25%|████████████████████████████████████████ | 1/4 [00:01<00:05, 1.89s/it]pgscatalog.core.cli.download_cli: 2024-07-01 12:46:03 INFO Download complete
50%|████████████████████████████████████████████████████████████████████████████████ | 2/4 [00:05<00:06, 3.19s/it]pgscatalog.core.cli.download_cli: 2024-07-01 12:46:04 INFO Download complete
75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 3/4 [00:07<00:02, 2.28s/it]pgscatalog.core.cli.download_cli: 2024-07-01 12:46:07 INFO Download complete
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:10<00:00, 2.53s/it]
pgscatalog.core.cli.download_cli: 2024-07-01 12:46:07 INFO All downloads finished
Singularity>
Here is my nextflow log file job2.nextflow.log.txt and job_output_log file job2.job.CNICS.lsf.sh.CAU.o.txt
I am going to try deleting the entire work folder and re-running the job.
Can I download all PGS traits beforehand and ask Nextflow to use the downloaded files from a directory, rather than downloading files live while the pipeline is running? Something similar to the reference files?
You could use pgscatalog-download to preload scoring files.
The --scorefile parameter supports multiple local scoring files.
You can install the pgscatalog.core package with pip or bioconda.
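Putting those steps together, a minimal sketch might look like the script below. The pgscatalog-download flags are the ones shown earlier in this thread; the directory name scorefiles, the samplesheet path, the quoted wildcard passed to --scorefile, and the dry-run wrapper are assumptions for illustration. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
set -eu

# Hypothetical locations; adjust to your project.
SCORE_DIR="${SCORE_DIR:-scorefiles}"
BUILD="${BUILD:-GRCh37}"
PGS_IDS="PGS000036 PGS003446 PGS002237 PGS002280"

# Print each command instead of running it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One-off install on a node with internet access, e.g.:
#   pip install pgscatalog.core

run mkdir -p "$SCORE_DIR"

# PGS_IDS is left unquoted on purpose so that each ID becomes its own argument.
run pgscatalog-download -i $PGS_IDS -b "$BUILD" -o "$SCORE_DIR"

# The wildcard stays quoted so that Nextflow, not the shell, expands it.
run nextflow run pgscatalog/pgsc_calc -profile singularity \
    -c nextflow.lsf.config \
    --input samplesheet.csv \
    --target_build "$BUILD" \
    --scorefile "$SCORE_DIR/*.txt.gz"
```

Downloading on a login node and pointing the pipeline at local files means the pipeline does not need to fetch scoring files while it is running.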
Thanks. I downloaded some PRS score files and then ran a job testing 1 PRS trait (using --scorefile), and it ran to completion. So I've bypassed the issue of the job not being able to execute the "DOWNLOAD_SCORE" step.
I will now begin benchmarking the resource requirements for multiple traits, since we intend to run all ~4800 traits from the PGS Catalog for both the CAU and AFR subsets of our target data, each of which has >4000 samples. I did have to reduce the memory requirements for subtasks in the config file, capping anything above 8GB at 8GB; otherwise my jobs stayed pending on the cluster queues without ever entering the "run" stage.
If you ever figure out the original "pgscatalog.core.lib.pgsexceptions.QueryError: Can't query PGS Catalog API" error, or if our HPC informaticians help me figure this out, I will update here.
Thank you.
Hi. I've been able to run a few traits with my data on the command line (on our department's computing cluster), and I'm now trying to scale up to thousands of PGS IDs by using the LSF queue system. I've finally been able to get 1 job running, as I'll show below, but it fails at a different point every time I submit it. The log files are quite large, so I'll upload them here instead of pasting them.
My main job has exited with an error, but one of the sub-jobs that the workflow created and submitted to the cluster is still in the PENDING state, which is strange.
I submitted the main job as below.
The shell script looks like this:
And this is what the nextflow.lsf.config file looks like.
I still see the following sub-job pending execution on the cluster queue, even though the main job "CNICS.CAU" had exited with error.
[zafers02@li03c02 test_nextflow_CNICSonly]$ bjobs
JOBID USER JOB_NAME STAT QUEUE FROM_HOST EXEC_HOST SUBMIT_TIME START_TIME TIME_LEFT
131073368 zafers02 *3900ddf6662 PEND premium lc02c03.ch - Jun 27 14:13 - -
-rw-rw-rw- 1 zafers02 nicolp01a 52K Jun 27 14:15 .nextflow.log
-rw-rw-rw- 1 zafers02 nicolp01a 0 Jun 27 14:15 job.CNICS.lsf.sh.CAU.e
-rw-rw-rw- 1 zafers02 nicolp01a 8.0K Jun 27 14:15 job.CNICS.lsf.sh.CAU.o
I am uploading the job.CNICS.lsf.sh.CAU.o file as job.CNICS.lsf.sh.CAU.o.txt and the .nextflow.log file as job1.nextflow.log.txt here. I'm wondering what I'm doing wrong here. Thank you for your time.
job.CNICS.lsf.sh.CAU.o.txt job1.nextflow.log.txt