PGScatalog / pgsc_calc

The Polygenic Score Catalog Calculator is a Nextflow pipeline for polygenic score calculation.
https://pgsc-calc.readthedocs.io/en/latest/
Apache License 2.0

Problem when running on HPC computing cluster. #326

Open samreenzafer opened 5 days ago

samreenzafer commented 5 days ago

Hi. I've been able to run a few traits along with my data on the command line (on our department's computing cluster), and I'm now trying to scale it to thousands of PGS IDs by using the LSF queue system. I've finally been able to get one job running, as I'll show below, but it fails every time I submit it, at a different point each time. The log files are quite large, so I'll attach them here instead of pasting them.

My main job has exited with an error, but I still see one of the sub-jobs that the workflow creates and submits to the cluster sitting in PENDING state, which is strange.

I submitted the main job as below.

job=job.CNICS.lsf.sh
pop="CAU"
bsub -J CNICS.$pop  -P acc_rareADRs -q premium -n 2 -W 4:00 -R rusage[mem=5000] -oo $job.$pop.o -eo $job.$pop.e -L /bin/bash sh $job $pop

The shell script looks like this:

[zafers02@li03c02 test_nextflow_CNICSonly]$ cat job.CNICS.lsf.sh

dir=`readlink -f .`
cd $dir 

ml proxies
ml singularity/3.11.0
ml nextflow/24.04.2.5914

pop=$1  ### selects the sample sheet: plink_files/samplesheet.CNICSAFR.csv or samplesheet.CNICSCAU.csv
mkdir -p `pwd`/genotypes_cache/CNICS/$pop

export NXF_SINGULARITY_CACHEDIR="/---myfullpath--/PRS/pgsc-calc/work/singularity/"
export NXF_ANSI_LOG=false
export NXF_OPTS="-Xms500M -Xmx2G"

nextflow run pgscatalog/pgsc_calc --max_cpus 1 -profile singularity --input plink_files/samplesheet.CNICS${pop}.csv --pgs_id PGS000036,PGS003446,PGS002237,PGS002280 --target_build GRCh37 -w `pwd`/work/  --genotypes_cache `pwd`/genotypes_cache/CNICS/$pop  --run_ancestry /---myfullpath--/PRS/pgsc-calc/resources/pgsc_HGDP+1kGP_v1.tar.zst  --min_overlap 0.20 -c nextflow.lsf.config

And this is what the nextflow.lsf.config file looks like.

[zafers02@li03c02 test_nextflow_CNICSonly]$ cat nextflow.lsf.config

process {
    queue = 'premium'
    clusterOptions = ' -P acc_CranioProject '
    scratch = true

    withLabel:process_low {
        cpus   = 1
        memory = 4.GB
        time   = 2.h
    }
    withLabel:process_medium {
        cpus   = 8
        memory = 64.GB
        time   = 4.h
    }
}

executor {
    name = 'lsf'
    jobName = { "$task.hash" }
}

I still see the following sub-job pending execution on the cluster queue, even though the main job "CNICS.CAU" had exited with an error.

[zafers02@li03c02 test_nextflow_CNICSonly]$ bjobs
JOBID     USER     JOB_NAME      STAT QUEUE   FROM_HOST  EXEC_HOST SUBMIT_TIME  START_TIME TIME_LEFT
131073368 zafers02 *3900ddf6662  PEND premium lc02c03.ch -         Jun 27 14:13 -          -

-rw-rw-rw- 1 zafers02 nicolp01a 52K  Jun 27 14:15 .nextflow.log
-rw-rw-rw- 1 zafers02 nicolp01a 0    Jun 27 14:15 job.CNICS.lsf.sh.CAU.e
-rw-rw-rw- 1 zafers02 nicolp01a 8.0K Jun 27 14:15 job.CNICS.lsf.sh.CAU.o

I am uploading the job.CNICS.lsf.sh.CAU.o file as job.CNICS.lsf.sh.CAU.o.txt and the .nextflow.log file as job1.nextflow.log.txt here. I'm wondering what I'm doing wrong. Thank you for your time.

job.CNICS.lsf.sh.CAU.o.txt job1.nextflow.log.txt

nebfield commented 2 days ago

On an HPC it's normal for Nextflow to submit many smaller jobs when you use the lsf executor. Pending jobs can sometimes get stuck if Nextflow exits suddenly and doesn't have time to clean them up.

When a process exits with code 137, it means the process was killed because it exceeded its requested resources. The EXTRACT_DATABASE process was killed by your scheduler, which causes the workflow to exit with an error (exit code 1).

Here's a configuration profile I use for UK Biobank: https://github.com/PGScatalog/pgsc_calc/issues/328#issuecomment-2199838475

It works fine for ~150 scores. This configuration does a few things:

1) It will automatically resubmit jobs up to 3 times if they fail because of resource problems, requesting more resources on each retry (a minimal sketch of this pattern follows below)
2) It defines precisely the amount of resources needed for each process
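
For anyone who can't follow the link: the core of that profile is Nextflow's retry error strategy combined with resource directives computed from task.attempt. Below is a minimal sketch of the pattern rather than the exact UK Biobank profile; the starting values and the choice of process are illustrative.

process {
    // Retry only on the exit codes a scheduler typically produces when it
    // kills a job for exceeding its resource requests; fail fast otherwise
    errorStrategy = { task.exitStatus in [104, 134, 137, 139, 140, 143] ? 'retry' : 'finish' }
    maxRetries    = 3

    // Scale requests with each attempt: 8 GB -> 16 GB -> 24 GB
    withName: 'EXTRACT_DATABASE' {
        cpus   = 1
        memory = { 8.GB * task.attempt }
        time   = { 2.h * task.attempt }
    }
}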

samreenzafer commented 1 day ago

Thank you. I tried using your configuration profile, but changed a few lines, marked with "// changed" comments below:

// changed: added this executor block
executor {
    name = 'lsf'
    jobName = { "$task.hash" }
}

process {
    errorStrategy = 'retry'
    maxRetries = 3
    maxErrors = '-1'
    // changed: added the next three lines
    executor = 'lsf'
    queue = 'premium'
    clusterOptions = ' -P acc_CranioProject '

    withName: 'SAMPLESHEET_JSON' {

and so on... 

Then I submitted the job with the new config file as follows (I asked for 4 cores and 64 GB, since the largest process in your config file required that much):

bsub -J CNICS.$pop -P acc_rareADRs -q premium -n 4 -W 4:00 -R rusage[mem=64000] -oo $job.$pop.o -eo $job.$pop.e -L /bin/bash sh $job $pop

I did not get the error 137, but got a "pgscatalog.core.lib.pgsexceptions.QueryError: Can't query PGS Catalog API" while DOWNLOAD_SCORE attempted it three times, with exit status 11. I then tried manually on the command line to see if DOWNLOAD_SCORE would work, and it in fact did, as shown below. So I'm confused about whether there are other settings I should consider changing.

[zafers02@li03c02 test_nextflow_CNICSonly]$ singularity shell  ../pgsc-calc/work/singularity/ghcr.io-pgscatalog-pygscatalog-pgscatalog-utils-1.1.2-singularity.img  ^C
[zafers02@li03c02 test_nextflow_CNICSonly]$ cat work/7f/357285bcc8f87e5542676c30dce421/.command.sh 
#!/bin/bash -euo pipefail
pgscatalog-download -i PGS000036 PGS003446 PGS002237 PGS002280                                    -b GRCh37         -o $PWD         -v         -c pgsc_calc/2.0.0-beta

cat <<-END_VERSIONS > versions.yml
DOWNLOAD_SCOREFILES:
    pgscatalog.core: $(echo $(python -c 'import pgscatalog.core; print(pgscatalog.core.__version__)'))
END_VERSIONS
[zafers02@li03c02 test_nextflow_CNICSonly]$ singularity shell  ../pgsc-calc/work/singularity/ghcr.io-pgscatalog-pygscatalog-pgscatalog-utils-1.1.2-singularity.img  
Singularity> pgscatalog-download -i PGS000036 PGS003446 PGS002237 PGS002280                                    -b GRCh37         -o $PWD         -v         -c pgsc_calc/2.0.0-beta
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:56 DEBUG    Verbose logging enabled
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:56 INFO     Setting user agent to pgsc_calc/2.0.0-beta
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:56 INFO     Downloading scoring files that have been harmonised to build=GenomeBuild.GRCh37
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO     Submitting ScoringFile('PGS000036', target_build=GenomeBuild.GRCh37) download
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO     Submitting ScoringFile('PGS002237', target_build=GenomeBuild.GRCh37) download
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO     Submitting ScoringFile('PGS002280', target_build=GenomeBuild.GRCh37) download
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO     Submitting ScoringFile('PGS003446', target_build=GenomeBuild.GRCh37) download
  0%|                | 0/4 [00:00<?, ?it/s]pgscatalog.core.cli.download_cli: 2024-07-01 12:45:59 INFO     Download complete
 25%|████            | 1/4 [00:01<00:05,  1.89s/it]pgscatalog.core.cli.download_cli: 2024-07-01 12:46:03 INFO     Download complete
 50%|████████        | 2/4 [00:05<00:06,  3.19s/it]pgscatalog.core.cli.download_cli: 2024-07-01 12:46:04 INFO     Download complete
 75%|████████████    | 3/4 [00:07<00:02,  2.28s/it]pgscatalog.core.cli.download_cli: 2024-07-01 12:46:07 INFO     Download complete
100%|████████████████| 4/4 [00:10<00:00,  2.53s/it]
pgscatalog.core.cli.download_cli: 2024-07-01 12:46:07 INFO     All downloads finished
Singularity> 

Here is my Nextflow log file, job2.nextflow.log.txt, and job output log file, job2.job.CNICS.lsf.sh.CAU.o.txt.

I am going to try deleting the entire work folder and re-running the job.

samreenzafer commented 1 day ago

Can I download all PGS traits beforehand and ask Nextflow to use the downloaded files from a directory, rather than downloading files live while the pipeline is running? Something similar to the reference files?

nebfield commented 1 day ago

You could use pgscatalog-download to preload scoring files.

The --scorefile parameter supports multiple local scoring files.

You can install the pgscatalog.core package with pip or bioconda.
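
Putting those suggestions together, a preload workflow might look like the sketch below. The pgscatalog-download invocation mirrors the one shown in .command.sh earlier in this thread; the scorefiles directory name and the glob pattern (which assumes the default *.txt.gz output names) are illustrative.

# Install the CLI once, outside the pipeline (bioconda also works)
pip install pgscatalog.core

# Preload the harmonised scoring files into a local directory
mkdir -p scorefiles
pgscatalog-download -i PGS000036 PGS003446 PGS002237 PGS002280 \
    -b GRCh37 -o scorefiles

# Point the pipeline at the local copies with --scorefile instead of
# --pgs_id, so no live PGS Catalog API access is needed at runtime
nextflow run pgscatalog/pgsc_calc -profile singularity \
    --input plink_files/samplesheet.CNICSCAU.csv \
    --target_build GRCh37 \
    --scorefile "scorefiles/*.txt.gz"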

samreenzafer commented 14 hours ago

Thanks. I downloaded some PRS scoring files and then ran a job testing one PRS trait (using --scorefile), and it ran to completion. So I've bypassed the issue of the job not being able to execute the DOWNLOAD_SCORE step.

I will now begin benchmarking the resource requirements for multiple traits, since we intend to run all ~4800 traits from the PGS Catalog for both our CAU and AFR subsets of target data, each of which has >4000 samples. I did have to reduce the memory requirements for sub-tasks (in the config file) to 8 GB wherever they exceeded 8 GB; otherwise my jobs sat pending on the cluster queues without ever entering the run stage (see the sketch below).
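
As an aside, newer Nextflow releases (including the nextflow/24.04 module loaded in the job script earlier in this thread) offer a resourceLimits setting that clamps every process's requests to a site-wide ceiling, which may be easier than lowering each withName block by hand. A minimal sketch using the 8 GB ceiling described above; the cpus and time caps are illustrative:

process {
    // Requests above these ceilings are clamped down to them, so
    // per-process settings like 64.GB no longer leave jobs pending
    resourceLimits = [ cpus: 4, memory: 8.GB, time: 4.h ]
}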

If you ever figure out the original "pgscatalog.core.lib.pgsexceptions.QueryError: Can't query PGS Catalog API" error, or if our HPC informaticians help me figure this out, I will update here.

Thank you.