PGScatalog / pgsc_calc

The Polygenic Score Catalog Calculator is a Nextflow pipeline for polygenic score calculation
https://pgsc-calc.readthedocs.io/en/latest/
Apache License 2.0

Error in APPLY_SCORE:SCORE_AGGREGATE (reference) #308

Closed: uelandte closed this issue 5 months ago

uelandte commented 5 months ago

Hi pgsc_calc team!

I recently started a discussion about using pgsc_calc within cloud VM environments that have limited permissions built in (Terra, All of Us). Since then, I have spun up a VM directly through Google's Compute Engine and have been able to run the test profile using Docker.

However, when running our genotypes through the pipeline, we get almost all the way through before hitting an error at the very end, in APPLY_SCORE:SCORE_AGGREGATE. The relevant part of .nextflow.log is shown below. The aggregation step appears to fail because the biobank file is larger than the reference file.

Any suggestions on how we might resolve this? Thanks for your time!

-Thomas

Nextflow call

nextflow run pgscatalog/pgsc_calc -profile docker \
  --input "/home/thomas_ueland/samplesheet.csv" \
  --format "csv" \
  --scorefile "/home/thomas_ueland/prscsx_combined-weight-file-forpgsccalc_MVPUKB-EURAFR-Ch38.txt" \
  --target_build GRCh38 \
  --outdir /home/thomas_ueland/pgsc_calc_output \
  --run_ancestry /home/thomas_ueland/pgsc_HGDP+1kGP_v1.tar.zst \
  --min_overlap 0.0

Truncated .nextflow.log

  May-31 00:50:05.980 [main] DEBUG nextflow.cli.Launcher - $> nextflow run pgscatalog/pgsc_calc -profile docker --input /home/thomas_ueland/samplesheet.csv --format csv --scorefile /home/thomas_ueland/prscsx_combined-weight-file-forpgsccalc_MVPUKB-EURAFR-Ch38.txt --target_build GRCh38 --outdir /home/thomas_ueland/pgsc_calc_output --run_ancestry /home/thomas_ueland/pgsc_HGDP+1kGP_v1.tar.zst --min_overlap 0.0
May-31 00:50:06.164 [main] DEBUG nextflow.cli.CmdRun - N E X T F L O W  ~  version 24.04.2
May-31 00:50:06.205 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/home/thomas_ueland/.nextflow/plugins; core-plugins: nf-amazon@2.5.2,nf-azure@1.6.0,nf-cloudcache@0.4.1,nf-codecommit@0.2.0,nf-console@1.1.3,nf-ga4gh@1.3.0,nf-google@1.13.2,nf-tower@1.9.1,nf-wave@1.4.2
May-31 00:50:06.226 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
May-31 00:50:06.228 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
May-31 00:50:06.233 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.10.0 in 'deployment' mode
May-31 00:50:06.259 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
May-31 00:50:06.285 [main] DEBUG nextflow.scm.ProviderConfig - Using SCM config path: /home/thomas_ueland/.nextflow/scm
May-31 00:50:08.227 [main] DEBUG nextflow.scm.AssetManager - Git config: /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/.git/config; branch: main; remote: origin; url: https://github.com/PGScatalog/pgsc_calc.git
May-31 00:50:08.276 [main] DEBUG nextflow.scm.RepositoryFactory - Found Git repository result: [RepositoryFactory]
May-31 00:50:08.291 [main] DEBUG nextflow.scm.AssetManager - Git config: /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/.git/config; branch: main; remote: origin; url: https://github.com/PGScatalog/pgsc_calc.git
May-31 00:50:09.193 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/nextflow.config
May-31 00:50:09.198 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/nextflow.config
May-31 00:50:09.237 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /home/thomas_ueland/.nextflow/secrets/store.json
May-31 00:50:09.242 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@2db6d68d] - activable => nextflow.secret.LocalSecretsProvider@2db6d68d
May-31 00:50:09.266 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `docker`
May-31 00:50:10.592 [main] DEBUG nextflow.config.ConfigBuilder - Available config profiles: [bih, cfc_dev, uzl_omics, ifb_core, denbi_qbic, alice, mjolnir_globe, uppmax, incliva, ilifu, uge, rosalind_uge, lugh, mccleary, unibe_ibu, vai, czbiohub_aws, jax, ccga_med, scw, unc_longleaf, tigem, tubingen_apg, google, apollo, ipop_up, vsc_calcua, pdc_kth, googlels, daisybio, eddie, medair, biowulf, apptainer, bi, bigpurple, adcra, cedars, pawsey_setonix, vsc_kul_uhasselt, pawsey_nimbus, ucl_myriad, utd_ganymede, charliecloud, icr_davros, ceres, munin, arm, rosalind, hasta, cfc, uzh, ebi_codon_slurm, ebc, ku_sund_dangpu, ccga_dx, crick, marvin, biohpc_gen, shifter, mana, mamba, york_viking, unc_lccc, wehi, awsbatch, wustl_htcf, arcc, imperial, maestro, software_license, utd_europa, genotoul, nci_gadi, abims, janelia, nu_genomics, googlebatch, oist, sahmri, mpcdf, leicester, vsc_ugent, create, sage, cambridge, jex, podman, ebi_codon, cheaha, xanadu, nyu_hpc, test, marjorie, computerome, ucd_sonic, seg_globe, sanger, dkfz, pasteur, einstein, ethz_euler, m3c, test_full, imb, ucl_cscluster, tuos_stanage, azurebatch, hki, crukmi, csiro_petrichor, qmul_apocrita, docker, engaging, gis, hypatia, psmn, eva, nygc, fgcz, conda, crg, singularity, self_hosted_runner, tufts, uw_hyak_pedslabs, utd_sysbio, debug, genouest, cbe, phoenix, gitpod, seawulf, uod_hpc, fub_curta, uct_hpc, aws_tower, binac]
May-31 00:50:10.653 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 from script declaration
May-31 00:50:10.654 [main] DEBUG nextflow.cli.CmdRun - Launching `https://github.com/pgscatalog/pgsc_calc` [nostalgic_meucci] DSL2 - revision: 01980336fc [main]
May-31 00:50:10.656 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[]
May-31 00:50:10.657 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins resolved requirement=[]
May-31 00:50:10.736 [main] DEBUG nextflow.Session - Session UUID: aaca916d-33bd-46c9-ae8d-ba1dfffcb34a
May-31 00:50:10.736 [main] DEBUG nextflow.Session - Run name: nostalgic_meucci
May-31 00:50:10.737 [main] DEBUG nextflow.Session - Executor pool size: 4
May-31 00:50:10.750 [main] DEBUG nextflow.file.FilePorter - File porter settings maxRetries=3; maxTransfers=50; pollTimeout=null
May-31 00:50:10.757 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=10; maxSize=12; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
May-31 00:50:10.787 [main] DEBUG nextflow.cli.CmdRun - 
  Version: 24.04.2 build 5914
  Created: 29-05-2024 06:19 UTC 
  System: Linux 6.1.0-21-cloud-amd64
  Runtime: Groovy 4.0.21 on OpenJDK 64-Bit Server VM 17.0.11+9-Debian-1deb12u1
  Encoding: UTF-8 (UTF-8)
  Process: 15472@pgsc-calc-vm-XXXXX [10.128.0.5]
  CPUs: 4 - Mem: 31.4 GB (7.1 GB) - Swap: 0 (0)
May-31 00:50:10.807 [main] DEBUG nextflow.Session - Work-dir: /home/thomas_ueland/work [ext2/ext3]
May-31 00:50:10.808 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/bin
May-31 00:50:10.821 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[]
May-31 00:50:10.837 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
May-31 00:50:10.971 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
May-31 00:50:10.985 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 5; maxThreads: 1000
May-31 00:50:11.115 [main] DEBUG nextflow.Session - Session start
May-31 00:50:11.120 [main] DEBUG nextflow.trace.TraceFileObserver - Workflow started -- trace file: /home/thomas_ueland/pgsc_calc_output/pipeline_info/execution_trace_2024-05-31_00-50-10.txt
May-31 00:50:11.130 [main] DEBUG nextflow.Session - Using default localLib path: /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/lib
May-31 00:50:11.136 [main] DEBUG nextflow.Session - Adding to the classpath library: /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/lib
May-31 00:50:11.908 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
May-31 00:50:12.559 [main] DEBUG nextflow.plugin.PluginUpdater - Installing plugin nf-validation version: latest
May-31 00:50:12.859 [main] INFO  org.pf4j.AbstractPluginManager - Plugin 'nf-validation@1.1.3' resolved
May-31 00:50:12.859 [main] INFO  org.pf4j.AbstractPluginManager - Start plugin 'nf-validation@1.1.3'
May-31 00:50:12.869 [main] DEBUG nextflow.plugin.BasePlugin - Plugin started nf-validation@1.1.3
May-31 00:50:12.871 [main] DEBUG nextflow.script.IncludeDef - Loading included plugin extensions with names: [paramsSummaryLog:paramsSummaryLog, paramsSummaryMap:paramsSummaryMap]; plugin Id: nf-validation
May-31 00:50:12.951 [main] WARN  nextflow.script.ScriptBinding - Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`
~> TaskHandler[id: 20; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:FRAPOSA_PROJECT (biovu); status: RUNNING; exit: -; error: -; workDir: /home/thomas_ueland/work/62/1cf9dc51e3c775d21130da2e44b5bc]
May-31 03:10:24.346 [Task submitter] DEBUG n.processor.TaskPollingMonitor - %% executor local > tasks in the submission queue: 1 -- tasks to be submitted are shown below
~> TaskHandler[id: 21; name: PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE (reference); status: NEW; exit: -; error: -; workDir: /home/thomas_ueland/work/7b/f40429fd46acdc6c668bb4b0c4b95f]
May-31 03:11:24.415 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 20; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:FRAPOSA_PROJECT (biovu); status: COMPLETED; exit: 0; error: -; workDir: /home/thomas_ueland/work/62/1cf9dc51e3c775d21130da2e44b5bc]
May-31 03:11:24.429 [Task submitter] DEBUG n.executor.local.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
May-31 03:11:24.431 [Task submitter] INFO  nextflow.Session - [7b/f40429] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE (reference)
May-31 03:11:29.953 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 21; name: PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE (reference); status: COMPLETED; exit: 1; error: -; workDir: /home/thomas_ueland/work/7b/f40429fd46acdc6c668bb4b0c4b95f]
May-31 03:11:29.964 [TaskFinalizer-8] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE (reference); work-dir=/home/thomas_ueland/work/7b/f40429fd46acdc6c668bb4b0c4b95f
  error [nextflow.exception.ProcessFailedException]: Process `PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE (reference)` terminated with an error exit status (1)
May-31 03:11:30.033 [TaskFinalizer-8] ERROR nextflow.processor.TaskProcessor - Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE (reference)'

Caused by:
  Process `PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE (reference)` terminated with an error exit status (1)

Command executed:

  pgscatalog-aggregate -s reference_ALL_additive_0.sscore.zst biovu_ALL_additive_0.sscore.zst -o . -v --no-split

  cat <<-END_VERSIONS > versions.yml
  SCORE_AGGREGATE:
      pgscatalog.calc: $(echo $(python -c 'import pgscatalog.calc; print(pgscatalog.calc.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/usr/local/bin/pgscatalog-aggregate", line 8, in <module>
      sys.exit(run_aggregate())
               ^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pgscatalog/calc/cli/aggregate_cli.py", line 31, in run_aggregate
      aggregated = functools.reduce(operator.add, pgs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pgscatalog/calc/lib/polygenicscore.py", line 337, in __add__
      for df1, df2 in zip(self, other, strict=True):
  ValueError: zip() argument 2 is longer than argument 1

Work dir:
  /home/thomas_ueland/work/7b/f40429fd46acdc6c668bb4b0c4b95f

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
May-31 03:11:30.045 [Actor Thread 44] ERROR nextflow.Nextflow - ERROR: No scores calculated!
May-31 03:11:30.060 [TaskFinalizer-8] INFO  nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
May-31 03:11:30.066 [Actor Thread 48] ERROR nextflow.Nextflow - ERROR: No results report written!
May-31 03:11:30.072 [main] DEBUG nextflow.Session - Session await > all processes finished
May-31 03:11:30.073 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: local) - terminating tasks monitor poll loop
May-31 03:11:30.073 [main] DEBUG nextflow.Session - Session await > all barriers passed

Comparing sizes of biobank (BioVU) file and reference file in the work directory that failed

drwxr-xr-x 2 thomas_ueland thomas_ueland  4096 May 31 03:11 .
drwxr-xr-x 3 thomas_ueland thomas_ueland  4096 May 31 01:34 ..
-rw-r--r-- 1 thomas_ueland thomas_ueland     0 May 31 03:11 .command.begin
-rw-r--r-- 1 thomas_ueland thomas_ueland   594 May 31 03:11 .command.err
-rw-r--r-- 1 thomas_ueland thomas_ueland   594 May 31 03:11 .command.log
-rw-r--r-- 1 thomas_ueland thomas_ueland     0 May 31 03:11 .command.out
-rw-r--r-- 1 thomas_ueland thomas_ueland 10504 May 31 03:11 .command.run
-rw-r--r-- 1 thomas_ueland thomas_ueland   306 May 31 03:11 .command.sh
-rw-r--r-- 1 thomas_ueland thomas_ueland     0 May 31 03:11 .command.trace
-rw-r--r-- 1 thomas_ueland thomas_ueland     1 May 31 03:11 .exitcode
lrwxrwxrwx 1 thomas_ueland thomas_ueland    90 May 31 03:11 biovu_ALL_additive_0.sscore.zst -> /home/thomas_ueland/work/a0/136599778f8863b30d03a750779147/biovu_ALL_additive_0.sscore.zst
lrwxrwxrwx 1 thomas_ueland thomas_ueland    94 May 31 03:11 reference_ALL_additive_0.sscore.zst -> /home/thomas_ueland/work/d5/d88826bbc0b58a7ef416c2614859fa/reference_ALL_additive_0.sscore.zst
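
Note that the two `.sscore.zst` entries in this listing are symbolic links (`lrwxrwxrwx`), so the 90 and 94 shown are the lengths of the link target paths, not the sizes of the score files themselves. A minimal sketch of how the real sizes could be checked from Python is shown here; the helper name `link_and_target_sizes` and the temporary demo files are hypothetical and only illustrate the difference between `os.lstat` and `os.stat`.

```python
import os
import tempfile


def link_and_target_sizes(path):
    """Return (size of the link entry itself, size of the file it points to)."""
    link_size = os.lstat(path).st_size    # does not follow symlinks
    target_size = os.stat(path).st_size   # follows symlinks to the real file
    return link_size, target_size


# Self-contained demo using a temporary file and a symlink to it
# (hypothetical file names, POSIX systems assumed).
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "real.sscore.zst")
    link = os.path.join(tmp, "link.sscore.zst")
    with open(real, "wb") as fh:
        fh.write(b"x" * 1000)
    os.symlink(real, link)
    demo_link_size, demo_target_size = link_and_target_sizes(link)

print(demo_link_size, demo_target_size)
```

Calling the helper on the two links in the failed work directory would report the actual sizes of the biobank and reference score files.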

Nextflow test profile run (successful)

Launching `https://github.com/pgscatalog/pgsc_calc` [marvelous_fourier] DSL2 - revision: 01980336fc [main]

WARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`

------------------------------------------------------
  pgscatalog/pgsc_calc v2.0.0-alpha.6-g0198033
------------------------------------------------------
Core Nextflow options
  revision                  : main
  runName                   : marvelous_fourier
  containerEngine           : docker
  launchDir                 : /home/thomas_ueland
  workDir                   : /home/thomas_ueland/work
  projectDir                : /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc
  userName                  : thomas_ueland
  profile                   : test,docker
  configFiles               : 

Input/output options
  input                     : /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/assets/examples/samplesheet.csv
  scorefile                 : /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/assets/examples/scorefiles/PGS001229_22.txt
  outdir                    : /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/results

Reference options
  ref_samplesheet           : /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/assets/ancestry/reference.csv
  ld_grch37                 : /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/assets/ancestry/high-LD-regions-hg19-GRCh37.txt
  ld_grch38                 : /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/assets/ancestry/high-LD-regions-hg38-GRCh38.txt
  ancestry_checksums        : /home/thomas_ueland/.nextflow/assets/pgscatalog/pgsc_calc/assets/ancestry/checksums.txt

Compatibility options
  target_build              : GRCh37

Max job request options
  max_cpus                  : 2
  max_memory                : 6.GB
  max_time                  : 6.h

Other parameters
  config_profile_name       : Test profile
  config_profile_description: Minimal test dataset to check pipeline function

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use pgscatalog/pgsc_calc for your analysis please cite:

* The Polygenic Score Catalog
  https://doi.org/10.1038/s41588-021-00783-5

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/pgscatalog/pgsc_calc/blob/master/CITATIONS.md

executor >  local (8)
[fb/0b9c2f] process > PGSCATALOG_PGSCCALC:PGSCCALC:INPUT_CHECK:COMBINE_SCOREFILES (1)                                     [100%] 1 of 1 ✔
[-        ] process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELBIM                                      -
[32/fd936b] process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (cineca chromosome 22)              [100%] 1 of 1 ✔
[-        ] process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_VCF                                             -
[fa/596612] process > PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS (cineca chromosome 22)                            [100%] 1 of 1 ✔
[8a/306319] process > PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_COMBINE (cineca)                                           [100%] 1 of 1 ✔
[9d/096426] process > PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:PLINK2_SCORE (cineca chromosome 22 effect type additive 0) [100%] 1 of 1 ✔
[8e/6c0c3c] process > PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE (cineca)                                   [100%] 1 of 1 ✔
[23/2f3112] process > PGSCATALOG_PGSCCALC:PGSCCALC:REPORT:SCORE_REPORT (cineca)                                           [100%] 1 of 1 ✔
[9c/223816] process > PGSCATALOG_PGSCCALC:PGSCCALC:DUMPSOFTWAREVERSIONS (1)                                               [100%] 1 of 1 ✔
-[pgscatalog/pgsc_calc] Pipeline completed successfully-
Completed at: 31-May-2024 03:48:21
Duration    : 2m 14s
CPU hours   : (a few seconds)
Succeeded   : 8

nebfield commented 5 months ago

Thanks for the bug report! Are you able to run either the test profile or your real data successfully with v2 alpha 5? (-r v2.0.0-alpha.5)

We're still working out some problems that appeared with v2 alpha 6, so we've marked it as pre-release (i.e. not good for production) and have just updated the default branch to v2 alpha 5.
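
As a rough illustration of the suggestion above, a pinned run adds `-r v2.0.0-alpha.5` to the `nextflow run` call. The sketch below only assembles and prints such a command line; the helper name `pinned_run_command` and the shortened argument list are hypothetical, and the real run would keep all of the parameters from the original call.

```python
import shlex


def pinned_run_command(revision, extra_args):
    """Build a `nextflow run` command pinned to a given pipeline revision."""
    cmd = ["nextflow", "run", "pgscatalog/pgsc_calc",
           "-r", revision,          # pin the pipeline release
           "-profile", "docker"]
    cmd.extend(extra_args)
    return cmd


# Placeholder arguments for illustration only.
cmd = pinned_run_command(
    "v2.0.0-alpha.5",
    ["--input", "samplesheet.csv", "--target_build", "GRCh38"],
)
print(shlex.join(cmd))
```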

uelandte commented 5 months ago

That's it! The v2 alpha 5 works great. Thanks again for helping get us up and running with this!

-Thomas