PGScatalog / pgsc_calc

The Polygenic Score Catalog Calculator is a nextflow pipeline for polygenic score calculation
https://pgsc-calc.readthedocs.io/en/latest/
Apache License 2.0

MATCH_VARIANTS always fails on 'Converting VariantFrame to feather format' #329

Closed csjohnson23 closed 3 months ago

csjohnson23 commented 3 months ago

Description of the bug

I'm running the singularity version of the tool on a CentOS 7 Linux cluster. Every time I attempt to run it, whether directly by calling "nextflow run..." or by submitting to the cluster, the pipeline fails in the MATCH_VARIANTS step. Specifically, it always fails at the point of "Converting VariantFrame(path='GRCh37_PROFILE2024_20.pvar.zst', dataset='PROFILE2024', chrom='20', cleanup=True, tmpdir=PosixPath('/tmp/nxf.jGbU3cW4rQ/tmp')) to feather format". I get the following error: "polars.exceptions.ComputeError: found more fields than defined in 'Schema'".

So far, I've tried switching the input type (pfile vs vcf), switching the job scheduler (slurm vs lsf), and changing the sample sheet to be one combined file vs files split by chromosome, with the same result. I always remove the work and results directories before re-testing. The tests included with the pipeline run perfectly. I have a feeling it may have something to do with temporary files not being handled as expected by my HPC.

I'd appreciate any help in debugging this error! Especially if there are any options I should set that might change how files in the matchtmp directory are handled.

samplesheet, config script, and run_pgscalc script are attached.

Command used and terminal output

N E X T F L O W  ~  version 24.04.2
Launching `https://github.com/pgscatalog/pgsc_calc` [trusting_pauling] DSL2 - revision: ccfd6367d5 [main]

------------------------------------------------------
  pgscatalog/pgsc_calc v2.0.0-beta-gccfd636
------------------------------------------------------
Core Nextflow options
  revision          : main
  runName           : trusting_pauling
  containerEngine   : singularity
  launchDir         : /PHShome/cj773/pgs_calc_2
  workDir           : /PHShome/cj773/pgs_calc_2/work
  projectDir        : /PHShome/cj773/.nextflow/assets/pgscatalog/pgsc_calc
  userName          : cj773
  profile           : singularity
  configFiles       : 

Input/output options
  input             : samplesheet.csv
  pgs_id            : PGS004725
  outdir            : /PHShome/cj773/pgs_calc_2/results

Reference options
  ref_samplesheet   : /PHShome/cj773/.nextflow/assets/pgscatalog/pgsc_calc/assets/ancestry/reference.csv
  ld_grch37         : /PHShome/cj773/.nextflow/assets/pgscatalog/pgsc_calc/assets/ancestry/high-LD-regions-hg19-GRCh37.txt
  ld_grch38         : /PHShome/cj773/.nextflow/assets/pgscatalog/pgsc_calc/assets/ancestry/high-LD-regions-hg38-GRCh38.txt
  ancestry_checksums: /PHShome/cj773/.nextflow/assets/pgscatalog/pgsc_calc/assets/ancestry/checksums.txt

Compatibility options
  target_build      : GRCh37

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use pgscatalog/pgsc_calc for your analysis please cite:

* The Polygenic Score Catalog
  https://doi.org/10.1101/2024.05.29.24307783
  https://doi.org/10.1038/s41588-021-00783-5

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/pgscatalog/pgsc_calc/blob/main/CITATIONS.md

WARN: Singularity cache directory has not been defined -- Remote image will be stored in the path: /PHShome/cj773/pgs_calc_2/work/singularity -- Use the environment variable NXF_SINGULARITY_CACHEDIR to specify a different location
Pulling Singularity image oras://ghcr.io/pgscatalog/plink2:2.00a5.10-singularity [cache /PHShome/cj773/pgs_calc_2/work/singularity/ghcr.io-pgscatalog-plink2-2.00a5.10-singularity.img]
Pulling Singularity image oras://ghcr.io/pgscatalog/pygscatalog:pgscatalog-utils-1.1.2-singularity [cache /PHShome/cj773/pgs_calc_2/work/singularity/ghcr.io-pgscatalog-pygscatalog-pgscatalog-utils-1.1.2-singularity.img]
[dc/4e8a14] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 2)
[00/12e513] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 1)
[78/d45ebd] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 3)
[4e/cbf8f2] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 4)
[ba/25ea5e] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 5)
[bd/450c07] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 7)
[d4/6fe673] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 6)
[08/9f3d20] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 9)
[4a/07e337] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 8)
[bc/7eb6b6] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 11)
[71/3e8714] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 10)
[a8/e036ff] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 12)
[b5/bbf717] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 13)
[60/00cc09] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 14)
[8d/fc3e6d] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 15)
[e6/d181de] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 16)
[f3/0763ee] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 17)
[55/10cbf3] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 19)
[3e/6bc08d] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 18)
[c4/afe3b1] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 20)
[07/325863] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 22)
[dd/16aea7] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (PROFILE2024 chromosome 21)
[6a/bdb7fd] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:DOWNLOAD_SCOREFILES ([pgs_id:PGS004725, pgp_id:, trait_efo:])
[b7/e18b6f] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:INPUT_CHECK:COMBINE_SCOREFILES (1)
[da/e611d2] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS (PROFILE2024 chromosome 20)
[5d/99ad3e] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS (PROFILE2024 chromosome 22)
[2e/4f5b19] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS (PROFILE2024 chromosome 17)
ERROR ~ Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS (PROFILE2024 chromosome 20)'

Caused by:
  Process `PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS (PROFILE2024 chromosome 20)` terminated with an error exit status (1)

Command executed:

  export POLARS_MAX_THREADS=8

  pgscatalog-match                  --dataset PROFILE2024         --scorefile scorefiles.txt.gz         --target GRCh37_PROFILE2024_20.pvar.zst         --only_match         --chrom 20                           --outdir $PWD         -v

  cat <<-END_VERSIONS > versions.yml
  MATCH_VARIANTS:
      pgscatalog.match: $(echo $(python -c 'import pgscatalog.match; print(pgscatalog.match.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  pgscatalog.match.cli.match_cli: 2024-07-01 10:23:26 WARNING  No output format specified, writing to combined scoring file
  pgscatalog.match.cli.match_cli: 2024-07-01 10:23:26 DEBUG    Verbose logging enabled
  pgscatalog.match.cli.match_cli: 2024-07-01 10:23:26 INFO     --cleanup set (default), temporary files will be deleted
  pgscatalog.match.lib.scoringfileframe: 2024-07-01 10:23:26 DEBUG    Converting ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) to feather format
  pgscatalog.match.lib.scoringfileframe: 2024-07-01 10:23:28 DEBUG    ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) feather conversion complete
  pgscatalog.match.lib._match.preprocess: 2024-07-01 10:23:28 DEBUG    Complementing column effect_allele
  pgscatalog.match.lib._match.preprocess: 2024-07-01 10:23:28 DEBUG    Complementing column other_allele
  pgscatalog.match.lib.scoringfileframe: 2024-07-01 10:23:28 DEBUG    Filtering scoring file to chromosome 20
  pgscatalog.match.lib.variantframe: 2024-07-01 10:23:28 DEBUG    Converting VariantFrame(path='GRCh37_PROFILE2024_20.pvar.zst', dataset='PROFILE2024', chrom='20', cleanup=True, tmpdir=PosixPath('/tmp/nxf.jGbU3cW4rQ/tmp')) to feather format
  Traceback (most recent call last):
    File "/app/pgscatalog.utils/.venv/bin/pgscatalog-match", line 8, in <module>
      sys.exit(run_match())
               ^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/match_cli.py", line 87, in run_match
      ipc_path = get_match_candidates(
                 ^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/match_cli.py", line 124, in get_match_candidates
      with variants as target_df:
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/variantframe.py", line 54, in __enter__
      self.arrowpaths = loose(self.variants, tmpdir=self._tmpdir)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper
      return dispatch(args[0].__class__)(*args, **kw)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/_arrow.py", line 94, in _
      return batch_read(reader, tmpdir=tmpdir, cols_keep=cols_keep)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/_arrow.py", line 102, in batch_read
      batches = reader.next_batches(batch_size)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/polars/io/csv/batched_reader.py", line 134, in next_batches
      batches = self._reader.next_batches(n)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  polars.exceptions.ComputeError: found more fields than defined in 'Schema'

  Consider setting 'truncate_ragged_lines=True'.

Work dir:
  /PHShome/cj773/pgs_calc_2/work/da/e611d2fed8a52aed316ed62fad0267

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details
Execution cancelled -- Finishing pending tasks before exit
ERROR ~ ERROR: Matching subworkflow failed

 -- Check '.nextflow.log' file for details
ERROR ~ ERROR: No scores calculated!

 -- Check '.nextflow.log' file for details
ERROR ~ ERROR: No results report written!

 -- Check '.nextflow.log' file for details

Relevant files

Samplesheet: pfile_samplesheet.csv

config file:

process {
    clusterOptions = ''
    scratch = true

    withLabel:process_low {
        queue = 'normal'
        cpus   = 2
        memory = 8.GB
        time   = 1.h
    }
    withLabel:process_medium {
        queue = 'bigmem'
        cpus   = 8
        memory = 64.GB
        time   = 4.h
    }
}

executor {
    name = 'slurm'
    jobName = { "$task.hash" }
    submitOptions = '-N 1'
}

run_pgscalc_test.sh:

#!/bin/bash

export NXF_ANSI_LOG=false
export NXF_OPTS="-Xms500M -Xmx2G"

module load nextflow
module load singularity/3.7.0

nextflow run pgscatalog/pgsc_calc \
    -profile singularity \
    --input samplesheet.csv \
    --pgs_id PGS004725 \
    --target_build GRCh37 \
    -c /PHShome/cj773/configs/pgs_calc_test.config

.nextflow.log output: nextflow.log

System information

Nextflow version: 24.04.2
Hardware: HPC
Executor: slurm, lsf
Container engine: Singularity
OS: CentOS 7 Linux
Version of pgsc_calc: v2.0.0-beta-gccfd636

nebfield commented 3 months ago

Thanks for the bug report. Could you test with the development branch please?

$ nextflow run pgscatalog/pgsc_calc -r dev -latest \
    -profile singularity \
    --input samplesheet.csv \
    --pgs_id PGS004725 \
    --target_build GRCh37 \
    -c /PHShome/cj773/configs/pgs_calc_test.config

Some VCFs are causing unexpected columns to be written to the variant information files. I just updated this branch to write a consistent column set which should hopefully help.

You'll also need to delete the work directory to remove the cache.
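
To check whether a variant information file really has ragged rows (the condition behind "found more fields than defined in 'Schema'"), a quick field-count comparison can help. The following is a minimal sketch, not part of pgsc_calc: the function name `field_count_report` is hypothetical, it assumes tab-separated text with `##` meta lines and a single `#CHROM` header line, and it assumes the file has already been decompressed (a `.pvar.zst` file would need to be decompressed first, for example with the `zstd` tool).

```python
from collections import Counter
from typing import Iterable


def field_count_report(lines: Iterable[str], sep: str = "\t") -> dict:
    """Compare the number of fields on each data line with the header line.

    Hypothetical helper: assumes '##' lines are metadata and the first
    remaining line (e.g. '#CHROM ...') is the header.
    """
    header_fields = None
    counts = Counter()
    first_mismatch = None
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and metadata
        n_fields = len(line.split(sep))
        if header_fields is None:
            header_fields = n_fields  # first non-meta line is the header
            continue
        counts[n_fields] += 1
        if n_fields != header_fields and first_mismatch is None:
            first_mismatch = lineno  # 1-based line number in the input
    return {
        "header_fields": header_fields,
        "data_field_counts": dict(counts),
        "first_mismatch_line": first_mismatch,
    }


# Small in-memory example; the last row has one extra field.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "20\t61098\trs1\tC\tT",
    "20\t61270\trs2\tA\tC\tPASS",
]
report = field_count_report(example)
print(report)
```

For a real file, passing an open file handle (`with open(path) as fh: field_count_report(fh)`) should work the same way. If `first_mismatch_line` is not `None`, that line is a good place to start looking.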

csjohnson23 commented 3 months ago

Tried it a few times, but the dev branch consistently breaks on DOWNLOAD_SCOREFILES with a socket timeout. I tried submitting to a couple of different partitions to make sure it's not a cluster issue, and kept all else the same. This step passes on the non-dev branch.

Error is as follows:

[b1/d738c6] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:DOWNLOAD_SCOREFILES ([pgs_id:PGS004725, pgp_id:, trait_efo:])
WARN: [SLURM] queue (normal) status cannot be fetched

nebfield commented 3 months ago

There are no changes to the download process on the dev branch, which is typically reliable. Socket timeouts are more likely to be a cluster problem, or perhaps a temporary network issue.

If the problem is consistent perhaps you could use pgscatalog-download to preload scoring files?

You could use the --scorefile parameter to use the downloaded scoring files.

You can install the pgscatalog.core package with pip or bioconda.

Fiwx commented 3 months ago

I got this error as well.

> Some VCFs are causing unexpected columns to be written to the variant information files. I just updated this branch to write a consistent column set which should hopefully help.

The VCF I am testing with has run successfully before. I tried the development version, which fixed the error and the pipeline completed successfully.


  pgscatalog.match.cli.match_cli: 2024-07-02 18:44:40 DEBUG    Verbose logging enabled
  pgscatalog.match.cli.match_cli: 2024-07-02 18:44:40 INFO     --cleanup set (default), temporary files will be deleted
  pgscatalog.match.lib.scoringfileframe: 2024-07-02 18:44:40 DEBUG    Converting ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) to feather format
  pgscatalog.match.lib.scoringfileframe: 2024-07-02 18:44:41 DEBUG    ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) feather conversion complete
  pgscatalog.match.lib._match.preprocess: 2024-07-02 18:44:41 DEBUG    Complementing column effect_allele
  pgscatalog.match.lib._match.preprocess: 2024-07-02 18:44:41 DEBUG    Complementing column other_allele
  pgscatalog.match.lib.variantframe: 2024-07-02 18:44:41 DEBUG    Converting VariantFrame(path='GRCh37_newautosomal_ALL.pvar.zst', dataset='newautosomal', chrom=None, cleanup=True, tmpdir=PosixPath('tmp')) to feather format
  Traceback (most recent call last):
    File "/app/pgscatalog.utils/.venv/bin/pgscatalog-match", line 8, in <module>
      sys.exit(run_match())
               ^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/match_cli.py", line 87, in run_match
      ipc_path = get_match_candidates(
                 ^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/match_cli.py", line 124, in get_match_candidates
      with variants as target_df:
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/variantframe.py", line 54, in __enter__
      self.arrowpaths = loose(self.variants, tmpdir=self._tmpdir)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper
      return dispatch(args[0].__class__)(*args, **kw)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/_arrow.py", line 94, in _
      return batch_read(reader, tmpdir=tmpdir, cols_keep=cols_keep)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/_arrow.py", line 102, in batch_read
      batches = reader.next_batches(batch_size)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/polars/io/csv/batched_reader.py", line 134, in next_batches
      batches = self._reader.next_batches(n)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  polars.exceptions.ComputeError: found more fields than defined in 'Schema'

  Consider setting 'truncate_ragged_lines=True'.

csjohnson23 commented 3 months ago

Still have yet to make it to MATCH_VARIANTS on the dev branch to test the solution; pipeline gets stuck even when using downloaded files.

Also, when I try to run

 nextflow run pgscatalog/pgsc_calc \
    -profile singularity \
    --input samplesheet.csv \
    --scorefile downloads/PGS004725_hmPOS_GRCh37.txt.gz \
    --target_build GRCh37 \
    -c ~/configs/pgs_calc_test_slurm.config

it can no longer run, I get this error:

Project pgscatalog/pgsc_calc is currently stuck on revision: dev -- you need to explicitly specify a revision with the option `-r` in order to use it

Is there a way I could 1) update my input VCFs to match the expected columns for the published v2 package and 2) circumvent the error above to be able to use the published version again?

smlmbrt commented 3 months ago

@csjohnson23, you need to add -r dev to your command and it should fix that error and run. If you want to edit VCFs you can truncate to only have standard pvar columns (https://www.cog-genomics.org/plink/2.0/formats#pvar).
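
For anyone who prefers to trim their variant files rather than switch branches, here is a rough sketch of the truncation idea. It is an illustration only, not code from pgsc_calc or plink2: the function name `keep_columns` and the default column subset are assumptions, and it works on decompressed, tab-separated text with `##` meta lines and a `#CHROM` header line. Check the pvar format page linked above for which columns your files actually need.

```python
from typing import Iterable, Iterator, Sequence

# Assumed column subset for illustration; adjust to your own data.
KEEP = ("#CHROM", "POS", "ID", "REF", "ALT")


def keep_columns(
    lines: Iterable[str], keep: Sequence[str] = KEEP, sep: str = "\t"
) -> Iterator[str]:
    """Yield lines restricted to the named header columns.

    Hypothetical helper: '##' meta lines pass through unchanged, the first
    other line is treated as the header, and extra fields are dropped.
    """
    idx = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##"):
            yield line  # metadata is kept as-is
            continue
        fields = line.split(sep)
        if idx is None:
            missing = [c for c in keep if c not in fields]
            if missing:
                raise ValueError(f"header is missing columns: {missing}")
            idx = [fields.index(c) for c in keep]
        if len(fields) <= max(idx):
            raise ValueError(f"too few fields in line: {line!r}")
        yield sep.join(fields[i] for i in idx)


# Small in-memory example; the last row carries an unexpected extra field.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t61098\trs1\tC\tT\t.\tPASS\tAF=0.1",
    "20\t61270\trs2\tA\tC\t.\tPASS\tAF=0.2\tunexpected",
]
cleaned = list(keep_columns(example))
print("\n".join(cleaned))
```

Selecting columns by header position means any trailing fields beyond the header are discarded, which is the same effect the `truncate_ragged_lines` hint in the error message describes. Keep a copy of the original files, since dropped columns cannot be recovered from the output.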

SalemWerdyani commented 3 months ago

After the latest updates of the pipeline, I started to have the exact same problem using docker and conda without any change in my data (BED, PGEN, or VCF).

smlmbrt commented 3 months ago

> After the latest updates of the pipeline, I started to have the exact same problem using docker and conda without any change in my data.

@SalemWerdyani, yes it's because we slightly changed the way we read the variant files. The current dev branch (adding -r dev) should solve the problem for now if you don't want to truncate your VCF data columns.

SalemWerdyani commented 3 months ago

I saw the earlier comments and tried adding -r dev, but it did not work.

smlmbrt commented 3 months ago

> I saw the earlier comments and tried adding -r dev, it did not work

Could you let us know what the error was? You may also have to add -latest if your copy of the repo is out of date.

SalemWerdyani commented 3 months ago

Thanks, Sam, for the help and support. It works now.

nebfield commented 3 months ago

I've merged the fix now:

$ nextflow run pgscatalog/pgsc_calc -latest ...

should work OK, but the specific release -r v2.0.0-beta will remain broken until we do a new patch.

Thanks everybody for reporting their experiences 😄