PGScatalog / pgsc_calc

The Polygenic Score Catalog Calculator is a nextflow pipeline for polygenic score calculation
https://pgsc-calc.readthedocs.io/en/latest/
Apache License 2.0

The error of MATCH_COMBINE #335

Closed tsaojack1234 closed 3 months ago

tsaojack1234 commented 4 months ago

Hello, I would like to ask some questions and thank you for the tool.

This is my error description:

Command error:
  pgscatalog.match.cli.merge_cli: 2024-07-09 09:07:50 DEBUG    Verbose logging enabled
  pgscatalog.match.cli.merge_cli: 2024-07-09 09:07:50 INFO     --cleanup set (default), temporary files will be deleted
  pgscatalog.match.lib.scoringfileframe: 2024-07-09 09:07:50 DEBUG    Converting ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) to feather format
  pgscatalog.match.lib.scoringfileframe: 2024-07-09 09:07:50 DEBUG    ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) feather conversion complete
  pgscatalog.match.lib._match.preprocess: 2024-07-09 09:07:50 DEBUG    Complementing column
executor >  local (3)
[26/aba7ee] PGS…SCCALC:INPUT_CHECK:COMBINE_SCOREFILES | 1 of 1 ✔
[-        ] PGS…ALC:MAKE_COMPATIBLE:PLINK2_RELABELBIM -
[-        ] PGS…LC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR -
[skipped  ] PGS…IBLE:PLINK2_VCF (cineca chromosome 1) | 1 of 1, stored: 1 ✔
[24/50b9e4] PGS…:MATCH_VARIANTS (cineca chromosome 1) | 1 of 1 ✔
[2c/01b131] PGS…PGSCCALC:MATCH:MATCH_COMBINE (cineca) | 1 of 1, failed: 1 ✘
[-        ] PGS…ALC:PGSCCALC:APPLY_SCORE:PLINK2_SCORE -
[-        ] PGS…:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE -
[-        ] PGS…PGSCCALC:PGSCCALC:REPORT:SCORE_REPORT -
[-        ] PGS…GSCCALC:PGSCCALC:DUMPSOFTWAREVERSIONS -
Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_COMBINE (cineca)'
Caused by:
  Process `PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_COMBINE (cineca)` terminated with an error exit status (15)

Command executed:

  export POLARS_MAX_THREADS=2
    pgscatalog-matchmerge  --dataset cineca --scorefile scorefiles.txt.gz  --matches *.ipc.zst --min_overlap 0.75 --outdir $PWD --split -v

  cat <<-END_VERSIONS > versions.yml
  MATCH_COMBINE:
      pgscatalog.match: $(echo $(python -c 'import pgscatalog.match; print(pgscatalog.match.__version__)'))
  END_VERSIONS

Command exit status:
  15

Command output:
  (empty)

This is my command line:

nextflow run pgscatalog/pgsc_calc \
-r v2.0.0-alpha.6 \
-profile conda \
--input SRR515199_PGS000025_vcf.sorted.csv \
--scorefile PGS000025_hmPOS_GRCh38.txt \
--pgs_id PGS000025_hmPOS_GRCh38 \
--liftover \
--target_build GRCh38 \
--hg19_chain hg19ToHg38.over.chain.gz \
--hg38_chain hg38ToHg19.over.chain.gz \
--max_cpus 8 \
--max_memory 31.GB

I tried "v2.0.0-alpha.6" and "v2.0.0-beta", but neither of them worked.

input_file: SRR515199_PGS000025.sorted.vcf.gz SRR515199_PGS000025_vcf.sorted.csv

environment: nextflow version 24.04.2.5914 Ubuntu 18.04

Best regards

nebfield commented 4 months ago

A few suggestions to get started:

1) The samplesheet says the VCF only contains chromosome 1, but it contains multiple chromosomes. If your target genomes contain multiple chromosomes the chrom column should be empty.
2) Your VCF has low variant density and not many samples. The calculator works best with imputed cohort data.
3) Try nextflow run pgscatalog/pgsc_calc -r main -latest ... to use the main branch
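The first suggestion can be checked before launching the pipeline. The following is a minimal sketch, not part of pgsc_calc: the helper names `vcf_chromosomes` and `check_samplesheet_row` are hypothetical, and the tiny in-memory VCF and samplesheet stand in for real files. It lists the distinct chromosomes in a VCF and warns when the samplesheet's chrom column names a single chromosome although the file holds several.

```python
import csv
import io


def vcf_chromosomes(lines):
    """Return the distinct CHROM values from VCF text lines, in first-seen order."""
    seen = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.append(chrom)
    return seen


def check_samplesheet_row(row, chroms):
    """Return a warning if 'chrom' is set but the file has several chromosomes, else None."""
    declared = (row.get("chrom") or "").strip()
    if declared and len(chroms) > 1:
        return (
            f"chrom={declared!r} but the file contains "
            f"{len(chroms)} chromosomes; leave chrom empty"
        )
    return None


# Made-up example data (hypothetical, for illustration only).
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t.\tA\tG\t.\t.\t.\n",
    "2\t2000\t.\tC\tT\t.\t.\t.\n",
]
example_sheet = io.StringIO(
    "sampleset,path_prefix,chrom,format\ncineca,example,1,vcf\n"
)

chroms = vcf_chromosomes(example_vcf)
row = next(csv.DictReader(example_sheet))
warning = check_samplesheet_row(row, chroms)
print(chroms, warning)
```

For a real compressed VCF, the same function can be fed `gzip.open(path, "rt")` from the standard library instead of the in-memory list.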

tsaojack1234 commented 4 months ago

Hello, thank you for your answer. I followed your suggestions and did the following steps:

1. Kept only the chr1 sites.
2. Submitted the file to the Michigan Imputation Server and obtained "chr1.dose.vcf.gz".
3. Used plink2 to convert "chr1.dose.vcf.gz" into "chr1.dose_axy.pgen", "chr1.dose_axy.psam", and "chr1.dose_axy.pvar".

plink2 --vcf chr1.dose.vcf.gz \
--allow-extra-chr \
--chr 1 \
--make-pgen \
--out chr1.dose_axy

4. Finally put it into the main program.

nextflow run pgscatalog/pgsc_calc \
-r main -latest \
-profile conda \
--input chr1.dose.csv \
--scorefile PGS000137_hmPOS_GRCh38.txt \
--pgs_id PGS000137 \
--target_build GRCh38

These are the contents of chr1.dose.csv:

sampleset,path_prefix,chrom,format
cineca,chr1.dose_axy,1,pfile

But I got this error message:

File "/home/yuliangtsao/ext_hdd2/prs/work/conda/pgscatalog-utils-cc52ffcd2b21fb989b3730
executor >  local (3)
[45/dfedfb] PGS…s_id:PGS000137, pgp_id:, trait_efo:]) | 1 of 1 ✔
[b4/2a899b] PGS…LC:INPUT_CHECK:COMBINE_SCOREFILES (1) | 1 of 1 ✔
[-        ] PGS…ALC:MAKE_COMPATIBLE:PLINK2_RELABELBIM -
[skipped  ] PGS…NK2_RELABELPVAR (cineca chromosome 1) | 1 of 1, stored: 1 ✔
[-        ] PGS…C:PGSCCALC:MAKE_COMPATIBLE:PLINK2_VCF -
[1d/8e0adf] PGS…:MATCH_VARIANTS (cineca chromosome 1) | 1 of 1, failed: 1 ✘
[-        ] PGS…PGSCCALC:PGSCCALC:MATCH:MATCH_COMBINE -
[-        ] PGS…ALC:PGSCCALC:APPLY_SCORE:PLINK2_SCORE -
[-        ] PGS…:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE -
[-        ] PGS…PGSCCALC:PGSCCALC:REPORT:SCORE_REPORT -
[-        ] PGS…GSCCALC:PGSCCALC:DUMPSOFTWAREVERSIONS -
Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS (cineca chromosome 1)'

Caused by:
  Process `PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS (cineca chromosome 1)` terminated with an error exit status (1)

Command executed:

  export POLARS_MAX_THREADS=2
    pgscatalog-match --dataset cineca --scorefile scorefiles.txt.gz --target GRCh38_cineca_1.pvar.zst --only_match --chrom 1                           --outdir $PWD -v

  cat <<-END_VERSIONS > versions.yml
  MATCH_VARIANTS:
      pgscatalog.match: $(echo $(python -c 'import pgscatalog.match; print(pgscatalog.match.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  pgscatalog.match.cli.match_cli: 2024-07-10 08:48:29 WARNING  No output format specified, writing to combined scoring file
  pgscatalog.match.cli.match_cli: 2024-07-10 08:48:29 DEBUG    Verbose logging enabled
  pgscatalog.match.cli.match_cli: 2024-07-10 08:48:29 INFO     --cleanup set (default), temporary files will be deleted
  pgscatalog.match.lib.scoringfileframe: 2024-07-10 08:48:29 DEBUG    Converting ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) to feather format
  pgscatalog.match.lib.scoringfileframe: 2024-07-10 08:48:29 DEBUG    ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) feather conversion complete
  pgscatalog.match.lib._match.preprocess: 2024-07-10 08:48:29 DEBUG    Complementing column effect_allele
  pgscatalog.match.lib._match.preprocess: 2024-07-10 08:48:29 DEBUG    Complementing column other_allele
  pgscatalog.match.lib.scoringfileframe: 2024-07-10 08:48:29 DEBUG    Filtering scoring file to chromosome 1
  pgscatalog.match.lib.variantframe: 2024-07-10 08:48:29 DEBUG    Converting VariantFrame(path='GRCh38_cineca_1.pvar.zst', dataset='cineca', chrom='1', cleanup=True, tmpdir=PosixPath('tmp')) to feather format
......

Please let me know if I've missed anything, thanks.

nebfield commented 4 months ago

You probably shouldn't be using a single chromosome to calculate a PGS. PGS000137 contains variants from many chromosomes, so it will cause match errors.
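To see why a single-chromosome target is a problem, it helps to count how a score's variants are spread across chromosomes. The sketch below is an illustration only: `chromosome_counts` and `max_overlap` are hypothetical helpers, the four-variant scoring file is made up, and the `hm_chr` column name is an assumption about the harmonised scoring file layout (a non-harmonised file would more likely use `chr_name`). It gives an upper bound on the fraction of score variants that a target limited to certain chromosomes could contain; how the pipeline computes overlap internally is not shown in this thread.

```python
from collections import Counter


def chromosome_counts(lines, column="hm_chr"):
    """Count scoring-file variants per chromosome, skipping '#' comment lines."""
    header = None
    idx = None
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if header is None:
            header = fields
            idx = header.index(column)  # raises ValueError if the column is absent
            continue
        counts[fields[idx]] += 1
    return counts


def max_overlap(counts, target_chroms):
    """Upper bound on the fraction of score variants present in the target chromosomes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hit = sum(n for chrom, n in counts.items() if chrom in target_chroms)
    return hit / total


# Made-up scoring file (hypothetical values, for illustration only).
example_score = [
    "#pgs_id=EXAMPLE\n",
    "rsID\teffect_allele\teffect_weight\thm_chr\thm_pos\n",
    "rs1\tA\t0.10\t1\t1000\n",
    "rs2\tC\t0.20\t2\t2000\n",
    "rs3\tG\t0.05\t2\t3000\n",
    "rs4\tT\t0.30\t3\t4000\n",
]

counts = chromosome_counts(example_score)
overlap = max_overlap(counts, {"1"})
print(dict(counts), overlap)
```

In this toy case at most a quarter of the variants could match a chromosome 1 target, well under the --min_overlap 0.75 value that appears in the MATCH_COMBINE command earlier in the thread.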

The full logs would be helpful to understand more. The logs are stored in the working directory of the process that's failing (work/1d/8e0adf.../.command.err).

tsaojack1234 commented 4 months ago

This is file ".command.err":

pgscatalog.match.cli.match_cli: 2024-07-10 08:48:29 WARNING  No output format specified, writing to combined scoring file
pgscatalog.match.cli.match_cli: 2024-07-10 08:48:29 DEBUG    Verbose logging enabled
pgscatalog.match.cli.match_cli: 2024-07-10 08:48:29 INFO     --cleanup set (default), temporary files will be deleted
pgscatalog.match.lib.scoringfileframe: 2024-07-10 08:48:29 DEBUG    Converting ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) to feather format
pgscatalog.match.lib.scoringfileframe: 2024-07-10 08:48:29 DEBUG    ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) feather conversion complete
pgscatalog.match.lib._match.preprocess: 2024-07-10 08:48:29 DEBUG    Complementing column effect_allele
pgscatalog.match.lib._match.preprocess: 2024-07-10 08:48:29 DEBUG    Complementing column other_allele
pgscatalog.match.lib.scoringfileframe: 2024-07-10 08:48:29 DEBUG    Filtering scoring file to chromosome 1
pgscatalog.match.lib.variantframe: 2024-07-10 08:48:29 DEBUG    Converting VariantFrame(path='GRCh38_cineca_1.pvar.zst', dataset='cineca', chrom='1', cleanup=True, tmpdir=PosixPath('/home/yuliangtsao/ext_hdd2/prs/work/1d/8e0adfe7a6e6aa8622c62d0585853a/tmp')) to feather format
Traceback (most recent call last):
  File "/home/yuliangtsao/ext_hdd2/prs/work/conda/pgscatalog-utils-cc52ffcd2b21fb989b3730d3c8b46423/bin/pgscatalog-match", line 10, in <module>
    sys.exit(run_match())
             ^^^^^^^^^^^
  File "/home/yuliangtsao/ext_hdd2/prs/work/conda/pgscatalog-utils-cc52ffcd2b21fb989b3730d3c8b46423/lib/python3.12/site-packages/pgscatalog/match/cli/match_cli.py", line 87, in run_match
    ipc_path = get_match_candidates(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/yuliangtsao/ext_hdd2/prs/work/conda/pgscatalog-utils-cc52ffcd2b21fb989b3730d3c8b46423/lib/python3.12/site-packages/pgscatalog/match/cli/match_cli.py", line 124, in get_match_candidates
    with variants as target_df:
  File "/home/yuliangtsao/ext_hdd2/prs/work/conda/pgscatalog-utils-cc52ffcd2b21fb989b3730d3c8b46423/lib/python3.12/site-packages/pgscatalog/match/lib/variantframe.py", line 54, in __enter__
    self.arrowpaths = loose(self.variants, tmpdir=self._tmpdir)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yuliangtsao/ext_hdd2/prs/work/conda/pgscatalog-utils-cc52ffcd2b21fb989b3730d3c8b46423/lib/python3.12/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yuliangtsao/ext_hdd2/prs/work/conda/pgscatalog-utils-cc52ffcd2b21fb989b3730d3c8b46423/lib/python3.12/site-packages/pgscatalog/match/lib/_arrow.py", line 94, in _
    return batch_read(reader, tmpdir=tmpdir, cols_keep=cols_keep)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yuliangtsao/ext_hdd2/prs/work/conda/pgscatalog-utils-cc52ffcd2b21fb989b3730d3c8b46423/lib/python3.12/site-packages/pgscatalog/match/lib/_arrow.py", line 102, in batch_read
    batches = reader.next_batches(batch_size)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yuliangtsao/ext_hdd2/prs/work/conda/pgscatalog-utils-cc52ffcd2b21fb989b3730d3c8b46423/lib/python3.12/site-packages/polars/io/csv/batched_reader.py", line 134, in next_batches
    batches = self._reader.next_batches(n)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.
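
The polars message means that at least one row of the variant file has more delimited fields than its header defines. One way to locate such rows is sketched below. This is a standalone diagnostic under stated assumptions, not something the pipeline provides: `find_ragged_lines` is a hypothetical helper, the input is assumed to be already decompressed tab-delimited text, lines starting with `##` are treated as meta-information, and the first remaining line is taken as the header. It reports where a row is ragged and does not say what caused the problem in this thread.

```python
def find_ragged_lines(lines, sep="\t", limit=5):
    """Return (line_number, n_fields, expected) for rows wider than the header.

    Lines starting with '##' and blank lines are skipped; the first remaining
    line is assumed to be the header that defines the expected field count.
    """
    expected = None
    ragged = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("##") or not line.strip():
            continue
        n_fields = len(line.rstrip("\n").split(sep))
        if expected is None:
            expected = n_fields  # header row sets the schema width
            continue
        if n_fields > expected:
            ragged.append((number, n_fields, expected))
            if len(ragged) >= limit:
                break  # a handful of examples is enough to inspect
    return ragged


# Made-up variant file with one ragged row (hypothetical, for illustration only).
example_pvar = [
    "##source=example\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t1000\trs1\tA\tG\n",
    "1\t2000\trs2\tC\tT\tEXTRA\n",
]

ragged = find_ragged_lines(example_pvar)
print(ragged)  # [(4, 6, 5)]
```

Inspecting the reported line numbers in the decompressed file shows which rows carry the extra fields.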

thank you.

nebfield commented 4 months ago

Could you try again with the latest release please:

$ rm -r work/ # delete any existing caches
$ nextflow run pgscatalog/pgsc_calc -r v2.0.0-beta.1 ...