PGScatalog / pgsc_calc

The Polygenic Score Catalog Calculator is a nextflow pipeline for polygenic score calculation
https://pgsc-calc.readthedocs.io/en/latest/
Apache License 2.0

Error - PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:MATCH_COMBINE #86

Closed by gmmhe 1 year ago

gmmhe commented 1 year ago

Hi,

I'm trying to calculate my first polygenic risk scores using the PGS Catalog, but I ran into an issue. The error output is copied below. I think the problem involves the preparation of my input genomes. I used plink2 v2.00a3.7 64-bit and set up the chromosomes using your example code from this documentation https://pgsc-calc.readthedocs.io/en/dev/how-to/prepare.html :

./plink2 --vcf chr21.merged.clean.noMono.vcf.gz \
    --allow-extra-chr \
    --chr 1-22, X, Y, XY \
    --make-pgen --out chr21_axy

I then run pgsc_calc with this command:

./nextflow run pgscatalog/pgsc_calc \
    -profile docker \
    --input samplesheet3.csv --target_build GRCh38 \
    --pgs_id PGS000027 --target_build GRCh38

It seems that the problem is with the --chrom parameter, but I was following the documented steps. I have not used all the chromosomes yet (I tried first with 1 chromosome and later with 3), but I don't think this is the problem, so I cannot see where the issue is. Here is the error:

Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:MATCH_COMBINE (cineca)'

Caused by:
Process PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:MATCH_COMBINE (cineca) terminated with an error exit status (1)

executor > local (7)
[70/1bcb7b] process > PGSCATALOG_PGSCALC:PGSCALC:DOWNLOAD_SCOREFILES ([pgs_id:PGS000027, pgp_id:, trait_efo:]) [100%] 1 of 1 ✔
[8a/80ab3a] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:SAMPLESHEET_JSON (samplesheet3.csv) [100%] 1 of 1 ✔
[7f/1649ca] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:COMBINE_SCOREFILES (1) [100%] 1 of 1 ✔
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELBIM -
[skipped ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (cineca chromosome 21) [100%] 3 of 3, stored: 3 ✔
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF -
[84/d0940d] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:MATCH_VARIANTS (cineca chromosome 1) [100%] 3 of 3 ✔
[7d/c7d7fe] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:MATCH_COMBINE (cineca) [100%] 1 of 1, failed: 1 ✘
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:PLINK2_SCORE -
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:SCORE_AGGREGATE -
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:SCORE_REPORT -
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:DUMPSOFTWAREVERSIONS -
Execution cancelled -- Finishing pending tasks before exit
Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:MATCH_COMBINE (cineca)'

Caused by: Process PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:MATCH_COMBINE (cineca) terminated with an error exit status (1)

Command executed:

export POLARS_MAX_THREADS=2

combine_matches --dataset cineca --scorefile scorefiles.txt.gz --matches *.ipc.zst -n 2 --min_overlap 0.75 --outdir $PWD --split -v

cat <<-END_VERSIONS > versions.yml
MATCH_COMBINE:
    pgscatalog_utils: $(echo $(python -c 'import pgscatalog_utils; print(pgscatalog_utils.__version__)'))
END_VERSIONS

Command exit status: 1

Command output: (empty)

Command error:
root: 2023-02-16 15:00:22 DEBUG Verbose logging enabled
pgscatalog_utils.config: 2023-02-16 15:00:22 DEBUG Using 2 threads to read CSVs
pgscatalog_utils.config: 2023-02-16 15:00:22 DEBUG polars threadpool size: 2
pgscatalog_utils.match.read: 2023-02-16 15:00:22 DEBUG Reading scorefile
pgscatalog_utils.match.read: 2023-02-16 15:00:24 DEBUG --chrom parameter not set, using all variants in scoring file
pgscatalog_utils.match.preprocess: 2023-02-16 15:00:24 DEBUG Complementing column effect_allele
pgscatalog_utils.match.preprocess: 2023-02-16 15:00:24 DEBUG Complementing column other_allele
pgscatalog_utils.match.combine_matches: 2023-02-16 15:00:24 DEBUG Reading matches
pgscatalog_utils.match.combine_matches: 2023-02-16 15:00:24 DEBUG Labelling match candidates
pgscatalog_utils.match.label: 2023-02-16 15:00:24 DEBUG Labelling best match type (refalt > altref > ...)
pgscatalog_utils.match.label: 2023-02-16 15:00:24 DEBUG Labelling duplicated best match: keeping first instance as best_match = True
pgscatalog_utils.match.label: 2023-02-16 15:00:24 DEBUG Labelling multiple scoring file lines (accession/row_nr) that best_match to the same variant
pgscatalog_utils.match.label: 2023-02-16 15:00:24 DEBUG Labelling all duplicates with exclude flag
pgscatalog_utils.match.label: 2023-02-16 15:00:24 DEBUG Labelling ambiguous variants
pgscatalog_utils.match.preprocess: 2023-02-16 15:00:24 DEBUG Complementing column REF
pgscatalog_utils.match.label: 2023-02-16 15:00:24 DEBUG Labelling ambiguous variants with exclude flag
pgscatalog_utils.match.label: 2023-02-16 15:00:24 DEBUG Labelling multiallelic matches with exclude flag
pgscatalog_utils.match.label: 2023-02-16 15:00:24 DEBUG Not excluding flipped matches
pgscatalog_utils.match.filter: 2023-02-16 15:00:25 DEBUG Filtering to best_match variants (with exclude flag = False)
pgscatalog_utils.match.filter: 2023-02-16 15:00:25 DEBUG Calculating overlap between target genome and scoring file
pgscatalog_utils.match.filter: 2023-02-16 15:00:28 ERROR Score PGS000027_hmPOS_GRCh38 fails minimum matching threshold (10.24% variants match)
pgscatalog_utils.match.match_variants: 2023-02-16 15:00:28 CRITICAL Error: no target variants match any variants in scoring files
Traceback (most recent call last):
  File "/venv/bin/combine_matches", line 8, in <module>
    sys.exit(combine_matches())
  File "/venv/lib/python3.10/site-packages/pgscatalog_utils/match/combine_matches.py", line 36, in combine_matches
    log_and_write(matches=matches, scorefile=scorefile, dataset=dataset, args=args)
  File "/venv/lib/python3.10/site-packages/pgscatalog_utils/match/match_variants.py", line 90, in log_and_write
    raise Exception("No valid matches found")
Exception: No valid matches found

Work dir: /Users/**/Documents/*****/work/7d/c7d7fe7745242f17be911429d993ab

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

ERROR: No scores calculated!

What can I do?

Thanks in advance!

nebfield commented 1 year ago

Hello,

The PGS scoring file for PGS000027 contains about 2.1 million variants across the entire genome. To calculate polygenic scores accurately, as described by the polygenic score authors, it's important that we only calculate scores using a similar number of variants.

By default, a score is not calculated unless at least 75% of the variants in the scoring file are present in the input target genomes (this threshold can be adjusted with --min_overlap, but lowering it is normally a bad idea).
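To make the threshold concrete, here is a minimal sketch of the overlap check in Python. It is not the pgscatalog_utils implementation: the function names `overlap_fraction` and `passes_min_overlap` are hypothetical, the inclusive comparison is an assumption, and the variant counts are illustrative numbers chosen only to reproduce the 10.24% reported in the log.

```python
def overlap_fraction(n_matched: int, n_scorefile: int) -> float:
    """Fraction of scoring file variants that were matched in the target genomes."""
    if n_scorefile <= 0:
        raise ValueError("the scoring file must contain at least one variant")
    return n_matched / n_scorefile


def passes_min_overlap(n_matched: int, n_scorefile: int, min_overlap: float = 0.75) -> bool:
    """True when the match rate reaches the threshold (inclusive here, by assumption)."""
    return overlap_fraction(n_matched, n_scorefile) >= min_overlap


if __name__ == "__main__":
    # Illustrative counts only: roughly 2.1 million variants in the scoring file,
    # with a matched count picked so that the rate equals the 10.24% in the log.
    total_variants = 2_100_000
    matched_variants = 215_040

    rate = overlap_fraction(matched_variants, total_variants)
    print(f"match rate: {rate:.2%}")
    print(f"passes 0.75 threshold: {passes_min_overlap(matched_variants, total_variants)}")
```

With a genome-wide scoring file, a few chromosomes can only ever cover a small share of the variants, so the check fails until the remaining chromosomes are supplied.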

There are a few technical reasons why a scoring file might match badly, like:

But a 10% match rate on 1 chromosome is quite good! I think if you try rerunning the workflow using all of your chromosomes the error should hopefully fix itself 😁 It's important to set up the split chromosomes in a single samplesheet (one row per chromosome).
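As a rough illustration of "one row per chromosome", the sketch below writes such a samplesheet from Python. The column names (`sampleset`, `path_prefix`, `chrom`, `format`) are an assumption and differ between pgsc_calc versions, so check the samplesheet documentation for the release you are running. The helper `build_samplesheet` is hypothetical, and the `chr{chrom}_axy` prefix simply mirrors the `--out chr21_axy` naming used earlier in this thread.

```python
import csv
import io


def build_samplesheet(sampleset, prefix_template, chromosomes, fmt="pfile"):
    """Return CSV text with one row per chromosome.

    The column names are assumed, not taken from this thread; adjust them
    to match the samplesheet format of your pgsc_calc version.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sampleset", "path_prefix", "chrom", "format"])
    for chrom in chromosomes:
        # Each chromosome gets its own row pointing at its own file prefix.
        writer.writerow([sampleset, prefix_template.format(chrom=chrom), chrom, fmt])
    return buffer.getvalue()


if __name__ == "__main__":
    autosomes = [str(number) for number in range(1, 23)]
    print(build_samplesheet("cineca", "chr{chrom}_axy", autosomes), end="")
```

Keeping every chromosome under the same sampleset name in a single samplesheet lets the matches from all chromosomes count toward the same overlap calculation.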

Cheers, Ben

gmmhe commented 1 year ago

Thank you so much Ben, when I used all the chromosomes it worked!