PGScatalog / pgscatalog_utils

(superseded by pygscatalog) Utilities for working with PGS Catalog API and scoring files
Apache License 2.0

MATCH_COMBINE assertion error when match dataframe is empty #60

Closed nebfield closed 8 months ago

nebfield commented 10 months ago
I have the same error and I'm wondering if it has to do with multiallelic variants. In my original bgen I have many multiallelic variants, for example:

alternate_ids rsid chromosome position number_of_alleles first_allele alternative_alleles
. 21:10968913_G/A 21 10968913 2 A G
. 21:10968913_G/C 21 10968913 2 C G
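A quick way to see how many positions are split across several biallelic records is to group the variant list by chromosome and position. This is only a minimal sketch, not part of pgscatalog_utils: the `rows` list copies the two example records above, and `find_split_multiallelic` is a hypothetical helper name.

```python
from collections import defaultdict

# Example records copied from the listing above:
# (rsid, chromosome, position, first_allele, alternative_alleles)
rows = [
    ("21:10968913_G/A", "21", 10968913, "A", "G"),
    ("21:10968913_G/C", "21", 10968913, "C", "G"),
]


def find_split_multiallelic(records):
    """Group records by (chromosome, position) and return sites listed more than once."""
    by_site = defaultdict(list)
    for rsid, chrom, pos, _first, _alt in records:
        by_site[(chrom, pos)].append(rsid)
    return {site: ids for site, ids in by_site.items() if len(ids) > 1}


split_sites = find_split_multiallelic(rows)
for (chrom, pos), ids in split_sites.items():
    print(f"{chrom}:{pos} appears {len(ids)} times: {', '.join(ids)}")
```

For a real dataset the records would have to be read from the variant listing of the bgen or pgen file first; that step is left out here.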

I'm just using one chunked pgen in my samplesheet to test:

sampleset,path_prefix,chrom,format
test,/home/bwolford/archive/pgen/h234_hrc_chr21_chunk1,21,pfile

I tried the --keep_multiallelic option but I get the same error.

nextflow run pgscatalog/pgscalc     -profile conda     --input samplesheet.csv --pgs_id PGS000752 --target_build GRCh38 --chrom 21 --keep_multiallelic
ERROR ~ Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (test)'

Caused by:
  Process `PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (test)` terminated with an error exit status (1)

Command executed:

  export POLARS_MAX_THREADS=2

  combine_matches                  --dataset test         --scorefile scorefiles.txt.gz         --matches *.ipc.zst         -n 2         --min_overlap 0.75                  --keep_multiallelic                  --outdir $PWD         --split                  -v

  cat <<-END_VERSIONS > versions.yml
  MATCH_COMBINE:
      pgscatalog_utils: $(echo $(python -c 'import pgscatalog_utils; print(pgscatalog_utils.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  root: 2023-10-22 22:02:53 DEBUG    Verbose logging enabled
  pgscatalog_utils.config: 2023-10-22 22:02:53 DEBUG    Using 2 threads to read CSVs
  pgscatalog_utils.config: 2023-10-22 22:02:53 DEBUG    polars threadpool size: 2
  pgscatalog_utils.match.read: 2023-10-22 22:02:53 DEBUG    Reading scorefile
  pgscatalog_utils.match.read: 2023-10-22 22:02:53 DEBUG    --chrom parameter not set, using all variants in scoring file
  pgscatalog_utils.match.preprocess: 2023-10-22 22:02:53 DEBUG    Complementing column effect_allele
  pgscatalog_utils.match.preprocess: 2023-10-22 22:02:53 DEBUG    Complementing column other_allele
  pgscatalog_utils.match.combine_matches: 2023-10-22 22:02:53 DEBUG    Reading matches
  pgscatalog_utils.match.combine_matches: 2023-10-22 22:02:53 DEBUG    Labelling match candidates
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling best match type (refalt > altref > ...)
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling duplicated best match: keeping first instance as best_match = True
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling multiple scoring file lines (accession/row_nr) that best_match to the same variant
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling all duplicates with exclude flag
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling ambiguous variants
  pgscatalog_utils.match.preprocess: 2023-10-22 22:02:53 DEBUG    Complementing column REF
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling ambiguous variants with exclude flag
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Not excluding multiallelic variants
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Not excluding flipped matches
  Traceback (most recent call last):
    File "/home/bwolford/pgs_calc/work/conda/pgscatalog_utils-b4f3f611180e4ff75ddd463e7ba86339/bin/combine_matches", line 8, in <module>
      sys.exit(combine_matches())
    File "/home/bwolford/pgs_calc/work/conda/pgscatalog_utils-b4f3f611180e4ff75ddd463e7ba86339/lib/python3.10/site-packages/pgscatalog_utils/match/combine_matches.py", line 37, in combine_matches
      _check_duplicate_vars(matches)
    File "/home/bwolford/pgs_calc/work/conda/pgscatalog_utils-b4f3f611180e4ff75ddd463e7ba86339/lib/python3.10/site-packages/pgscatalog_utils/match/combine_matches.py", line 52, in _check_duplicate_vars
      assert max_occurrence == [1], "Duplicate IDs in final matches"
  AssertionError: Duplicate IDs in final matches

Work dir:
  /home/bwolford/pgs_calc/work/cd/da3fd357a9ab8d0b9d74c011c291ed

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details
ERROR ~ ERROR: Matching subworkflow failed

 -- Check '.nextflow.log' file for details
ERROR ~ ERROR: No results report written!

 -- Check '.nextflow.log' file for details
ERROR ~ ERROR: No scores calculated!

 -- Check '.nextflow.log' file for details
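The assertion in the traceback compares `max_occurrence` with `[1]`. The sketch below imitates that kind of check in plain Python to show which inputs would fail it. It is not the pgscatalog_utils implementation: `max_id_occurrence` is a hypothetical helper, the variant IDs are made up, and treating an empty input as an empty list is an assumption made for this illustration.

```python
from collections import Counter


def max_id_occurrence(ids):
    """Return the highest number of times any ID occurs, wrapped in a list.

    An empty input yields an empty list (an assumption for this sketch).
    """
    counts = Counter(ids)
    return [max(counts.values())] if counts else []


# Made-up variant IDs, for illustration only
unique_ids = ["21:10968913:G:A", "21:10968999:T:C"]
duplicated_ids = ["21:10968913:G:A", "21:10968913:G:A"]
no_ids = []

for label, ids in [("unique", unique_ids), ("duplicated", duplicated_ids), ("empty", no_ids)]:
    passes = max_id_occurrence(ids) == [1]
    print(f"{label}: check passes = {passes}")
```

Under these assumptions both the duplicated and the empty case fail an `== [1]` comparison, so the "Duplicate IDs in final matches" message alone would not tell them apart. That would be consistent with the title of this issue.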

_Originally posted by @bnwolford in https://github.com/PGScatalog/pgsc_calc/issues/72#issuecomment-1774187482_

nebfield commented 10 months ago

probably linked to #52

ElixBaSe commented 10 months ago

Hello, I'm trying to calculate a custom score and I get a similar error. The part of the log that catches my attention is:

pgscatalog_utils.match.read: 2023-10-23 15:26:08 DEBUG --chrom parameter not set, using all variants in scoring file

I'm not sure what it means, and I can't find any information about it in the documentation. I tried specifying the --chrome parameter in my script, but then it is reported as an unexpected parameter.

This is an example of my scorefile:

format_version=2.0

pgs_name=DIA_HIS_T2D

trait_reported=Type 2 diabetes

genome_build=GRCh38

chr_name chr_position effect_allele other_allele effect_weight
1 20729451 G C 0.018
1 39870793 T C 0.041
1 46358862 G A 0.008

This is an example of my samplesheet.csv file:

sampleset,path_prefix,chrom,format
MCPS,data/genetics_regeneron/freeze_150k/data/imputation/oxford_qcd/per_chromosome/pgen_hds/mcps-freeze150k_qcd_chr1,1,pfile
MCPS,/data/genetics_regeneron/freeze_150k/data/imputation/oxford_qcd/per_chromosome/pgen_hds/mcps-freeze150k_qcd_chr2,2,pfile
MCPS,/data/genetics_regeneron/freeze_150k/data/imputation/oxford_qcd/per_chromosome/pgen_hds/mcps-freeze150k_qcd_chr3,3,pfile
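In the samplesheet as pasted, the `path_prefix` for chromosome 1 has no leading slash while the other rows do. That may just be a copy-paste slip, but a small check like the one below can flag relative prefixes. It is a sketch only: the paths are shortened, hypothetical stand-ins for the real ones, and `relative_prefixes` is a hypothetical helper, not something pgsc_calc provides.

```python
import csv
import io
from pathlib import PurePosixPath

# Shortened, hypothetical paths that mirror the pattern above
# (one relative prefix, the rest absolute).
samplesheet_text = """sampleset,path_prefix,chrom,format
MCPS,data/pgen_hds/mcps_chr1,1,pfile
MCPS,/data/pgen_hds/mcps_chr2,2,pfile
MCPS,/data/pgen_hds/mcps_chr3,3,pfile
"""


def relative_prefixes(text):
    """Return (chrom, path_prefix) for rows whose path_prefix is not absolute."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (row["chrom"], row["path_prefix"])
        for row in reader
        if not PurePosixPath(row["path_prefix"]).is_absolute()
    ]


flagged = relative_prefixes(samplesheet_text)
for chrom, prefix in flagged:
    print(f"chromosome {chrom}: relative path_prefix {prefix!r}")
```

A relative prefix is not necessarily wrong, since it may resolve correctly from the directory the pipeline is launched in. The check only highlights rows that differ from the rest.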

These are the options that I'm using in my script:

echo start
cd "$1"
echo $PWD

source "$2"
echo environment from "$2"

module load Anaconda3/2022.05 PLINK/2.00a2.3_x86_64 Python/3.10.4-GCCcore-11.3.0 Java/11.0.2 R/4.2.1-foss-2022a yaml-cpp/0.7.0-GCCcore-11.3.0
pip install pyyaml

# Loop over the PGS names passed as the third and later arguments
for pgs_name in "${@:3}"; do
    echo $pgs_name start computation
    ./nextflow run pgscatalog/pgsc_calc -profile conda \
        --input sample_sheet.csv \
        --target_build GRCh38 \
        --parallel \
        --outdir PRScalculated/MCPS$pgs_name \
        --scorefile scorefile_338_DIA_HIS.txt

    echo $pgs_name finished
done

echo all PRS in the list computed

# Usage: bash 1.2.run_pgscatalog_custom_scorefile.sh working_directory anaconda_environment PGS_name1 PGS_name2

This is the error:

executor > local (28)
[8b/dd59d6] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:SAMPLESHEET_JSON (sample_sheet.csv) [100%] 1 of 1 ✔
[7d/dc9e54] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:COMBINE_SCOREFILES (1) [100%] 1 of 1 ✔
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELBIM -
[skipped ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (MCPS chromosome 5) [100%] 23 of 23, stored: 23 ✔
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF -
[cc/ff8584] process > PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_VARIANTS (MCPS chromosome 1) [100%] 23 of 23 ✔
[54/c3811f] process > PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (MCPS) [100%] 3 of 3, failed: 3, retries: 2 ✘
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:PLINK2_SCORE -
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:SCORE_AGGREGATE -
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:REPORT:SCORE_REPORT -
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:DUMPSOFTWAREVERSIONS -
[67/413e7f] NOTE: Process `PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (MCPS)` terminated with an error exit status (1) -- Execution is retried (1)
[a5/c006f8] NOTE: Process `PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (MCPS)` terminated with an error exit status (1) -- Execution is retried (2)
ERROR ~ Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (MCPS)'

Caused by:
Process `PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (MCPS)` terminated with an error exit status (1)

Command executed:

export POLARS_MAX_THREADS=2

combine_matches --dataset MCPS --scorefile scorefiles.txt.gz --matches *.ipc.zst -n 2 --min_overlap 0.75 --outdir $PWD --split -v

cat <<-END_VERSIONS > versions.yml
MATCH_COMBINE:
    pgscatalog_utils: $(echo $(python -c 'import pgscatalog_utils; print(pgscatalog_utils.__version__)'))
END_VERSIONS

Command exit status: 1

Command output: (empty)

Command error:
root: 2023-10-23 15:01:45 DEBUG Verbose logging enabled
pgscatalog_utils.config: 2023-10-23 15:01:45 DEBUG Using 2 threads to read CSVs
pgscatalog_utils.config: 2023-10-23 15:01:45 DEBUG polars threadpool size: 2
pgscatalog_utils.match.read: 2023-10-23 15:01:45 DEBUG Reading scorefile
pgscatalog_utils.match.read: 2023-10-23 15:01:45 DEBUG --chrom parameter not set, using all variants in scoring file
pgscatalog_utils.match.preprocess: 2023-10-23 15:01:45 DEBUG Complementing column effect_allele
pgscatalog_utils.match.preprocess: 2023-10-23 15:01:45 DEBUG Complementing column other_allele
pgscatalog_utils.match.combine_matches: 2023-10-23 15:01:45 DEBUG Reading matches
pgscatalog_utils.match.combine_matches: 2023-10-23 15:01:45 DEBUG Labelling match candidates
pgscatalog_utils.match.label: 2023-10-23 15:01:45 DEBUG Labelling best match type (refalt > altref > ...)
pgscatalog_utils.match.label: 2023-10-23 15:01:45 DEBUG Labelling duplicated best match: keeping first instance as best_match = True
pgscatalog_utils.match.label: 2023-10-23 15:01:45 DEBUG Labelling multiple scoring file lines (accession/row_nr) that best_match to the same variant
pgscatalog_utils.match.label: 2023-10-23 15:01:45 DEBUG Labelling all duplicates with exclude flag
pgscatalog_utils.match.label: 2023-10-23 15:01:45 DEBUG Labelling ambiguous variants
pgscatalog_utils.match.preprocess: 2023-10-23 15:01:45 DEBUG Complementing column REF
pgscatalog_utils.match.label: 2023-10-23 15:01:45 DEBUG Labelling ambiguous variants with exclude flag
pgscatalog_utils.match.label: 2023-10-23 15:01:45 DEBUG Labelling multiallelic matches with exclude flag
pgscatalog_utils.match.label: 2023-10-23 15:01:45 DEBUG Not excluding flipped matches
pgscatalog_utils.match.filter: 2023-10-23 15:01:45 DEBUG Filtering to best_match variants (with exclude flag = False)
pgscatalog_utils.match.filter: 2023-10-23 15:01:45 DEBUG Calculating overlap between target genome and scoring file
pgscatalog_utils.match.filter: 2023-10-23 15:01:46 ERROR Score scorefile_338_DIA_HIS fails minimum matching threshold (1.78% variants match)
pgscatalog_utils.match.match_variants: 2023-10-23 15:01:46 CRITICAL Error: no target variants match any variants in scoring files
Traceback (most recent call last):
  File "/gpfs3/well/emberson/users/rgu572/GWAS_Elix/GWAS_Regenie_NoBMI/PRS/env_prs/projectA-skylake/bin/combine_matches", line 8, in <module>
    sys.exit(combine_matches())
  File "/gpfs3/well/emberson/users/rgu572/GWAS_Elix/GWAS_Regenie_NoBMI/PRS/env_prs/projectA-skylake/lib/python3.10/site-packages/pgscatalog_utils/match/combine_matches.py", line 40, in combine_matches
    log_and_write(matches=matches, scorefile=scorefile, dataset=dataset, args=args)
  File "/gpfs3/well/emberson/users/rgu572/GWAS_Elix/GWAS_Regenie_NoBMI/PRS/env_prs/projectA-skylake/lib/python3.10/site-packages/pgscatalog_utils/match/match_variants.py", line 90, in log_and_write
    raise Exception("No valid matches found")
Exception: No valid matches found

Work dir: /gpfs3/well/emberson/users/rgu572/GWAS_Elix/GWAS_Regenie_NoBMI/PRS/work/54/c3811f5c0147be7e6f41910bebde79

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

-- Check '.nextflow.log' file for details
ERROR ~ ERROR: Matching subworkflow failed

-- Check '.nextflow.log' file for details
ERROR ~ ERROR: No results report written!

-- Check '.nextflow.log' file for details
ERROR ~ ERROR: No scores calculated!

-- Check '.nextflow.log' file for details
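The log reports that only 1.78% of the scoring file variants matched, while the command was run with `--min_overlap 0.75`. The sketch below illustrates what such a threshold check amounts to. It is a deliberate simplification and not the pgscatalog_utils logic: it matches on chromosome and position only and ignores alleles, the target positions are hypothetical, and comparing with `>=` is an assumption.

```python
def match_rate(score_variants, target_variants):
    """Fraction of scoring-file variants whose (chrom, pos) is present in the target."""
    if not score_variants:
        return 0.0
    target = set(target_variants)
    matched = sum(1 for variant in score_variants if variant in target)
    return matched / len(score_variants)


# First three positions from the example scorefile above
score_variants = [("1", 20729451), ("1", 39870793), ("1", 46358862)]
# Hypothetical target positions; only the first one overlaps
target_variants = [("1", 20729451), ("1", 11111111)]

MIN_OVERLAP = 0.75  # value passed as --min_overlap in the command above
rate = match_rate(score_variants, target_variants)
print(f"{rate:.2%} of variants match; passes threshold: {rate >= MIN_OVERLAP}")
```

A very low match rate often points to a mismatch between the scoring file and the target data, such as a different genome build or chromosome naming, rather than to the matching step itself.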

I'm hoping that you could provide some guidance or assistance in resolving it. Your help in this matter would be greatly appreciated.

nebfield commented 8 months ago

https://github.com/PGScatalog/pgscatalog_utils/releases/tag/v0.4.3