bioinformaticaomicalabs commented 2 months ago

Description of the bug

Hello, and first of all, congratulations on creating such a comprehensive tool as pgsc_calc. While setting up an analysis I ran into an error that I don't know how to solve or where it comes from. I am computing polygenic risk scores for individual samples with WGS data, starting with the PGS000119 model. I followed all the steps, including those described in https://github.com/PGScatalog/pgsc_calc/discussions/123 for preparing WGS data, but the run fails at the MATCH_COMBINE step. These are the steps I followed to prepare the data:

gVCF generation for PGS000119 specific positions (included in ejemplo_coordenadas.bed file)

/home/jc-server/BIOINFORMATICA/SOFTWARE/gatk/gatk --java-options "-Xmx4g" HaplotypeCaller -R /home/jc-server/BIOINFORMATICA/reference_genome/reference_benchmarking_genome_in_a_bottle/human_GRCh38_no_alt_analysis_set.fasta -L ejemplo_coordenadas.bed -I /home/jc-server/BIOINFORMATICA/AUTOMATIZACION/GENOMA/GERMINAL/OUTPUT/TEST/UDB/mapping/UDB_r_groups.bam -O UDBejemplo.gvcf -ERC BP_RESOLUTION --dbsnp ../00-All-chr.vcf.gz

VCF generation

/home/jc-server/BIOINFORMATICA/SOFTWARE/gatk/gatk --java-options "-Xmx4g" GenotypeGVCFs -R /home/jc-server/BIOINFORMATICA/reference_genome/reference_benchmarking_genome_in_a_bottle/human_GRCh38_no_alt_analysis_set.fasta -V UDBejemplo.gvcf --dbsnp ../00-All-chr.vcf.gz --include-non-variant-sites true -O UDBejemplo.vcf

chr removal from vcf

sed 's/chr\([0-9XYM]\)/\1/g' UDBejemplo.vcf > vcf/UDBejemplo-nochr.vcf

Command used and terminal output

nextflow run pgscatalog/pgscalc -profile docker --input samplesheet.csv --pgs_id PGS000119 --target_build GRCh38

 N E X T F L O W   ~  version 24.04.4

Launching `https://github.com/pgscatalog/pgscalc` [nauseous_brahmagupta] DSL2 - revision: 9bd9c431e7 [main]

------------------------------------------------------
  pgscatalog/pgsc_calc v2.0.0-beta.3-g9bd9c43
------------------------------------------------------
Core Nextflow options
  revision       : main
  runName        : nauseous_brahmagupta
  containerEngine: docker
  launchDir      : /home/jc-server/BIOINFORMATICA/Poligenic_Risk_Score/ejemplo_sencillo
  workDir        : /home/jc-server/BIOINFORMATICA/Poligenic_Risk_Score/ejemplo_sencillo/work
  projectDir     : /home/jc-server/.nextflow/assets/pgscatalog/pgscalc
  userName       : jc-server
  profile        : docker
  configFiles    : 

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use pgscatalog/pgsc_calc for your analysis please cite:

* The Polygenic Score Catalog
  https://doi.org/10.1101/2024.05.29.24307783
  https://doi.org/10.1038/s41588-021-00783-5

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/pgscatalog/pgsc_calc/blob/main/CITATIONS.md

executor >  local (4)
[75/78d8a4] process > PGSCATALOG_PGSCCALC:PGSCCALC:DOWNLOAD_SCOREFILES ([pgs_id:PGS000119, pgp_id:, trait_efo:]) [100%] 1 of 1 ✔
[73/65d587] process > PGSCATALOG_PGSCCALC:PGSCCALC:INPUT_CHECK:COMBINE_SCOREFILES (1)                            [100%] 1 of 1 ✔
[-        ] process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELBIM                             -
[-        ] process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR                            -
[skipped  ] process > PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_VCF (UDB chromosome ALL)               [100%] 1 of 1, stored: 1 ✔
[f9/c9761c] process > PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS (UDB chromosome ALL)                     [100%] 1 of 1 ✔
[fa/63742b] process > PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_COMBINE (UDB)                                     [100%] 1 of 1, failed: 1 ✘
[-        ] process > PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:PLINK2_SCORE                                      -
[-        ] process > PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE                                   -
[-        ] process > PGSCATALOG_PGSCCALC:PGSCCALC:REPORT:SCORE_REPORT                                           -
[-        ] process > PGSCATALOG_PGSCCALC:PGSCCALC:DUMPSOFTWAREVERSIONS                                          -
Execution cancelled -- Finishing pending tasks before exit
-[pgscatalog/pgsc_calc] Pipeline completed with errors-
ERROR ~ Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_COMBINE (UDB)'

Caused by:
  Process `PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_COMBINE (UDB)` terminated with an error exit status (15)

Command executed:

  export POLARS_MAX_THREADS=2

  pgscatalog-matchmerge                          --dataset UDB             --scorefile scorefiles.txt.gz             --matches *.ipc.zst             --min_overlap 0.75                                       --outdir $PWD                          --combined             -v

  cat <<-END_VERSIONS > versions.yml
  MATCH_COMBINE:
      pgscatalog.match: $(echo $(python -c 'import pgscatalog.match; print(pgscatalog.match.__version__)'))
  END_VERSIONS

Command exit status:
  15

Command output:
  (empty)

Command error:
  pgscatalog.match.cli.merge_cli: 2024-10-03 12:51:43 DEBUG    Verbose logging enabled
  pgscatalog.match.cli.merge_cli: 2024-10-03 12:51:43 INFO     --cleanup set (default), temporary files will be deleted
  pgscatalog.match.lib.scoringfileframe: 2024-10-03 12:51:43 DEBUG    Converting ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) to feather format
  pgscatalog.match.lib.scoringfileframe: 2024-10-03 12:51:43 DEBUG    ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) feather conversion complete
  pgscatalog.match.lib._match.preprocess: 2024-10-03 12:51:43 DEBUG    Complementing column effect_allele
  pgscatalog.match.lib._match.preprocess: 2024-10-03 12:51:43 DEBUG    Complementing column other_allele
  pgscatalog.match.lib._match.label: 2024-10-03 12:51:43 DEBUG    Labelling best match type (refalt > altref > ...)
  pgscatalog.match.lib._match.label: 2024-10-03 12:51:43 DEBUG    Labelling duplicated best match: keeping first instance as best_match = True
  pgscatalog.match.lib._match.label: 2024-10-03 12:51:43 DEBUG    Labelling multiple scoring file lines (accession/row_nr) that best_match to the same variant
  pgscatalog.match.lib._match.label: 2024-10-03 12:51:43 DEBUG    Labelling all duplicates with exclude flag
  pgscatalog.match.lib._match.label: 2024-10-03 12:51:43 DEBUG    Labelling ambiguous variants
  pgscatalog.match.lib._match.preprocess: 2024-10-03 12:51:43 DEBUG    Complementing column REF
  pgscatalog.match.lib._match.label: 2024-10-03 12:51:43 DEBUG    Labelling ambiguous variants with exclude flag
  pgscatalog.match.lib._match.label: 2024-10-03 12:51:43 DEBUG    Labelling multiallelic matches with exclude flag
  pgscatalog.match.lib._match.label: 2024-10-03 12:51:43 DEBUG    Not excluding flipped matches
  pgscatalog.match.lib._match.label: 2024-10-03 12:51:43 INFO     --filter_IDs not set, skipping filtering
  pgscatalog.match.lib._match.filter: 2024-10-03 12:51:43 DEBUG    Filtering to best_match variants (with exclude flag = False)
  pgscatalog.match.lib._match.filter: 2024-10-03 12:51:43 DEBUG    Calculating overlap between target genome and scoring file
  pgscatalog.match.lib._match.filter: 2024-10-03 12:51:43 ERROR    Score PGS000119_hmPOS_GRCh38 fails minimum matching threshold (40.62% variants match)
  pgscatalog.match.lib._match.log: 2024-10-03 12:51:43 DEBUG    Aggregating best matches into a summary table
  pgscatalog.match.lib._match.plink: 2024-10-03 12:51:43 INFO     No variants with effect_type=EffectType.ADDITIVE, skipping deduplication
  pgscatalog.match.lib._match.plink: 2024-10-03 12:51:43 INFO     No variants with effect_type=EffectType.DOMINANT, skipping deduplication
  pgscatalog.match.lib._match.plink: 2024-10-03 12:51:43 INFO     No variants with effect_type=EffectType.RECESSIVE, skipping deduplication
  pgscatalog.match.lib.matchresult: 2024-10-03 12:51:43 WARNING  Score PGS000119_hmPOS_GRCh38 matching failed with match rate 0.40625
  Traceback (most recent call last):
    File "/app/pgscatalog.utils/.venv/bin/pgscatalog-matchmerge", line 8, in <module>
      sys.exit(run_merge())
               ^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/merge_cli.py", line 70, in run_merge
      matchdf = write_matches(matchresults=matchresults, score_df=score_df)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/_write.py", line 33, in write_matches
      _ = matchresults.write_scorefiles(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/matchresult.py", line 329, in write_scorefiles
      raise ZeroMatchesError(
  pgscatalog.core.lib.pgsexceptions.ZeroMatchesError: All scores fail to meet match threshold 0.75

Work dir:
  /home/jc-server/BIOINFORMATICA/Poligenic_Risk_Score/ejemplo_sencillo/work/fa/63742bfc60f27f469e2a08f5abc444

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

Relevant files

I attach the generated VCF (UDBejemplo-nochr.zip). As far as I understand, its format is correct for the analysis.

System information

Nextflow version: 24.04.4.5917
Hardware: desktop
Executor: local
Container: docker
OS: Linux
pgsc_calc: 9bd9c431e7 (main)

smlmbrt commented 2 months ago

The score didn't meet the default variant overlap threshold:

  pgscatalog.match.lib.matchresult: 2024-10-03 12:51:43 WARNING  Score PGS000119_hmPOS_GRCh38 matching failed with match rate 0.40625
  pgscatalog.core.lib.pgsexceptions.ZeroMatchesError: All scores fail to meet match threshold 0.75

Since this is WGS data, you either need to add the homozygous REF sites from the scoring file back into the VCF, or run with --min_overlap 0. Note that these two approaches will give different results, and each has implications.
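The failure in the log is a threshold comparison: a match rate of 0.40625 against the `--min_overlap 0.75` passed to `pgscatalog-matchmerge`. A minimal shell sketch of that comparison follows; the helper name `passes_overlap` is hypothetical and is not part of pgsc_calc.

```shell
# Hypothetical helper: succeed (exit 0) when match_rate >= min_overlap.
# Usage: passes_overlap <match_rate> <min_overlap>
passes_overlap() {
  awk -v rate="$1" -v min="$2" 'BEGIN { exit !(rate + 0 >= min + 0) }'
}

# Values taken from the log above: match rate 0.40625, threshold 0.75.
if passes_overlap 0.40625 0.75; then
  echo "score kept"
else
  echo "score dropped"
fi
```

With a threshold of 0 the same match rate passes, which is what appending `--min_overlap 0` to the original `nextflow run` command relies on; the unmatched variants are still absent from the score.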

bioinformaticaomicalabs commented 1 month ago

Hello, thanks for the help. I have reviewed my VCF, and it does contain the homozygous-reference positions. However, while checking the processing outputs I noticed that the table in match/_log.csv.gz is missing the IDs for the variants that do not match in my samples. Is it possible that the variants are not being parsed correctly? log.csv

nebfield commented 1 month ago

The variant IDs are missing from the log because no matching variant is found in your target genomes.

For example, the unmatched variant in PGS000119 at chromosome 16 / position 89919709 has effect allele T and other allele C.

If you want this variant to match your target genomes, then the corresponding variant in your VCF should have REF T & ALT C or REF C & ALT T. If your data aren't formatted this way then it's less likely that a matching variant will be found.

Some other matching strategies are also applied automatically (we describe them in a supplement), but the approach described above is the best and simplest method.
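The REF/ALT rule described above can be checked directly on a VCF. The sketch below is only an illustration of that rule, not the matching logic pgsc_calc itself uses: `site_matches` and `count_missing_alt` are hypothetical helpers, and the record in the demo is a toy line, not taken from the attached VCF. It assumes an uncompressed, tab-separated VCF in which non-variant sites carry `.` in the ALT column.

```shell
# Hypothetical helper: print "match" when the VCF has a record at CHROM:POS whose
# REF/ALT pair equals the scoring-file alleles in either orientation.
# Usage: site_matches <vcf> <chrom> <pos> <effect_allele> <other_allele>
site_matches() {
  awk -F'\t' -v chrom="$2" -v pos="$3" -v ea="$4" -v oa="$5" '
    /^#/ { next }                      # skip meta and column header lines
    $1 == chrom && $2 == pos {
      if (($4 == ea && $5 == oa) || ($4 == oa && $5 == ea)) found = 1
    }
    END { print (found ? "match" : "no match") }
  ' "$1"
}

# Hypothetical helper: count records whose ALT column is "." (no ALT allele).
count_missing_alt() {
  awk -F'\t' '!/^#/ && $5 == "." { n++ } END { print n + 0 }' "$1"
}

# Toy demo: a homozygous-reference record with ALT "." at the position
# discussed above (effect allele T, other allele C).
tmp=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmp"
printf '16\t89919709\t.\tC\t.\t.\t.\t.\n' >> "$tmp"
site_matches "$tmp" 16 89919709 T C   # prints "no match" because ALT is "."
count_missing_alt "$tmp"              # prints 1
rm -f "$tmp"
```

A record that is homozygous for the reference but has no ALT allele fails the check even though the position is present, which is consistent with a VCF that "contains the positions" still producing a low match rate.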

bioinformaticaomicalabs commented 1 month ago

Thank you for the information. With it I managed to write an R script that fills in the missing ALT alleles in the VCF based on the scoring file. Thank you!

[!CAUTION] Note from PGS Catalog team: we have not tested this method.

I share the code in case it can help someone:

library(data.table)

# Input VCF and harmonized PGS Catalog scoring file
vcf_file = "UDBsencillo.vcf"
model_file = "PGS000119_hmPOS_GRCh38.txt"

# Keep the "##" meta-information lines so the output is still a valid VCF
vcf_meta = grep("^##", readLines(vcf_file), value = TRUE)

# Read the VCF records, starting at the "#CHROM" column header line
vcf = fread(vcf_file, header = TRUE, skip = "#CHROM")

# Read the scoring file, dropping the "#" comment lines at the top
model_lines = readLines(model_file)
model = fread(text = model_lines[!startsWith(model_lines, "#")], header = TRUE)

# Fill in ALT for records that have no ALT allele ("."), using the scoring file
process_vcf_model = function(vcf, model) {
  vcf_chrom = as.character(vcf[["#CHROM"]])
  for (i in seq_len(nrow(model))) {
    effect_allele = model$effect_allele[i]
    # Candidate other alleles (e.g. "C/G" -> "C", "G"), excluding the effect allele
    inferred = strsplit(as.character(model$hm_inferOtherAllele[i]), "/")[[1]]
    other_alleles = setdiff(inferred, c(effect_allele, NA, ""))

    # Records on the same chromosome and position that have no ALT allele
    rows = which(vcf_chrom == as.character(model$hm_chr[i]) &
                   vcf$POS == model$hm_pos[i] & vcf$ALT == ".")

    for (r in rows) {
      if (vcf$REF[r] != effect_allele) {
        # REF is the non-effect allele, so the effect allele becomes ALT
        vcf$ALT[r] = effect_allele
      } else if (length(other_alleles) > 0) {
        # REF is the effect allele, so use the first inferred other allele as ALT
        vcf$ALT[r] = other_alleles[1]
      }
    }
  }
  return(vcf)
}

# Process the VCF file
vcf_modified = process_vcf_model(vcf, model)

# Save the modified VCF: meta-information lines first, then the records
writeLines(vcf_meta, "complete.vcf")
fwrite(vcf_modified, "complete.vcf", sep = "\t", quote = FALSE,
       append = TRUE, col.names = TRUE)
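The VCF specification requires a `##fileformat` line as the first line of the file, so it may be worth confirming that a VCF rewritten by a script like this still has its header before passing it to the pipeline. A minimal sketch, in which `has_vcf_header` is a hypothetical helper and the demo file is a toy example:

```shell
# Hypothetical helper: succeed when the first line is a ##fileformat line
# and the file also contains a #CHROM column header line.
# Usage: has_vcf_header <vcf>
has_vcf_header() {
  head -n 1 "$1" | grep -q '^##fileformat=VCF' && grep -q '^#CHROM' "$1"
}

# Toy demo: a table written without the "##" meta-information lines.
tmp=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmp"
printf '16\t89919709\t.\tC\tT\t.\t.\t.\n' >> "$tmp"
if has_vcf_header "$tmp"; then
  echo "header ok"
else
  echo "missing VCF header"
fi
rm -f "$tmp"
```

The same check can be pointed at the real output with `has_vcf_header complete.vcf`.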

PGScatalog / pgsc_calc

Problems with MATCH_COMBINE step #380