PGScatalog / PGS_Catalog

An open database of polygenic scores and relevant metadata needed to apply and evaluate them correctly.
Apache License 2.0

INDEL harmonized coordinate inconsistency #388

Open alkaZeltser opened 2 months ago

alkaZeltser commented 2 months ago

Hi PGS Catalog team, we love this resource but we wanted to bring to your attention an issue with INDEL harmonization:

My team and I noticed that the ENSEMBL-harmonized coordinates provided for INDELs in the PGS Catalog are systematically shifted relative to the coordinates assigned to the same variants in our genetic data files. We have GRCh38-based variant calls from both sequencing experiments and microarray genotyping. VCFs from either source report INDELs at positions that are almost always 1bp off from the harmonized PGS Catalog coordinate. We suspect that this results from differing conventions for INDEL reporting: VCF format reports an insertion at the base immediately preceding the inserted bases, whereas the ENSEMBL reference reports it at the start of the insertion itself. Alternatively, it may be due to differing conventions between the ENSEMBL curated reference (one-based coordinates) and the UCSC curated reference (zero-based coordinates): https://useast.ensembl.org/Help/Faq?id=286#:~:text=Ensembl%20uses%20a%20one%2Dbased,genome%20housed%20at%20the%20GRC.

For example:

PGS Catalog coordinate file entry

rs34295433 from PGS000662 (PGS000662_hmPOS_GRCh38.txt.gz)

rsID        chr_name  chr_position  effect_allele  other_allele  hm_source  hm_rsID     hm_chr  hm_pos
rs34295433  1         183032447     CTAAG          C             ENSEMBL    rs34295433  1       183063313

ENSEMBL record (matching):

Chromosome 1:183063313-183063315 (forward strand)|VCF:1 183063313 rs34295433 T TAAAT,TAAGT

dbSNP record (not matching):

[Screenshot of the dbSNP record, taken 2024-09-23; image not reproduced here]

VCF record (aligned to the UCSC-based GRCh38 reference provided by the GATK toolkit) (not matching):

CHROM  POS        ID                     REF  ALT
1      183063312  rs56677963;rs34295433  C    CTAAG

Since most PGS software (e.g. PLINK, pgsc_calc) matches genotype data to PGS coordinate files via CHROM/POS (we're not sure how else one could go about it, other than the discouraged rsID matching), we noticed a systematic failure to match INDELs when using harmonized data. This discrepancy does not seem to be flagged as a warning in the PGS Catalog documentation. There also doesn't seem to be any guidance in the pgsc_calc documentation, and pgsc_calc does not appear to account for a potential mismatch in coordinate systems:

PGS000662_hmPOS_GRCh38,1,183063313,CTAAG,C,0.041323726,,,,,,,,,,,,,,unmatched,my_dataset

I dug through some of the matching source code and did not find any pre-processing that might be accounting for this.

Since pgsc_calc uses harmonized data automatically when original data from the correct reference genome is not available, it would be good to advertise that this may result in all INDEL variants being dropped.

We're curious about the decision to use the ENSEMBL style for harmonization - it seems that a substantial number of PGS coordinate files are submitted by the original authors in non-ENSEMBL format. Particularly odd are the GRCh37 -> GRCh37 harmonized files which seem to primarily just shift INDEL coordinates and no others:

e.g. From PGS000662_hmPOS_GRCh37.txt.gz:

rsID        chr_name  chr_position  effect_allele  other_allele  hm_source  hm_rsID     hm_chr  hm_pos     hm_match_chr  hm_match_pos
rs34295433  1         183032447     CTAAG          C             ENSEMBL    rs34295433  1       183032448  True          False
rs7542260   1         5743196       T              C             ENSEMBL    rs7542260   1       5743196    True          True
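
To get a rough sense of which rows harmonization moved, something like the following could be used. This is only a sketch: it assumes the scoring file is tab-separated with `#`-prefixed header comments, it treats any variant whose two alleles differ in length as an INDEL, and the helper names (`is_indel`, `shifted_rows`) are ours rather than part of any PGS Catalog tool. The two example rows are re-typed inline so the snippet runs on its own.

```python
import csv
import io

# The two example rows from PGS000662_hmPOS_GRCh37, re-typed for illustration.
COLUMNS = ["rsID", "chr_name", "chr_position", "effect_allele", "other_allele",
           "hm_source", "hm_rsID", "hm_chr", "hm_pos", "hm_match_chr", "hm_match_pos"]
ROWS = [
    ["rs34295433", "1", "183032447", "CTAAG", "C", "ENSEMBL",
     "rs34295433", "1", "183032448", "True", "False"],
    ["rs7542260", "1", "5743196", "T", "C", "ENSEMBL",
     "rs7542260", "1", "5743196", "True", "True"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [COLUMNS] + ROWS) + "\n"


def is_indel(effect_allele, other_allele):
    """Assumption: alleles of different length indicate an INDEL."""
    return len(effect_allele) != len(other_allele)


def shifted_rows(lines):
    """Yield (rsID, is_indel, offset) for rows where hm_match_pos is False."""
    # Skip '#' comment lines (assumed header format) before parsing columns.
    data_lines = (line for line in lines if not line.startswith("#"))
    for row in csv.DictReader(data_lines, delimiter="\t"):
        if row["hm_match_pos"] == "False":
            offset = int(row["hm_pos"]) - int(row["chr_position"])
            yield (row["rsID"],
                   is_indel(row["effect_allele"], row["other_allele"]),
                   offset)


flagged = list(shifted_rows(io.StringIO(SAMPLE)))
print(flagged)
```

On the two rows shown, only the INDEL is flagged, with an offset of +1 relative to the author-reported position.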

We were wondering if the PGS Catalog team was aware of this issue, and if you have any advice on how best to approach correcting it. Presumably the options available to users would be:

  1. Adjust our VCFs (not ideal, given they are formatted by an established convention)
  2. Write code to screen PGS Catalog coordinate files for ENSEMBL-harmonized INDELs and shift coordinates prior to matching (unfortunately undermines the benefits of a standardized catalog, and complicates use of pgsc_calc).
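
Option 2 could look roughly like the sketch below. It rests on several assumptions that would need checking against real data: that every ENSEMBL-harmonized INDEL is shifted by exactly +1 relative to the VCF-style position (we only observed "almost always"), that an INDEL can be recognised by its alleles differing in length, and that alleles do not need to be rewritten. The function name `vcf_style_position` is hypothetical and not part of pgsc_calc or any PGS Catalog tool.

```python
def is_indel(effect_allele, other_allele):
    """Assumption: alleles of different length indicate an INDEL."""
    return len(effect_allele) != len(other_allele)


def vcf_style_position(row, offset=-1):
    """Return the position to use for CHROM/POS matching against a VCF.

    ENSEMBL-harmonized INDELs are shifted by `offset` (assumed to be -1);
    every other row keeps its harmonized position unchanged.
    """
    pos = int(row["hm_pos"])
    if row["hm_source"] == "ENSEMBL" and is_indel(row["effect_allele"],
                                                  row["other_allele"]):
        return pos + offset
    return pos


# Example rows taken from the files quoted in this issue.
indel_grch38 = {"rsID": "rs34295433", "effect_allele": "CTAAG",
                "other_allele": "C", "hm_source": "ENSEMBL",
                "hm_pos": "183063313"}
indel_grch37 = {"rsID": "rs34295433", "effect_allele": "CTAAG",
                "other_allele": "C", "hm_source": "ENSEMBL",
                "hm_pos": "183032448"}
snp_grch37 = {"rsID": "rs7542260", "effect_allele": "T",
              "other_allele": "C", "hm_source": "ENSEMBL",
              "hm_pos": "5743196"}

for row in (indel_grch38, indel_grch37, snp_grch37):
    print(row["rsID"], row["hm_pos"], "->", vcf_style_position(row))
```

For the GRCh38 INDEL this gives 183063312, which is the POS in our VCF record; for the GRCh37 INDEL it gives back the author-reported 183032447; the SNP is left untouched. A blanket shift like this would silently mis-place any INDEL that does not follow the 1bp pattern, which is part of why we'd prefer a fix on the Catalog side.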

Ideally we would wish for an additional harmonization field that uses our (commonly used) standard.

Thanks!

RoniHaas commented 2 months ago

Thank you for bringing this up, @alkaZeltser!

@smlmbrt, is there a way to know whether the coordinates in non-harmonized files follow the ENSEMBL or the UCSC convention?

smlmbrt commented 2 months ago

Hi @alkaZeltser, sorry for the delayed reply (annual leave). We are aware that the INDEL reporting is heterogeneous between scores and that the harmonised position may be different because of this. We aim to better handle INDELs in the Catalog and Calculator, so will use this as a motivating example. Just to say we've noted this and will discuss internally and probably present some possible solutions here.