Generate chr_pos_ref_alt variant identifiers using reference assembly

apriltuesday commented 1 year ago

Necessary to get context bases for deletions, also removes need for API calls to Ensembl.

apriltuesday commented 11 months ago

Current idea is to use PGKB's variant table which provides locations (RefSeq chromosome + position) for all rsIDs in PGKB data. For example these are strings like NC_000001.11:46399999 or sometimes NC_000019.10:38499646_38499648.

Then we should be able to get the reference and context bases from the FASTA file, and alternate allele(s) from the genotypes information in the clinical alleles table.

Quick check that parsing variant locations will be straightforward:

variants = read_tsv_to_df('variants.tsv')

# 6643 total
len(variants)

# 6645 non-empty
variants['Location'].count()

# only the 8 empty locations
variants[~variants['Location'].str.match('NC_[0-9.]+:[0-9]+(_[0-9]+)?', na=False)]

# 38 with a range
variants[variants['Location'].str.count('_') > 1].dropna()

apriltuesday commented 11 months ago

BTW, of the 8 variants missing locations, several are deprecated or merged, but a couple are valid current RS that we could get locations for from Ensembl or NCBI.

apriltuesday commented 11 months ago

Should also confirm what we should do for variants like rs10420097 (dbSNP report, PGKB annotation). Should this one have two identifiers corresponding to the two alternate alleles that have annotations, i.e. 19_57633193_A_C and 19_57633193_A_G?

And then do we explode on the identifier? And what should we do about the genotypes?

apriltuesday commented 11 months ago

Note on ranges in locations:

Logically I thought this should mean the indicated range is deleted from the reference, but I guess this was my mistake for trying to apply logic to variant descriptions... Some of them do line up in this way, for some the lengths match up but I don't see how they align to the actual reference, and for others I'm just clueless. Here are some examples:

Variant	Location	Genotype/Allele
rs36120609	NC_000001.11:109737631_109737635	TC/TC
rs36120609	NC_000001.11:109737631_109737635	TC/TCCTC
rs36120609	NC_000001.11:109737631_109737635	TCCTC/TCCTC
rs201279313	NC_000004.12:127735888_127735890	TTA/TTA
rs201279313	NC_000004.12:127735888_127735890	TTA/del
rs201279313	NC_000004.12:127735888_127735890	del/del
rs35068180	NC_000011.10:102845217_102845220	A/del
rs35068180	NC_000011.10:102845217_102845220	AA
rs35068180	NC_000011.10:102845217_102845220	del/del
rs746071566	NC_000013.11:48037783_48037801	GGAGTC/GGAGTC
rs746071566	NC_000013.11:48037783_48037801	GGAGTC/del
rs746071566	NC_000013.11:48037783_48037801	del/del

Thoughts on this @tcezard @M-casado? I've dumped all the variants with range locations in a spreadsheet here in case it's useful.

tcezard commented 11 months ago

I've had a look at some of your example to try to see if there are enough of a pattern to find the algorithm. I'm posting the example for now: rs36120609

REF: 31  35 
      |   |
      TCCTC

G1:TC/TC
      TC---
      TC---
G2:TC/TCCTC
      TC---
      TCCTC
G3:TCCTC/TCCTC
      TCCTC
      TCCTC

rs201279313 is set the start to 127735888 in PharmGKB but dbsnp displays it two bases earlier

REF: 86  90
      |   |
      TTATT

G1:TTA/TTA
      TTATT
      TTATT
G2:TTA/del
      TTATT
      ---TT
G3:del/del
      ---TT
      ---TT

rs35068180 is set the end to 102845220 in PharmGKB but dbsnp displays it finishing on base after also in the second genotype the alleles are not separated by slash

REF: 17  21
      |   |
      AAAAA
G1:A/del
      AAAAA
      AAAA-
G2:AA
      AAAAA
      AAAAA
G3:del/del
      AAAA-
      AAAA-

rs11322783

REF: 1415
      || 
      TT
G1:G/TT
      G-
      TT
G2:GG
      G-
      G-
G3:TT/TT
      TT
      TT

apriltuesday commented 11 months ago

Discussed a bit offline... I'll email PGKB to ask about how they represent the variant location. For now I can implement a (relatively) simple algorithm assuming the location specifies the reference allele, and skip & report if it's not found in the genotypes. This works for the first & last of Tim's examples, but not the other two.

M-casado commented 11 months ago

Took me some time to get a bit familiar with these files and formats, but here's my take on the issue. As a summary, what you plan to do is the following, right?

graph TD
    J[PharmGKB]
    H[FASTA files]
    E[Clinival alleles table]
    A[Variant table]    
    D[Generate 'chr_pos_ref_alt' identifier]
    S[NCBI Genome Assembly]
    J --> A
    J --> E
    S --> H
    A --> |locations 'Chr+Pos'| D
    H --> |Reference + context| D
    E --> |Alternate alleles| D

Re. https://github.com/EBIvariation/opentargets-pharmgkb/issues/5#issuecomment-1670894547

I parsed as well the variants.tsv file, and I think the numbers increased a bit since your last parsing (or the version you got), I think in the count there were some mixed numbers (6645 non-empty and 6643 total) 😅:

import pandas as pd
df = pd.read_table('variants.tsv')
total_rows = len(df)
print("TOTAL:\t", total_rows)
non_empty_counts = df.count()
columns_with_missing_values = non_empty_counts[non_empty_counts < total_rows]
print(columns_with_missing_values)

TOTAL:   6797
Gene IDs        6151
Gene Symbols    6151
Location        6788

The only change is that there's one more row with missing location.

Re. https://github.com/EBIvariation/opentargets-pharmgkb/issues/5#issuecomment-1673397235

Correct me if I'm wrong, but given the way that we treat evidence strings normally (i.e. separating by variant), I think exploding by the `chr_pos_ref_alt` identifier would make sense. Especially if we have different clinical annotations for each, although I doubt it. For the example you gave (`rs10420097`), the clinical count seems to be `1`:	Variant ID	Variant Name	Gene IDs	Gene Symbols	Location	Variant Annotation count	Clinical Annotation count	Level 1/2 Clinical Annotation count	Guideline Annotation count	Label Annotation count	Synonyms
PA166176938	rs10420097	PA37582	ZNF211	NC_000019.10:57633193	1	1	0	0	0	NC_000019.10:g.57633193A>C, NC_000019.10:g.57633193A>C, ...

With regards to the rest of the evidence strings, we don't explode by genotype, do we?

Re. https://github.com/EBIvariation/opentargets-pharmgkb/issues/5#issuecomment-1680266891

I checked a few examples like Tim and so far:

rs11322783 - Interpretable range

REF: 14 15
  |  | 
  T  T
G1:G/TT
  G-
  TT
G2:GG
  G-
  G-
G3:TT/TT
  TT
  TT

rs746071566 - Interpretable range

chr13:48037783-48037801 (GRCh38.p14)
REF: 782 783                801
   |   |                |
   G   GAGTCGGAGTCGGAGTCG
A1: delGAGTCG
   G   GAGTCGGAGTCG------
A2: dupGAGTCG
 782 783                 801   807
   |   |                 |     |
   G   GGAGTCGGAGTCGGAGTCGGAGTCG
A3: dup(GAGTCG)₂
 782 783                 801   807   813
   |   |                 |     |     |
   G   GGAGTCGGAGTCGGAGTCGGAGTCGGAGTCG

rs35068180 - Interpretable range

chr11:102845217-102845221
REF: ..217 ..221
     |     |
     A AAA A
G1:delA
 ..217 ..221
     |     |
     A AAA -
G2:dupA
 ..217   ..222
     |       |
     A AAA A A
G3:dupAA
 ..217    ..223
     |        |
     A AAA A AA

Regarding what @tcezard said:

rs201279313 is set the start to 127735888 in PharmGKB but dbsnp displays it two bases earlier

The reason in this case is that the sequence is palindromic, and both PGKB and dbSNP are correct (i.e. resulting sequence change is TTATT --> TT):

chr4:127735886-127735890
REF: 86    90
      |     |
      T TAT T

- dbSNP: delATT
     86    90
      |     |
      T T-- -

- PGKB: delTTA
     86    90
      |     |
      - --T T

It has to do with where they took these variants from. In the variant info for this rsID (PA166156703), you can see that the source is PGKB or dbSNP:

rs35068180 is set the end to 102845220 in PharmGKB but dbsnp displays it finishing on base after also in the second genotype the alleles are not separated by slash

I haven't checked it in detail, but seeing how it is all Adenines, I assume it's a similar case.

tcezard commented 11 months ago

Great find @M-casado. That makes sense. rs201279313 is defined as TTA > - in GRCh37 by PharmGKB and this is the value that is being provided in clinical ann alleles.tsv table It is then combine with the position of the variants.tsv table which provide the GRCh38 position.

That's why there is a disconnect between the position and the alleles found. I'm wondering if there is a way to detect if the allele has been define in a different assembly.

apriltuesday commented 11 months ago

Thanks Marcos, taking your points in order:

The diagram is beautiful :art: and exactly the approach I was taking. I might steal it for the readme actually...
You're totally right that the data I'm using is out of date, it's actually from September 2022 (which gives you an idea of how much the data is changing in a year...)
We are currently exploding the evidence by genotype, there are not multiple clinical annotation IDs per genotype but there are different phenotype strings (i.e. these) that contain relevant information (which we're not parsing, but OT might). Tim has also pointed out that the variant identifier OT uses is allele-centric but the various annotations in PGKB are genotype-centric. I think I'll create a separate issue for this because it needs some discussion.
I agree that the ranges are generally interpretable (though I think in your examples you're checking the locations/alleles from dbSNP, which I feel are more consistent than in PharmGKB). The question is whether we can algorithmically translate them into consistent chr_pos_ref_alt identifiers.

M-casado commented 11 months ago

Glad to know it helped.

@apriltuesday, feel free to steal the diagram :)
If we are indeed exploding by genotype, I may need more details to understand how, because I always took the uniqueness tuple of an evidence string that I wrote here years ago, and I don't see how it contains any information about the pair of alleles, rather than a single allele.
True, I was using the rsIDs, but also I think I checked them from within the spreadsheet you provided. Perhaps I should double check. And I guess we cannot disregard PGMKB as the source because even if we took the coordinates and context from NCBI, the link between them and the annotation wouldn't be straightforward to make (?)

apriltuesday commented 11 months ago

On exploding by genotype, here are some examples, there are also examples in the test output but I realise that's really hard to read in github...

You're absolutely correct that the uniqueness checks will need to be different from PharmGKB as opposed to ClinVar, and we should both document those and implement the duplication checks in this repo.

apriltuesday commented 11 months ago

@M-casado @tcezard Did a complete run using the code in the PR, here are all the variants where the reference (determined by location) was not found among the alleles annotated by PGKB:

Variant	Location	Reference	Alleles
rs10170310	NC_000002.12:138521352	G	C,A
rs1048943	NC_000015.10:74720644	T	A,G
rs111618861	NC_000008.11:56131824_56131828	CAAAAA	C,CA
rs11280056	NC_000018.10:673444	T	TTAAAGTTA,TTA
rs144854329	NC_000006.12:43768679_43768698	TGGTCCCACTCTTCCCACAGG	T,TGGTCCCACTCTTCCCACA
rs17885382	NC_000006.12:32584318	C	T,A
rs201279313	NC_000004.12:127735888_127735890	TATT	T,TTTA
rs2032582	NC_000007.14:87531302	A	T,C
rs2228171	NC_000002.12:169196995	C	A,G
rs2298383	NC_000022.11:24429543	C	A,G
rs28364032	NC_000017.11:45834976	G	T,C
rs35068180	NC_000011.10:102845217_102845220	GAAAA	GA,G
rs57064725	NC_000015.10:78540694	C	A,G
rs61767072	NC_000004.12:3767577_3767588	GGGCCGGGGGCGG	GGGGGCGGGGCCG,G
rs61824877	NC_000001.11:200273504	G	T,A
rs700518	NC_000015.10:51236915	T	C,A
rs71486745	NC_000010.11:94936018_94936021	CTGTG	C,CGT
rs72549309	NC_000001.11:97740411_97740418	GATGAATGA	G,GATGA
rs746071566	NC_000013.11:48037783_48037801	AGGAGTCGGAGTCGGAGTCG	A,AGGAGTC

I just checked the first one rs10170310 and that one's all correct, the annotation really does not contain the reference allele.

Also summary stats from the full run if you're curious:

Total clinical annotations: 5073
        With RS: 4477 (88.25%)
                1. Exploded by allele: 13497 (3.0x)
                2. Exploded by drug: 18830 (1.4x)
                3. Exploded by phenotype: 23086 (1.2x)
Total evidence strings: 25401
        With CHEBI: 21154 (83.28%)
        With EFO phenotype: 8381 (32.99%)
        With functional consequence: 23271 (91.61%)
        With VEP gene: 23271 (91.61%)
Gene comparisons per annotation
        With PGKB genes: 4220 (83.19%)
        With VEP genes: 4093 (80.68%)
        PGKB genes != VEP genes: 774 (15.26%)

M-casado commented 11 months ago

Great work, @apriltuesday. The explosion stats are very handy to know what increments the evidence strings.

Re. the variants, what was the problem of not having the reference annotated in PGKB? I can't remember right now if that was a bottleneck at some step. Wouldn't we simply not have an evidence string for the referenced, but be still able to annotate perfectly the alternative alleles?

apriltuesday commented 11 months ago

It's because we can't tell if reference not being in PGKB means we're misinterpreting the coordinates / there's an error in the data (as in rs201279313), or the reference truly is not being annotated (e.g. rs10170310).

I think at this point we need to figure out which are definitely in the first category, so we can send PGKB an email with some examples and ask for clarification. I've added them in a tab to the spreadsheet so we can chip away at them together, there's not too many so it's hopefully not a huge task.

apriltuesday commented 11 months ago

Whoops, probably shouldn't have closed this with the PR...

M-casado commented 11 months ago

I assume that rs201279313 and similar cases won't have as much impact, given their number and the fact that the outcome of the variant seems to be the same. We do need to handle them carefully, like you said, to get their reference if we are trusting the coordinates.

apriltuesday commented 11 months ago

I've made a first pass at this here and added my comments, if you want to take a look and add your thoughts (and check my work...)

apriltuesday commented 3 months ago

Closing this, as after #37 the only variants without identifiers are repeats (#17) and mitochondrial (#38).

EBIvariation / opentargets-pharmgkb