Closed apriltuesday closed 3 months ago
Current idea is to use PGKB's variant table which provides locations (RefSeq chromosome + position) for all rsIDs in PGKB data. For example these are strings like NC_000001.11:46399999
or sometimes NC_000019.10:38499646_38499648
.
Then we should be able to get the reference and context bases from the FASTA file, and alternate allele(s) from the genotypes information in the clinical alleles table.
Quick check that parsing variant locations will be straightforward:
variants = read_tsv_to_df('variants.tsv')
# 6643 total
len(variants)
# 6645 non-empty
variants['Location'].count()
# only the 8 empty locations
variants[~variants['Location'].str.match('NC_[0-9.]+:[0-9]+(_[0-9]+)?', na=False)]
# 38 with a range
variants[variants['Location'].str.count('_') > 1].dropna()
BTW, of the 8 variants missing locations, several are deprecated or merged, but a couple are valid current RS that we could get locations for from Ensembl or NCBI.
rs1799735
rs2266637
rs29000568
rs35320474
rs36056065
rs3834935
rs4630
rs58425034
Should also confirm what we should do for variants like rs10420097
(dbSNP report, PGKB annotation). Should this one have two identifiers corresponding to the two alternate alleles that have annotations, i.e. 19_57633193_A_C
and 19_57633193_A_G
?
And then do we explode on the identifier? And what should we do about the genotypes?
Note on ranges in locations:
Logically I thought this should mean the indicated range is deleted from the reference, but I guess this was my mistake for trying to apply logic to variant descriptions... Some of them do line up in this way, for some the lengths match up but I don't see how they align to the actual reference, and for others I'm just clueless. Here are some examples:
Variant | Location | Genotype/Allele |
---|---|---|
rs36120609 | NC_000001.11:109737631_109737635 | TC/TC |
rs36120609 | NC_000001.11:109737631_109737635 | TC/TCCTC |
rs36120609 | NC_000001.11:109737631_109737635 | TCCTC/TCCTC |
rs201279313 | NC_000004.12:127735888_127735890 | TTA/TTA |
rs201279313 | NC_000004.12:127735888_127735890 | TTA/del |
rs201279313 | NC_000004.12:127735888_127735890 | del/del |
rs35068180 | NC_000011.10:102845217_102845220 | A/del |
rs35068180 | NC_000011.10:102845217_102845220 | AA |
rs35068180 | NC_000011.10:102845217_102845220 | del/del |
rs746071566 | NC_000013.11:48037783_48037801 | GGAGTC/GGAGTC |
rs746071566 | NC_000013.11:48037783_48037801 | GGAGTC/del |
rs746071566 | NC_000013.11:48037783_48037801 | del/del |
Thoughts on this @tcezard @M-casado? I've dumped all the variants with range locations in a spreadsheet here in case it's useful.
I've had a look at some of your example to try to see if there are enough of a pattern to find the algorithm. I'm posting the example for now: rs36120609
REF: 31 35
| |
TCCTC
G1:TC/TC
TC---
TC---
G2:TC/TCCTC
TC---
TCCTC
G3:TCCTC/TCCTC
TCCTC
TCCTC
rs201279313 is set the start to 127735888 in PharmGKB but dbsnp displays it two bases earlier
REF: 86 90
| |
TTATT
G1:TTA/TTA
TTATT
TTATT
G2:TTA/del
TTATT
---TT
G3:del/del
---TT
---TT
rs35068180 is set the end to 102845220 in PharmGKB but dbsnp displays it finishing on base after also in the second genotype the alleles are not separated by slash
REF: 17 21
| |
AAAAA
G1:A/del
AAAAA
AAAA-
G2:AA
AAAAA
AAAAA
G3:del/del
AAAA-
AAAA-
REF: 1415
||
TT
G1:G/TT
G-
TT
G2:GG
G-
G-
G3:TT/TT
TT
TT
Discussed a bit offline... I'll email PGKB to ask about how they represent the variant location. For now I can implement a (relatively) simple algorithm assuming the location specifies the reference allele, and skip & report if it's not found in the genotypes. This works for the first & last of Tim's examples, but not the other two.
Took me some time to get a bit familiar with these files and formats, but here's my take on the issue. As a summary, what you plan to do is the following, right?
graph TD
J[PharmGKB]
H[FASTA files]
E[Clinival alleles table]
A[Variant table]
D[Generate 'chr_pos_ref_alt' identifier]
S[NCBI Genome Assembly]
J --> A
J --> E
S --> H
A --> |locations 'Chr+Pos'| D
H --> |Reference + context| D
E --> |Alternate alleles| D
I parsed as well the variants.tsv
file, and I think the numbers increased a bit since your last parsing (or the version you got), I think in the count there were some mixed numbers (6645
non-empty and 6643
total) 😅:
import pandas as pd
df = pd.read_table('variants.tsv')
total_rows = len(df)
print("TOTAL:\t", total_rows)
non_empty_counts = df.count()
columns_with_missing_values = non_empty_counts[non_empty_counts < total_rows]
print(columns_with_missing_values)
TOTAL: 6797
Gene IDs 6151
Gene Symbols 6151
Location 6788
The only change is that there's one more row with missing location.
Correct me if I'm wrong, but given the way that we treat evidence strings normally (i.e. separating by variant), I think exploding by the chr_pos_ref_alt identifier would make sense. Especially if we have different clinical annotations for each, although I doubt it. For the example you gave (rs10420097 ), the clinical count seems to be 1 : |
Variant ID | Variant Name | Gene IDs | Gene Symbols | Location | Variant Annotation count | Clinical Annotation count | Level 1/2 Clinical Annotation count | Guideline Annotation count | Label Annotation count | Synonyms |
---|---|---|---|---|---|---|---|---|---|---|---|
PA166176938 | rs10420097 | PA37582 | ZNF211 | NC_000019.10:57633193 | 1 | 1 | 0 | 0 | 0 | NC_000019.10:g.57633193A>C, NC_000019.10:g.57633193A>C, ... |
With regards to the rest of the evidence strings, we don't explode by genotype, do we?
I checked a few examples like Tim and so far:
REF: 14 15
| |
T T
G1:G/TT
G-
TT
G2:GG
G-
G-
G3:TT/TT
TT
TT
chr13:48037783-48037801 (GRCh38.p14)
REF: 782 783 801
| | |
G GAGTCGGAGTCGGAGTCG
A1: delGAGTCG
G GAGTCGGAGTCG------
A2: dupGAGTCG
782 783 801 807
| | | |
G GGAGTCGGAGTCGGAGTCGGAGTCG
A3: dup(GAGTCG)â‚‚
782 783 801 807 813
| | | | |
G GGAGTCGGAGTCGGAGTCGGAGTCGGAGTCG
chr11:102845217-102845221
REF: ..217 ..221
| |
A AAA A
G1:delA
..217 ..221
| |
A AAA -
G2:dupA
..217 ..222
| |
A AAA A A
G3:dupAA
..217 ..223
| |
A AAA A AA
Regarding what @tcezard said:
rs201279313 is set the start to 127735888 in PharmGKB but dbsnp displays it two bases earlier
The reason in this case is that the sequence is palindromic, and both PGKB and dbSNP are correct (i.e. resulting sequence change is TTATT --> TT
):
chr4:127735886-127735890
REF: 86 90
| |
T TAT T
- dbSNP: delATT
86 90
| |
T T-- -
- PGKB: delTTA
86 90
| |
- --T T
It has to do with where they took these variants from. In the variant info for this rsID (PA166156703), you can see that the source is PGKB or dbSNP:
rs35068180 is set the end to 102845220 in PharmGKB but dbsnp displays it finishing on base after also in the second genotype the alleles are not separated by slash
I haven't checked it in detail, but seeing how it is all Adenines, I assume it's a similar case.
Great find @M-casado. That makes sense.
rs201279313 is defined as TTA > -
in GRCh37 by PharmGKB and this is the value that is being provided in clinical ann alleles.tsv
table
It is then combine with the position of the variants.tsv
table which provide the GRCh38 position.
That's why there is a disconnect between the position and the alleles found. I'm wondering if there is a way to detect if the allele has been define in a different assembly.
Thanks Marcos, taking your points in order:
chr_pos_ref_alt
identifiers.Glad to know it helped.
On exploding by genotype, here are some examples, there are also examples in the test output but I realise that's really hard to read in github...
You're absolutely correct that the uniqueness checks will need to be different from PharmGKB as opposed to ClinVar, and we should both document those and implement the duplication checks in this repo.
@M-casado @tcezard Did a complete run using the code in the PR, here are all the variants where the reference (determined by location) was not found among the alleles annotated by PGKB:
Variant | Location | Reference | Alleles |
---|---|---|---|
rs10170310 | NC_000002.12:138521352 | G | C,A |
rs1048943 | NC_000015.10:74720644 | T | A,G |
rs111618861 | NC_000008.11:56131824_56131828 | CAAAAA | C,CA |
rs11280056 | NC_000018.10:673444 | T | TTAAAGTTA,TTA |
rs144854329 | NC_000006.12:43768679_43768698 | TGGTCCCACTCTTCCCACAGG | T,TGGTCCCACTCTTCCCACA |
rs17885382 | NC_000006.12:32584318 | C | T,A |
rs201279313 | NC_000004.12:127735888_127735890 | TATT | T,TTTA |
rs2032582 | NC_000007.14:87531302 | A | T,C |
rs2228171 | NC_000002.12:169196995 | C | A,G |
rs2298383 | NC_000022.11:24429543 | C | A,G |
rs28364032 | NC_000017.11:45834976 | G | T,C |
rs35068180 | NC_000011.10:102845217_102845220 | GAAAA | GA,G |
rs57064725 | NC_000015.10:78540694 | C | A,G |
rs61767072 | NC_000004.12:3767577_3767588 | GGGCCGGGGGCGG | GGGGGCGGGGCCG,G |
rs61824877 | NC_000001.11:200273504 | G | T,A |
rs700518 | NC_000015.10:51236915 | T | C,A |
rs71486745 | NC_000010.11:94936018_94936021 | CTGTG | C,CGT |
rs72549309 | NC_000001.11:97740411_97740418 | GATGAATGA | G,GATGA |
rs746071566 | NC_000013.11:48037783_48037801 | AGGAGTCGGAGTCGGAGTCG | A,AGGAGTC |
I just checked the first one rs10170310 and that one's all correct, the annotation really does not contain the reference allele.
Also summary stats from the full run if you're curious:
Total clinical annotations: 5073
With RS: 4477 (88.25%)
1. Exploded by allele: 13497 (3.0x)
2. Exploded by drug: 18830 (1.4x)
3. Exploded by phenotype: 23086 (1.2x)
Total evidence strings: 25401
With CHEBI: 21154 (83.28%)
With EFO phenotype: 8381 (32.99%)
With functional consequence: 23271 (91.61%)
With VEP gene: 23271 (91.61%)
Gene comparisons per annotation
With PGKB genes: 4220 (83.19%)
With VEP genes: 4093 (80.68%)
PGKB genes != VEP genes: 774 (15.26%)
Great work, @apriltuesday. The explosion stats are very handy to know what increments the evidence strings.
Re. the variants, what was the problem of not having the reference annotated in PGKB? I can't remember right now if that was a bottleneck at some step. Wouldn't we simply not have an evidence string for the referenced, but be still able to annotate perfectly the alternative alleles?
It's because we can't tell if reference not being in PGKB means we're misinterpreting the coordinates / there's an error in the data (as in rs201279313), or the reference truly is not being annotated (e.g. rs10170310).
I think at this point we need to figure out which are definitely in the first category, so we can send PGKB an email with some examples and ask for clarification. I've added them in a tab to the spreadsheet so we can chip away at them together, there's not too many so it's hopefully not a huge task.
Whoops, probably shouldn't have closed this with the PR...
I assume that rs201279313 and similar cases won't have as much impact, given their number and the fact that the outcome of the variant seems to be the same. We do need to handle them carefully, like you said, to get their reference if we are trusting the coordinates.
I've made a first pass at this here and added my comments, if you want to take a look and add your thoughts (and check my work...)
Closing this, as after #37 the only variants without identifiers are repeats (#17) and mitochondrial (#38).
Necessary to get context bases for deletions, also removes need for API calls to Ensembl.