Open davmlaw opened 1 year ago
@TheMadBug - are you able to paste your ClinVar XML code to this issue? Thanks
see attached - this was originally developed to change ClinVar's giant XML into manageable CSV files (500 rows each) for importing as test data into Shariant Test.
So my current plan was to still use the VCF for overall details (number of stars, pathogenicity etc), then lazily load the actual records for expert panels.
Unless you need the data during analysis (which you might), I think this might be a superior solution, as otherwise we need to load a huge amount of data into our DB as well as re-create the overall metadata.
That said, if it's needed for analysis, then I guess we don't have much of a choice - it seems like we'd insert a very stripped-down copy of the ClinVar classifications into their own table, and then run a post step which re-creates overall ClinVar stats for each variant (e.g. max number of stars, overall value, etc.).
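For illustration, here's a minimal sketch of that two-step idea (the names and fields are hypothetical, not existing VariantGrid models): keep one stripped-down row per ClinVar submission, then run a post step that rolls them back up per variant.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClinVarSubmissionRow:
    """Hypothetical stripped-down copy of a single ClinVar submission."""
    variation_id: int
    lab: str
    clinical_significance: str
    stars: int


def recreate_overall_stats(rows):
    """Post step: re-create overall ClinVar stats (max stars etc.) per variant."""
    by_variation = defaultdict(list)
    for row in rows:
        by_variation[row.variation_id].append(row)

    summaries = {}
    for variation_id, subs in by_variation.items():
        summaries[variation_id] = {
            "max_stars": max(s.stars for s in subs),
            # Placeholder for an "overall value" - real ClinVar aggregation rules are more involved
            "significances": sorted({s.clinical_significance for s in subs}),
            "num_submissions": len(subs),
        }
    return summaries
```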
Do you have time for a Zoom to discuss?
Before we meet - I think I need to actually look at the XML and see what's in there
1/2 way through something - how about zoom Monday?
So the XML is the same XML Shariant submits to ClinVar - so it does contain the lab, the reviewer status etc. For testing when I convert the XML to Classifications, I do ignore a lot of it, but you can see many many sample classifications in Shariant Test e.g. https://test.shariant.org.au/classification/classification/104670
To my knowledge it is record by record, with no summary data.
The XML contains submissions, i.e. individual classifications - they are not grouped by Variation ID (which maps to our alleles - a ClinVar allele is basically a locus)
Counting Variation IDs in VCF vs XML
XML: 2268459 VCF: 2248786 Common: 2181208 XML only: 87251, vcf_only: 67578
Why are some in the VCF only? They were all new - the XML was from 2023-07-06, the VCF was clinvar_20230730.vcf.gz. Looking up the VCF-only ones, they are all very new, e.g. July 8th 2023
So we are talking about 4% of ClinVar in XML but not VCF
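For reference, the overlap numbers above are just set arithmetic on the two Variation ID collections - something along these lines, assuming an xml_variation_set is filled the same way vcf_variation_id is in the script below:

```python
# Sketch only: xml_variation_set / vcf_variation_id are sets of Variation IDs
# collected from the XML and VCF respectively (as in the script below)
common = xml_variation_set & vcf_variation_id
print(f"XML: {len(xml_variation_set)} VCF: {len(vcf_variation_id)} Common: {len(common)} "
      f"XML only: {len(xml_variation_set - vcf_variation_id)}, "
      f"vcf_only: {len(vcf_variation_id - xml_variation_set)}")
```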
But how many can we actually load? Not that many... see attachment; will do stats tomorrow
```python
import gzip

import pandas as pd
from cyvcf2 import VCF
from lxml import etree

# Collect every Variation ID present in the ClinVar VCF
vcf_filename = "/data/annotation/variantgrid_setup_data/clinvar/clinvar_20230730.vcf.gz"
vcf_variation_id = set()
for record in VCF(vcf_filename):
    vcf_variation_id.add(record.ID)

######

xml_variation_set = set()
num_records = 0
path_to_file = '/data/annotation/variantgrid_setup_data/clinvar/ClinVarFullRelease_00-latest.xml.gz'
new_data = []
with gzip.open(path_to_file, 'rb') as file:
    context = etree.iterparse(file, events=('end',), tag='{*}ClinVarSet')
    # Iterate through the records
    for event, elem in context:
        num_records += 1
        if num_records % 100000 == 0:
            print(f"Processed {num_records}... kept rows = {len(new_data)}")
        # Extract some specific data
        rcva = elem.find('ReferenceClinVarAssertion')
        if rcva is not None:
            clinical_significance = elem.find('.//{*}ClinicalSignificance/{*}Description')
            measure_set = rcva.find('MeasureSet')
            if measure_set is not None:
                variation_id = measure_set.attrib["ID"]
                xml_variation_set.add(variation_id)  # keep all XML Variation IDs for comparison vs VCF
                if variation_id not in vcf_variation_id:
                    # novel in XML
                    row = {
                        "variation_id": variation_id,
                        "type": measure_set.attrib["Type"],
                    }
                    if clinical_significance is not None:
                        row["clinical_significance"] = clinical_significance.text
                    preferred_name = measure_set.find('.//ElementValue[@Type="Preferred"]')
                    if preferred_name is not None:
                        row["preferred_name"] = preferred_name.text
                    g_hgvs_37 = measure_set.find('.//Attribute[@Type="HGVS, genomic, top level, previous"][@integerValue="37"]')
                    if g_hgvs_37 is not None:
                        row["g_hgvs_37"] = g_hgvs_37.text
                    g_hgvs_38 = measure_set.find('.//Attribute[@Type="HGVS, genomic, top level"][@integerValue="38"]')
                    if g_hgvs_38 is not None:
                        row["g_hgvs_38"] = g_hgvs_38.text
                    new_data.append(row)
        # Clear the element to free memory
        elem.clear()
        # Also clear any previous siblings and references to free more memory
        while elem.getprevious() is not None:
            del elem.getparent()[0]

###

df = pd.DataFrame.from_dict(new_data)
df.to_csv("not_in_vcf.csv", index=False)
```
```python
import numpy as np
import pandas as pd

df = pd.read_csv("./not_in_vcf.csv")

# df.loc[df['preferred_name'].str.contains('\?', regex=True, na=False), 'name_type'] = 'uncertain'

has_cdot = df["preferred_name"].str.contains(":c.", regex=False, na=False)
missing_g_hgvs = pd.isna(df["g_hgvs_37"]) & pd.isna(df["g_hgvs_38"])
df.loc[missing_g_hgvs & ~has_cdot, "name_type"] = "no g.HGVS"

# c.HGVS with uncertain positions, e.g. "?" or "(...)_" / "_(...)"
c_uncertain_mask = df["preferred_name"].str.contains(r"\?", regex=True, na=False)
c_uncertain_mask |= df["preferred_name"].str.contains(r"_\(", regex=True, na=False)
c_uncertain_mask |= df["preferred_name"].str.contains(r"\)_", regex=True, na=False)
df.loc[missing_g_hgvs & has_cdot & c_uncertain_mask, "name_type"] = "c.HGVS only - uncertain"

# ISCN-style names, e.g. "...x3" or cytogenetic coordinates like "13q12.3"
iscn_mask = df["preferred_name"].str.contains(r"\)x\d", na=False)
iscn_mask |= df["preferred_name"].str.contains(r"\)\(", na=False)
iscn_mask |= df["preferred_name"].str.contains(r"(?:\d{1,2}|X|Y)[pq]\d{1,2}(\.\d{1,2})?(?:[;-](?:\d{1,2}|X|Y)[pq]\d{1,2}(\.\d{1,2})?)*[a-z]?(\(\d{1,7}_\d{1,7}\))?[xX]?\d*", regex=True)
df.loc[missing_g_hgvs & iscn_mask, "name_type"] = "ISCN"

has_colon = df["g_hgvs_37"].str.contains(":", na=False) | df["g_hgvs_38"].str.contains(":", na=False)
no_colon = ~missing_g_hgvs & ~has_colon
df.loc[no_colon, "name_type"] = "g.HGVS missing colon"

uncertain_mask = df["g_hgvs_37"].str.contains(r"\?", regex=True, na=False) | df["g_hgvs_38"].str.contains(r"\?", regex=True, na=False)
uncertain_mask |= df["g_hgvs_37"].str.contains(r"_\(", regex=True, na=False) | df["g_hgvs_38"].str.contains(r"_\(", regex=True, na=False)
uncertain_mask |= df["g_hgvs_37"].str.contains(r"\)_", regex=True, na=False) | df["g_hgvs_38"].str.contains(r"\)_", regex=True, na=False)
df.loc[uncertain_mask, "name_type"] = "g.HGVS uncertain"

## Try loading + get length
from hgvs.parser import Parser

parser = Parser()
potentials_mask = pd.isna(df["name_type"])
potentials_df = df[potentials_mask]
potentials_series = potentials_df["name_type"].copy()
error_series = pd.Series(index=df.index, dtype=str)
length_series = pd.Series(index=df.index, dtype=float)  # float so rows without a length stay NaN
for i, row in potentials_df.iterrows():
    # Use the GRCh37 g.HGVS if present, otherwise GRCh38 (can't use "or" here as NaN is truthy)
    g_hgvs = row["g_hgvs_37"] if not pd.isna(row["g_hgvs_37"]) else row["g_hgvs_38"]
    length = None
    if not pd.isna(g_hgvs):
        try:
            var_g = parser.parse_hgvs_variant(g_hgvs)
            potentials_series[i] = "HGVS ok (g.)"
            length = var_g.posedit.pos.end - var_g.posedit.pos.start
        except Exception as e:
            error_series[i] = str(e)
            potentials_series[i] = "HGVS fail (g.)"
    else:
        c_hgvs = row["preferred_name"]
        try:
            var_c = parser.parse_hgvs_variant(c_hgvs)
            potentials_series[i] = "c.HGVS ok (c.)"
            length = var_c.posedit.pos.end - var_c.posedit.pos.start
        except Exception as e:
            error_series[i] = str(e)
            potentials_series[i] = "c.HGVS fail (c.)"
    if length is not None:
        length_series[i] = length

df.loc[potentials_mask, "name_type"] = potentials_series
df["error"] = error_series
df["length"] = length_series

df.value_counts("name_type")
```
```
name_type
g.HGVS uncertain           54900
ISCN                       26763
HGVS ok (g.)                2792
no g.HGVS                   1573
c.HGVS only - uncertain      819
c.HGVS fail (c.)             322
c.HGVS ok (c.)               183
HGVS fail (g.)                81
dtype: int64
```
So - assuming we care about length > 10k - that's 1765 CNVs in ClinVar XML we can use
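A filter along these lines (my sketch using the columns produced above, with the 10 kb cut-off applied to the parsed length) gives that kind of count:

```python
# Sketch only: keep records that parsed as HGVS and span more than 10 kb
usable_cnvs = df[df["name_type"].isin(["HGVS ok (g.)", "c.HGVS ok (c.)"]) & (df["length"] > 10_000)]
print(len(usable_cnvs))
```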
See spreadsheet
TODO:
Minimal CNV support
Future work - put off for now
I don't think it's worth changing everything and switching to the XML file now - can do it after SA Path upgrade
ClinVar are changing their XML format in fall of 2023 to better share somatic info - see preview info
VariantGrid currently loads ClinVar against Variant - this was historical, as we started off with one build and used VCFs (which are per-build). If we go to the XML we could potentially change to link against Allele instead, as that would allow us to have only one record (rather than one per build). We already handle some things like this in Analysis (e.g. classifications and variant tags - there is only a slight performance hit vs going via variant)
We could potentially store all of the submissions/classifications in related objects, which would be useful for showing people the individual classifications on the variant page and for generating discordances with ClinVar
Best to create a whole new model and use the new Postgres partitions - we can load in parallel with the old ones, then switch via config to allow testing before deleting the old ones.
ClinVar contains both builds (and the ClinGen allele ID), so we could use it for liftover during import by writing both builds linked to the allele (liftover type = ClinVar)
Biocommons HGVS doesn’t support uncertain offsets - we could implement this (see the small check after this list)
ISCN support? Have people entered this in the search box by chance? Did I put this somewhere? Maybe a Google doc?
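To make the biocommons hgvs point concrete, a small check along these lines shows how an uncertain ("(start1_start2)_(end1_end2)") range of the kind ClinVar uses for many CNVs is handled (the variant string is made up for illustration):

```python
from hgvs.parser import Parser
from hgvs.exceptions import HGVSError

parser = Parser()
# Illustrative uncertain-range dup; at the time of writing these fail at parse time
uncertain = "NC_000013.11:g.(32315000_32319000)_(32335000_32340000)dup"
try:
    var = parser.parse_hgvs_variant(uncertain)
    print(f"parsed ok: {var}")
except HGVSError as e:
    print(f"hgvs could not parse uncertain range: {e}")
```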
A few questions about the XML:
Other ClinVar XML code
The MacArthur lab (gnomAD) has a ClinVar XML-to-CSV tool - but it says "DEPRECATED: this code is out of date and no longer being maintained."
We currently load ClinVar from separate GRCh37/38 VCFs
We should switch to the XML as it contains more variants and more info.
For analysis performance reasons we probably need it to continue to go into its own partition per build, and have a 1-to-1 model to Variant that summarises things
Lab-specific classifications
You can get stars, but at the moment you can't see which labs submitted (some are more trustworthy than others) - we would like to bring in this info. Maybe it'll go into JSON.
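If it does go into JSON, it could be something along these lines (an entirely hypothetical shape, just to illustrate keeping per-lab detail alongside the summary):

```python
# Hypothetical shape only - not an existing VariantGrid/ClinVar format
clinvar_variant_summary = {
    "variation_id": 12345,
    "stars": 2,
    "clinical_significance": "Pathogenic",
    "submissions": [
        {"lab": "Lab A", "clinical_significance": "Pathogenic", "stars": 1},
        {"lab": "Lab B", "clinical_significance": "Likely pathogenic", "stars": 1},
    ],
}
```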
CNVs
To find CNVs in ClinVar, the easiest way is to search, then refine the search by type or size
Here is a 1kb one: NC_000013.10:g.32889619_32890666dup - that is in the VCF
5kb - NC_000013.11:g.32343971_32349243dup - in the VCF after Nov 2022 (not in vg.com as that's a bit old)
16kb - NC_000013.11:g.32319209_32335388dup - not in the VCF
Download latest ClinVar XML