Open davmlaw opened 1 year ago
@TheMadBug - are you able to paste your ClinVar XML code to this issue? Thanks
see attached - this was originally developed to change ClinVar's giant XML into manageable CSV files (500 rows each) for importing as test data into Shariant Test.
So my current plan was to still use the VCF for overall details (number of stars, pathogenicity etc), then lazily load the actual records for expert panels.
Unless you need the data during analysis (which you might), I think this might be a superior solution, as otherwise we need to load a huge amount of data into our DB as well as re-create the overall metadata.
That said, if it's needed for analysis, then I guess we don't have much of a choice - it seems like we'd insert a very stripped-down copy of the ClinVar classifications into their own table, and then run a post step which re-creates overall ClinVar stats for each variant (e.g. max number of stars, overall value, etc.).
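For illustration, here's a minimal sketch of that two-step idea (the names and fields are hypothetical, not existing VariantGrid models): keep one stripped-down row per ClinVar submission, then run a post step that rolls them back up per variant.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClinVarSubmissionRow:
    """Hypothetical stripped-down copy of a single ClinVar submission."""
    variation_id: int
    lab: str
    clinical_significance: str
    stars: int


def recreate_overall_stats(rows):
    """Post step: re-create overall ClinVar stats (max stars etc.) per variant."""
    by_variation = defaultdict(list)
    for row in rows:
        by_variation[row.variation_id].append(row)

    summaries = {}
    for variation_id, subs in by_variation.items():
        summaries[variation_id] = {
            "max_stars": max(s.stars for s in subs),
            # Placeholder for an "overall value" - real ClinVar aggregation rules are more involved
            "significances": sorted({s.clinical_significance for s in subs}),
            "num_submissions": len(subs),
        }
    return summaries
```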
Do you have time for a Zoom to discuss?
Before we meet - I think I need to actually look at the XML and see what's in there
1/2 way through something - how about zoom Monday?
So the XML is the same XML Shariant submits to ClinVar - so it does contain the lab, the reviewer status etc. For testing when I convert the XML to Classifications, I do ignore a lot of it, but you can see many many sample classifications in Shariant Test e.g. https://test.shariant.org.au/classification/classification/104670
To my knowledge it is record by record, with no summary data.
The XML contains submissions, i.e. individual classifications - they are not grouped by Variation ID (which maps to our alleles - a ClinVar allele is basically a locus)
Counting Variation IDs in VCF vs XML
XML: 2268459 VCF: 2248786 Common: 2181208 XML only: 87251, vcf_only: 67578
Why are some in the VCF only? They were all new - the XML was from 2023-07-06, the VCF was clinvar_20230730.vcf.gz. Looking up the VCF-only ones, they are all very new, e.g. July 8th 2023
So we are talking about 4% of ClinVar in XML but not VCF
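For reference, the overlap numbers above are just set arithmetic on the two Variation ID collections - something along these lines, assuming an xml_variation_set is filled the same way vcf_variation_id is in the script below:

```python
# Sketch only: xml_variation_set / vcf_variation_id are sets of Variation IDs
# collected from the XML and VCF respectively (as in the script below)
common = xml_variation_set & vcf_variation_id
print(f"XML: {len(xml_variation_set)} VCF: {len(vcf_variation_id)} Common: {len(common)} "
      f"XML only: {len(xml_variation_set - vcf_variation_id)}, "
      f"vcf_only: {len(vcf_variation_id - xml_variation_set)}")
```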
But how many can we actually load? Not that many... see attachment; will do stats tomorrow
```python
import gzip

import pandas as pd
from cyvcf2 import VCF
from lxml import etree

# Collect every Variation ID present in the ClinVar VCF
vcf_filename = "/data/annotation/variantgrid_setup_data/clinvar/clinvar_20230730.vcf.gz"
vcf_variation_id = set()
for record in VCF(vcf_filename):
    vcf_variation_id.add(record.ID)

######

xml_variation_set = set()
num_records = 0
path_to_file = '/data/annotation/variantgrid_setup_data/clinvar/ClinVarFullRelease_00-latest.xml.gz'
new_data = []
with gzip.open(path_to_file, 'rb') as file:
    context = etree.iterparse(file, events=('end',), tag='{*}ClinVarSet')
    # Iterate through the records
    for event, elem in context:
        num_records += 1
        if num_records % 100000 == 0:
            print(f"Processed {num_records}... kept rows = {len(new_data)}")
        # Extract some specific data
        rcva = elem.find('ReferenceClinVarAssertion')
        if rcva is not None:
            clinical_significance = elem.find('.//{*}ClinicalSignificance/{*}Description')
            measure_set = rcva.find('MeasureSet')
            if measure_set is not None:
                variation_id = measure_set.attrib["ID"]
                xml_variation_set.add(variation_id)  # keep all XML Variation IDs for comparison vs VCF
                if variation_id not in vcf_variation_id:
                    # novel in XML
                    row = {
                        "variation_id": variation_id,
                        "type": measure_set.attrib["Type"],
                    }
                    if clinical_significance is not None:
                        row["clinical_significance"] = clinical_significance.text
                    preferred_name = measure_set.find('.//ElementValue[@Type="Preferred"]')
                    if preferred_name is not None:
                        row["preferred_name"] = preferred_name.text
                    g_hgvs_37 = measure_set.find('.//Attribute[@Type="HGVS, genomic, top level, previous"][@integerValue="37"]')
                    if g_hgvs_37 is not None:
                        row["g_hgvs_37"] = g_hgvs_37.text
                    g_hgvs_38 = measure_set.find('.//Attribute[@Type="HGVS, genomic, top level"][@integerValue="38"]')
                    if g_hgvs_38 is not None:
                        row["g_hgvs_38"] = g_hgvs_38.text
                    new_data.append(row)
        # Clear the element to free memory
        elem.clear()
        # Also clear any previous siblings and references to free more memory
        while elem.getprevious() is not None:
            del elem.getparent()[0]

###

df = pd.DataFrame.from_dict(new_data)
df.to_csv("not_in_vcf.csv", index=False)
```
```python
import numpy as np
import pandas as pd

df = pd.read_csv("./not_in_vcf.csv")

# df.loc[df['preferred_name'].str.contains('\?', regex=True, na=False), 'name_type'] = 'uncertain'

has_cdot = df["preferred_name"].str.contains(":c.", regex=False, na=False)
missing_g_hgvs = pd.isna(df["g_hgvs_37"]) & pd.isna(df["g_hgvs_38"])
df.loc[missing_g_hgvs & ~has_cdot, "name_type"] = "no g.HGVS"

# c.HGVS with uncertain positions, e.g. "?" or "(...)_" / "_(...)"
c_uncertain_mask = df["preferred_name"].str.contains(r"\?", regex=True, na=False)
c_uncertain_mask |= df["preferred_name"].str.contains(r"_\(", regex=True, na=False)
c_uncertain_mask |= df["preferred_name"].str.contains(r"\)_", regex=True, na=False)
df.loc[missing_g_hgvs & has_cdot & c_uncertain_mask, "name_type"] = "c.HGVS only - uncertain"

# ISCN-style names, e.g. "...x3" or cytogenetic coordinates like "13q12.3"
iscn_mask = df["preferred_name"].str.contains(r"\)x\d", na=False)
iscn_mask |= df["preferred_name"].str.contains(r"\)\(", na=False)
iscn_mask |= df["preferred_name"].str.contains(r"(?:\d{1,2}|X|Y)[pq]\d{1,2}(\.\d{1,2})?(?:[;-](?:\d{1,2}|X|Y)[pq]\d{1,2}(\.\d{1,2})?)*[a-z]?(\(\d{1,7}_\d{1,7}\))?[xX]?\d*", regex=True)
df.loc[missing_g_hgvs & iscn_mask, "name_type"] = "ISCN"

has_colon = df["g_hgvs_37"].str.contains(":", na=False) | df["g_hgvs_38"].str.contains(":", na=False)
no_colon = ~missing_g_hgvs & ~has_colon
df.loc[no_colon, "name_type"] = "g.HGVS missing colon"

uncertain_mask = df["g_hgvs_37"].str.contains(r"\?", regex=True, na=False) | df["g_hgvs_38"].str.contains(r"\?", regex=True, na=False)
uncertain_mask |= df["g_hgvs_37"].str.contains(r"_\(", regex=True, na=False) | df["g_hgvs_38"].str.contains(r"_\(", regex=True, na=False)
uncertain_mask |= df["g_hgvs_37"].str.contains(r"\)_", regex=True, na=False) | df["g_hgvs_38"].str.contains(r"\)_", regex=True, na=False)
df.loc[uncertain_mask, "name_type"] = "g.HGVS uncertain"

## Try loading + get length
from hgvs.parser import Parser

parser = Parser()
potentials_mask = pd.isna(df["name_type"])
potentials_df = df[potentials_mask]
potentials_series = potentials_df["name_type"].copy()
error_series = pd.Series(index=df.index, dtype=str)
length_series = pd.Series(index=df.index, dtype=float)  # float so rows without a length stay NaN
for i, row in potentials_df.iterrows():
    # Use the GRCh37 g.HGVS if present, otherwise GRCh38 (can't use "or" here as NaN is truthy)
    g_hgvs = row["g_hgvs_37"] if not pd.isna(row["g_hgvs_37"]) else row["g_hgvs_38"]
    length = None
    if not pd.isna(g_hgvs):
        try:
            var_g = parser.parse_hgvs_variant(g_hgvs)
            potentials_series[i] = "HGVS ok (g.)"
            length = var_g.posedit.pos.end - var_g.posedit.pos.start
        except Exception as e:
            error_series[i] = str(e)
            potentials_series[i] = "HGVS fail (g.)"
    else:
        c_hgvs = row["preferred_name"]
        try:
            var_c = parser.parse_hgvs_variant(c_hgvs)
            potentials_series[i] = "c.HGVS ok (c.)"
            length = var_c.posedit.pos.end - var_c.posedit.pos.start
        except Exception as e:
            error_series[i] = str(e)
            potentials_series[i] = "c.HGVS fail (c.)"
    if length is not None:
        length_series[i] = length

df.loc[potentials_mask, "name_type"] = potentials_series
df["error"] = error_series
df["length"] = length_series

df.value_counts("name_type")
```
```
name_type
g.HGVS uncertain           54900
ISCN                       26763
HGVS ok (g.)                2792
no g.HGVS                   1573
c.HGVS only - uncertain      819
c.HGVS fail (c.)             322
c.HGVS ok (c.)               183
HGVS fail (g.)                81
dtype: int64
```
So - assuming we care about length > 10k - that's 1765 CNVs in ClinVar XML we can use
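A filter along these lines (my sketch using the columns produced above, with the 10 kb cut-off applied to the parsed length) gives that kind of count:

```python
# Sketch only: keep records that parsed as HGVS and span more than 10 kb
usable_cnvs = df[df["name_type"].isin(["HGVS ok (g.)", "c.HGVS ok (c.)"]) & (df["length"] > 10_000)]
print(len(usable_cnvs))
```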
See spreadsheet
TODO:
Minimal CNV support
Future work - put off for now
I don't think it's worth changing everything and switching to the XML file now - can do it after SA Path upgrade
ClinVar are changing their XML format in fall of 2023 to better share somatic info - see preview info
VariantGrid currently loads ClinVar against Variant - this was historical, as we started off with one build and used VCFs (which are per-build). If we go to the XML we could potentially change to link against Allele instead, as that would allow us to have only one record (rather than one per build). We already handle some things like this in Analysis (e.g. classifications and variant tags - there is only a slight performance hit vs going via variant)
We could potentially store all of the submissions/classifications in related objects, which would be useful for showing people the individual classifications on the variant page and for generating discordances with ClinVar
Best to create a whole new model and use the new Postgres partitions - we can load in parallel with the old ones, then switch via config to allow testing before deleting the old ones.
ClinVar contains both builds (and the ClinGen allele ID), so we could use it for liftover during import by writing both builds linked to the allele (liftover type = ClinVar)
Biocommons HGVS doesn’t support uncertain offsets - we could implement this (see the small check after this list)
ISCN support? Have people entered this in the search box by chance? Did I put this somewhere? Maybe a Google doc?
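To make the biocommons hgvs point concrete, a small check along these lines shows how an uncertain ("(start1_start2)_(end1_end2)") range of the kind ClinVar uses for many CNVs is handled (the variant string is made up for illustration):

```python
from hgvs.parser import Parser
from hgvs.exceptions import HGVSError

parser = Parser()
# Illustrative uncertain-range dup; at the time of writing these fail at parse time
uncertain = "NC_000013.11:g.(32315000_32319000)_(32335000_32340000)dup"
try:
    var = parser.parse_hgvs_variant(uncertain)
    print(f"parsed ok: {var}")
except HGVSError as e:
    print(f"hgvs could not parse uncertain range: {e}")
```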
A few questions about the XML:
Other ClinVar XML code
The MacArthur lab (gnomAD) has a ClinVar XML-to-CSV tool - but it says "DEPRECATED: this code is out of date and no longer being maintained."
We currently load ClinVar from separate GRCh37/38 VCFs
We should switch to the XML as it contains more variants and more info.
For analysis performance reasons we probably need it to continue to go into its own partition per build, and have a 1-to-1 model to Variant that summarises things
Lab-specific classifications
You can get stars, but at the moment you can't see which labs submitted (some are more trustworthy than others) - we would like to bring in this info. Maybe it'll go into JSON.
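If it does go into JSON, it could be something along these lines (an entirely hypothetical shape, just to illustrate keeping per-lab detail alongside the summary):

```python
# Hypothetical shape only - not an existing VariantGrid/ClinVar format
clinvar_variant_summary = {
    "variation_id": 12345,
    "stars": 2,
    "clinical_significance": "Pathogenic",
    "submissions": [
        {"lab": "Lab A", "clinical_significance": "Pathogenic", "stars": 1},
        {"lab": "Lab B", "clinical_significance": "Likely pathogenic", "stars": 1},
    ],
}
```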
CNVs
To find CNVs in ClinVar, the easiest way is to search, then refine the search by type or size
Here is a 1kb one: NC_000013.10:g.32889619_32890666dup - that is in the VCF
5kb - NC_000013.11:g.32343971_32349243dup - in the VCF after Nov 2022 (not in vg.com as that's a bit old)
16kb - NC_000013.11:g.32319209_32335388dup - not in the VCF
Download latest ClinVar XML