Open ikeseler opened 2 years ago
Aim: In the GCI, add the ability to add evidence and score ClinVar entries that are not otherwise described in a publication.
Collecting the SCV and then using all of the scoring features associated with the GCI is permitted.
Scope: A curator can enter a ClinVar ID in place of the “regular” PMID, metadata will be retrieved from ClinVar, and evidence from that ClinVar entry can be entered and scored.
Value: Increase the usability of the GCI by allowing curators to add evidence from additional sources.
@ikeseler was this scoped?
I followed up with Erin, Courtney and Marina to assess whether the proposed plan to use SCV IDs is sufficient. Sent: In the GCI, Gene-Disease record variants currently link out to ClinVar via VariationID. It sounds like the current request is to add functionality whereby ClinVar submissions can also be entered as evidence in the scoring table. As multiple sources can submit to ClinVar, it is probably best to designate the SCV as the ClinVar ID for the purposes of curation, since it is versioned and submitter-specific. Would you agree?
I found this example of a ClinVar entry where one submission (SCV001197996.1) is likely useful for curation while the second (the OMIM submission SCV001478329.1, which just cites a published paper) would be less useful, since it would be preferable to curate from the published primary source rather than from a secondary source. Is SCV001197996.1 the type of evidence you are hoping to be able to add? Please confirm if yes, and if not, please share any alternative examples or additional details you think would be instructive.
Will update if I hear back from them with additional examples.
Erin suggested looking through the Brain Gene Registry for examples of useful ClinVar submissions.
Here is one such example from this ClinVar submitter. I think a challenge will be that, even though this submitter uses a versioned SCV, the clinical data in question is entered as freetext in the comments field. Perhaps we just import the entire comment into the GCI? Example: SCV003931173.1 in https://www.ncbi.nlm.nih.gov/clinvar/variation/1679524/
Example from Jules of ClinVar entry with clinical data on multiple individuals: https://www.ncbi.nlm.nih.gov/clinvar/variation/9/I
Example of SCVs referenced in curation: https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_41052ba5-878d-4e64-8ef2-ed65887a1345-2023-02-23T070000.000Z?page=1&size=25&search=
I made a mock-up of how we might want to add these non-PMID sources to the GCI. I also included links to Ingrid's prior mock-ups so we don't lose those.
Presented initial mock-ups to the Gene Curation small group. They approved; comments added to the slides.
Hi, @gcheung-SF. The docs on ClinVar's API can be found here: https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/
Example VCV ID + version: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=vcv&id=VCV000014206.1
Monica put together a set of slides: https://docs.google.com/presentation/d/1yAZBlGuQjeTTdRa5qC-D3QHM_mwDR-8w/edit#slide=id.g23d7e619b28_1_0
From what I remember, we originally wanted to use just the SCV ID to look up ClinVar evidence. However, ClinVar doesn't allow you to query by just SCV. So we settled on querying by VCV.
Another complication is that ClinVar says they are changing their API this fall: https://github.com/ncbi/clinvar
My branch is kind of messy due to rebases. Also, I started work on the code health scripts in this PR, but then moved the work to a different branch. It would probably be easier to start a new branch. I've extracted the useful code below.
I wrote up a fetch function for getting the XML from ClinVar:
import logging

import requests

logger = logging.getLogger(__name__)


def fetch(vcv_id: str, vcv_version: str = "") -> str:
    """Get the VCV XML from the ClinVar website.

    Optionally specify a version of the VCV we want to fetch. The
    ClinVar API lets you specify versions. For example:
    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=vcv&id=VCV000014206.1
    (Notice the .1 at the end of the URL.)

    Args:
        vcv_id: Variation ClinVar record, e.g. VCV000014206.
        vcv_version: Version of the VCV we're interested in, e.g. 1.

    Returns:
        XML for the VCV ID, or an empty string if the request failed.
    """
    id_and_version = vcv_id if vcv_version == "" else vcv_id + "." + vcv_version
    # The requests library allows you to put your query parameters in a
    # dictionary like this rather than having to write them in the URL.
    payload = {"db": "clinvar", "rettype": "vcv", "id": id_and_version}
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    try:
        # ClinVar's server sometimes takes a while.
        res = requests.get(url, params=payload, timeout=20)
        return res.text
    except requests.exceptions.RequestException as err:
        logger.error(f"Error trying to fetch VCV with ID: {vcv_id}")
        logger.error(err)
        # Honor the declared return type: callers get "" on failure.
        return ""
I also wrote a module that has a function to parse the XML and get the info we need:
"""Parse the XML we get from ClinVar."""
import xml.etree.ElementTree as ET

from src.helpers.clinvar_evidence_helpers import get_id


def from_xml(xml: str, scv: str) -> dict:
    """Return info we need from the VCV XML.

    The info we need from the XML for the given SCV is:
    - submitter name
    - submission date
    - SCV version

    Args:
        xml: VCV XML we get from ClinVar.
        scv: Submitted record in ClinVar, e.g. SCV000035529.

    Returns:
        Info the client-side code needs for the VCV/SCV combination.
        All values are empty strings if the SCV is not found.
    """
    scv_id = get_id(scv)
    info = {"submitter_name": "", "date": "", "scv_version": ""}
    root = ET.fromstring(xml)
    clinical_assertion_list = root.find(
        "./VariationArchive/InterpretedRecord/ClinicalAssertionList"
    )
    # find() returns None when the path is absent; iterating None would raise.
    if clinical_assertion_list is None:
        return info
    for clinical_assertion in clinical_assertion_list:
        clinvar_accession_el = clinical_assertion.find("./ClinVarAccession")
        if clinvar_accession_el is None:
            continue
        clinvar_accession_scv = clinvar_accession_el.get("Accession")
        if clinvar_accession_scv == scv_id:
            info["submitter_name"] = clinvar_accession_el.get("SubmitterName")
            info["date"] = clinvar_accession_el.get(
                "DateUpdated"
            ) or clinvar_accession_el.get("DateCreated")
            info["scv_version"] = clinvar_accession_el.get("Version")
    return info
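In case it helps whoever picks this up: the lookup can be exercised without hitting ClinVar by using a small hand-written fixture. The XML below is made up and only contains the elements and attributes the parser reads; it is not a real ClinVar response, and the accessions and submitter names are placeholders. `find_scv_info` is a hypothetical standalone stand-in for the lookup so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# Made-up fixture: only the elements and attributes the lookup reads.
SAMPLE_XML = """
<ClinVarResult-Set>
  <VariationArchive>
    <InterpretedRecord>
      <ClinicalAssertionList>
        <ClinicalAssertion>
          <ClinVarAccession Accession="SCV000000001" Version="2"
                            SubmitterName="Example Lab"
                            DateCreated="2020-01-01" DateUpdated="2021-06-15"/>
        </ClinicalAssertion>
        <ClinicalAssertion>
          <ClinVarAccession Accession="SCV000000002" Version="1"
                            SubmitterName="Other Lab"
                            DateCreated="2019-03-04"/>
        </ClinicalAssertion>
      </ClinicalAssertionList>
    </InterpretedRecord>
  </VariationArchive>
</ClinVarResult-Set>
"""

ACCESSION_PATH = (
    "./VariationArchive/InterpretedRecord/ClinicalAssertionList"
    "/ClinicalAssertion/ClinVarAccession"
)


def find_scv_info(xml: str, scv_id: str) -> dict:
    """Look up one SCV (ID without version) in the VCV XML."""
    info = {"submitter_name": "", "date": "", "scv_version": ""}
    root = ET.fromstring(xml)
    for el in root.findall(ACCESSION_PATH):
        if el.get("Accession") == scv_id:
            info["submitter_name"] = el.get("SubmitterName", "")
            # Prefer the update date, fall back to the creation date.
            info["date"] = el.get("DateUpdated") or el.get("DateCreated", "")
            info["scv_version"] = el.get("Version", "")
            break
    return info
```

An SCV that is not in the XML comes back with all three values empty, which is what the route would presumably turn into a 404.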
Helper functions:
"""Define helper functions for getting evidence info from ClinVar."""


def get_id(vcv_or_scv: str) -> str:
    """Return the ID for the given VCV or SCV.

    Args:
        vcv_or_scv: VCV ID (or SCV ID) and version separated by a
            period, e.g. VCV000014206.1 (this is a VCV) or possibly
            just the ID.
    """
    return _get_id_or_version(vcv_or_scv, wants="id")


def get_version(vcv_or_scv: str) -> str:
    """Return the version for the given VCV or SCV.

    Args:
        vcv_or_scv: VCV ID (or SCV ID) and version separated by a
            period, e.g. VCV000014206.1 (this is a VCV) or possibly
            just the ID.
    """
    return _get_id_or_version(vcv_or_scv, wants="version")


def _get_id_or_version(vcv_or_scv: str, wants: str) -> str:
    """Return the ID or version number for the given VCV or SCV.

    Args:
        vcv_or_scv: VCV ID (or SCV ID) and version separated by a
            period, e.g. VCV000014206.1 (this is a VCV) or possibly
            just the ID.
        wants: What the caller wants, i.e. "id" or "version".
    """
    id_and_version = vcv_or_scv.rsplit(".", 1)
    if len(id_and_version) == 2:
        vcv_or_scv_id = id_and_version[0]
        vcv_or_scv_version = id_and_version[1]
    else:
        vcv_or_scv_id = vcv_or_scv
        vcv_or_scv_version = ""
    if wants == "id":
        return vcv_or_scv_id
    if wants == "version":
        return vcv_or_scv_version
    return ""
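If a new branch gets started, one possible simplification is a single function that returns both parts at once. This is only a sketch: `split_accession` is a hypothetical name that does not exist in the branch, and it assumes an accession contains at most one period.

```python
from typing import Tuple


def split_accession(accession: str) -> Tuple[str, str]:
    """Split 'VCV000014206.1' into ('VCV000014206', '1').

    The version is '' when the input has no '.<version>' suffix.
    """
    # partition() always returns three parts, so no length check is needed.
    accession_id, _, version = accession.strip().partition(".")
    return accession_id, version


vcv_id, vcv_version = split_accession("VCV000014206.1")
scv_id, scv_version = split_accession("SCV003931173")
```

Here `vcv_version` is `"1"` and `scv_version` is the empty string, which matches what `get_version` returns for the same inputs.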
YAML for route:
- http:
    path: /clinvar-evidence/{vcv}/{scv}
    method: get
    cors: true
    authorizer: aws_iam
    documentation:
      summary: "Retrieve info the client-side code needs for scoring ClinVar SCV evidence."
      description: >-
        Curators might want to use ClinVar SCV as their evidence
        instead of a PubMed article or PubMed book article. This
        route provides info related to a ClinVar SCV. We need a
        ClinVar VCV in order to get the SCV info, so there are two
        required parameters: (1) VCV and (2) SCV.
      methodResponses:
        -
          statusCode: "200"
          responseBody:
            description: "The SCV info for the given VCV/SCV combo."
        -
          statusCode: "404"
          responseBody:
            description: "Empty JSON object."
        -
          statusCode: "400"
          responseBody:
            description: "An error describing what went wrong during the request."
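For whoever wires this up, here is a rough sketch of how a handler could map outcomes onto the three documented status codes. Everything here is an assumption rather than code from the branch: `clinvar_evidence_response` and `_response` are hypothetical names, the fetch and parse steps are passed in as callables so the logic can be tested without network access, and the prefix check is only a stand-in for real validation.

```python
import json
from typing import Callable


def _response(status_code: int, body: dict) -> dict:
    """Shape a Lambda proxy style response with a JSON string body."""
    return {"statusCode": status_code, "body": json.dumps(body)}


def clinvar_evidence_response(
    vcv: str,
    scv: str,
    fetch_xml: Callable[[str, str], str],
    parse_xml: Callable[[str, str], dict],
) -> dict:
    """Build the response for /clinvar-evidence/{vcv}/{scv}."""
    if not vcv.upper().startswith("VCV") or not scv.upper().startswith("SCV"):
        return _response(400, {"error": "Expected a VCV and an SCV accession."})
    vcv_id, _, vcv_version = vcv.partition(".")
    try:
        xml = fetch_xml(vcv_id, vcv_version)
        info = parse_xml(xml, scv)
    except Exception as err:  # sketch only; real code should catch narrower errors
        return _response(400, {"error": str(err)})
    if not any(info.values()):
        # SCV not present in this VCV: documented as an empty JSON object.
        return _response(404, {})
    return _response(200, info)
```

Passing the fetch and parse functions in as arguments is just a convenience for testing; the real handler could call them directly.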
Updated mock-up following tools call feedback: https://docs.google.com/presentation/d/1JG1oPRt6IaEWeBDLdF3FoK1lvfl_bc5Q/edit#slide=id.g28c095220f5_0_3
Drafted tooltip text explaining each of the following identifiers, including URLs to primary source:
PMID: A PubMed ID (PMID) is an identifier for peer-reviewed publications catalogued in the NCBI PubMed database. To add a PMID to the GCI, enter the numerical portion only. For example, PMID:33242396 can be added by entering just 33242396.
VCV: 'VCV' refers to the accession calculated by ClinVar to aggregate information from all submitted records for classifications of the same variant, e.g. VCV001679524. The number after the period in a VCV is the version of that particular accession. For example, VCV001679524.3 refers to version 3 of VCV001679524. The VCV (including version number) can be found in the Variant Details section, subsection Identifiers, next to the ClinVar Variation ID.
If you submit a query to ClinVar based on a VCV accession, e.g. VCV001679524.3, you are directed to the page specific to that record. This page will also list all submissions related to that variant from different laboratories and research groups, each with their own identifier (SCV). To add a ClinVar variant entry to the GCI, first enter the full VCV including the version number (for example, VCV001679524.3, not just VCV001679524).
SCV: 'SCV' refers to the accession number assigned to a submitted record in ClinVar, e.g. SCV003931173.
If you submit a query to ClinVar based on that accession number, e.g. SCV003931173, you are directed automatically to the VCV page that includes that submitted record. The SCV accession number and version are displayed in one of the Submissions sections, either Submissions - Germline or Submissions - Somatic, as appropriate. To add a particular submitter's ClinVar variant record to the GCI, enter their full SCV, including the version number (for example, SCV003931173.1, not just SCV003931173).
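The input rules in the tooltip text (digits only for a PMID, full accession plus version for a VCV or SCV) could be checked with something like the sketch below. The patterns are my assumption based on the examples in this thread, not an official ClinVar or PubMed format definition, and `classify_identifier` is a hypothetical name.

```python
import re

# Assumed shapes, inferred from the examples above:
#   PMID      -> digits only, e.g. 33242396
#   VCV / SCV -> prefix, digits, period, version, e.g. VCV001679524.3
_PMID_RE = re.compile(r"\d+")
_ACCESSION_RE = re.compile(r"(VCV|SCV)\d+\.\d+")


def classify_identifier(text: str) -> str:
    """Return 'PMID', 'VCV', 'SCV', or '' if the input is not acceptable."""
    value = text.strip()
    if _PMID_RE.fullmatch(value):
        return "PMID"
    match = _ACCESSION_RE.fullmatch(value.upper())
    return match.group(1) if match else ""
```

With these patterns, `VCV001679524` without a version and `PMID:33242396` with the prefix are both rejected, which lines up with the guidance given to curators.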
A few things: Screenshot 1:
Screenshot 2
Screenshot 3
Tools call feedback:
@gcheung-SF just tested the text changes on the test site and they look great. Thanks for the quick turnaround! Let me know when you want me to test the Preview Evidence Scored page.
The GCWG would like the ability to add evidence and score on ClinVar entries for gene curations.
JIRA ticket: https://broadinstitute.atlassian.net/browse/CGSP-174