ClinGen / gene-and-variant-curation-tools

ClinGen's gene and variant curation interfaces (GCI & VCI). Developed by Stanford ClinGen team.
https://curation.clinicalgenome.org/
MIT License

support registration and scoring of ClinVar ID entries to gene curation #254

Open ikeseler opened 2 years ago

ikeseler commented 2 years ago

The GCWG would like the ability to add evidence and score on ClinVar entries for gene curations.

JIRA ticket: https://broadinstitute.atlassian.net/browse/CGSP-174

wrightmw commented 1 year ago

Aim: In the GCI, add the ability to add evidence and score ClinVar entries that are not otherwise described in a publication.

Once the SCV is collected, all scoring features associated with the GCI are permitted.

Scope: A curator can enter a ClinVar ID in place of the “regular” PMID, metadata will be retrieved from ClinVar, and evidence from that ClinVar entry can be entered and scored.

Value: Increase the usability of the GCI by allowing curators to add evidence from additional sources.

wrightmw commented 1 year ago

@ikeseler was this scoped?

bmpbowen commented 1 year ago

I followed up with Erin, Courtney and Marina to assess whether the proposed plan to use SCV IDs is sufficient. Sent:

In the GCI, Gene-Disease record variants currently link out to ClinVar via VariationID. It sounds like the current request is to also add functionality whereby ClinVar submissions can be entered as evidence in the scoring table. As multiple sources can submit to ClinVar, it is probably best to designate the SCV as the ClinVar ID for the purposes of curation, since it is versioned and submitter-specific. Would you agree?

I found this example of a ClinVar entry where one submission (SCV001197996.1) is likely useful for curation, while the second (the OMIM submission SCV001478329.1, which just cites a published paper) would be less useful, since it would be preferable to curate from the published primary source rather than a secondary source. Is SCV001197996.1 the type of evidence you are hoping to be able to add? Please confirm if yes, and if not, please share any alternative examples or additional details you think would be instructive.

Will update if I hear back from them with additional examples.

bmpbowen commented 1 year ago

Erin suggested looking through the Brain Gene Registry for examples of useful ClinVar submissions.

Here is one such example from this ClinVar submitter. I think a challenge will be that, even though this submitter uses a versioned SCV, the clinical data in question is entered as freetext in the comments field. Perhaps we just import the entire comment into the GCI? Example: SCV003931173.1 in https://www.ncbi.nlm.nih.gov/clinvar/variation/1679524/

bmpbowen commented 1 year ago

Example from Jules of ClinVar entry with clinical data on multiple individuals: https://www.ncbi.nlm.nih.gov/clinvar/variation/9/I

bmpbowen commented 1 year ago

Example of SCVs referenced in curation: https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_41052ba5-878d-4e64-8ef2-ed65887a1345-2023-02-23T070000.000Z?page=1&size=25&search=

bmpbowen commented 1 year ago

I made a mock-up of how we might want to add these non-PMID sources to the GCI. I also included links to Ingrid's prior mock-ups so we don't lose those.

bmpbowen commented 1 year ago

Presented the initial mock-ups to the Gene Curation small group. They approved, and I added their comments to the slides.

liammulh commented 1 year ago

Hi, @gcheung-SF. The docs on ClinVar's API can be found here: https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/

Example VCV ID + version: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=vcv&id=VCV000014206.1

Monica put together a set of slides: https://docs.google.com/presentation/d/1yAZBlGuQjeTTdRa5qC-D3QHM_mwDR-8w/edit#slide=id.g23d7e619b28_1_0

From what I remember, we originally wanted to use just the SCV ID to look up ClinVar evidence. However, ClinVar doesn't allow you to query by just SCV. So we settled on querying by VCV.

Another complication is that ClinVar says they are changing their API this fall: https://github.com/ncbi/clinvar

My branch is kind of messy due to rebases. Also, I started work on the code health scripts in this PR, but then moved the work to a different branch. It would probably be easier to start a new branch. I've extracted the useful code below.

I wrote up a fetch function for getting the XML from ClinVar:

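import logging
from typing import Optional

import requests

logger = logging.getLogger(__name__)


def fetch(vcv_id: str, vcv_version: str = "") -> Optional[str]: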
    """Get the VCV XML from the ClinVar website.

    Optionally specify a version of the VCV we want to fetch. The
    ClinVar API lets you specify versions. For example:
    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=vcv&id=VCV000014206.1
    (Notice the .1 at the end of the URL.)

    Args:
        vcv_id: Variation ClinVar record, e.g. VCV000014206.
        vcv_version: Version of the VCV we're interested in, e.g. 1.

    Returns:
        XML for the VCV ID.
    """

    id_and_version = vcv_id if vcv_version == "" else vcv_id + "." + vcv_version

    # The requests library allows you to put your query parameters in a
    # dictionary like this rather than having to write them in the URL.
    payload = {"db": "clinvar", "rettype": "vcv", "id": id_and_version}
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
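    try:
        # ClinVar's server sometimes takes a while.
        res = requests.get(url, params=payload, timeout=20)
        # Treat HTTP error statuses as failures rather than returning an error page.
        res.raise_for_status()
        return res.text
    except requests.exceptions.RequestException as err:
        logger.error("Error trying to fetch VCV with ID %s: %s", id_and_version, err)
        # Signal failure explicitly; callers must handle None.
        return None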

I also wrote a module that has a function to parse the XML and get the info we need:

"""Parse the XML we get from ClinVar."""

import xml.etree.ElementTree as ET

from src.helpers.clinvar_evidence_helpers import get_id

def from_xml(xml: str, scv: str) -> dict:
    """Return info we need from the VCV XML.

    The info associated with the SCV we need from the XML are:
        - submitter name
        - submission date
        - SCV version

    Args:
        xml: VCV XML we get from ClinVar.
        scv: Submitted record in ClinVar, e.g. SCV000035529.

    Returns:
        Info the client-side code needs for the VCV/SCV combination.
    """

    scv_id = get_id(scv)
    info = {"submitter_name": "", "date": "", "scv_version": ""}
    root = ET.fromstring(xml)
    clinical_assertion_list = root.find(
        "./VariationArchive/InterpretedRecord/ClinicalAssertionList"
    )
    # find() returns None when the path is absent, so return the empty
    # info instead of failing when we try to iterate.
    if clinical_assertion_list is None:
        return info
    for clinical_assertion in clinical_assertion_list:
        clinvar_accession_el = clinical_assertion.find("./ClinVarAccession")
        if clinvar_accession_el is None:
            continue
        if clinvar_accession_el.get("Accession") == scv_id:
            info["submitter_name"] = clinvar_accession_el.get("SubmitterName", "")
            # Prefer the last-updated date; fall back to the creation date.
            info["date"] = clinvar_accession_el.get(
                "DateUpdated"
            ) or clinvar_accession_el.get("DateCreated", "")
            info["scv_version"] = clinvar_accession_el.get("Version", "")
            break
    return info
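To make the expected XML shape concrete, here is a minimal sketch that runs the same kind of ElementTree lookup against a hand-written fragment. The fragment is hypothetical and heavily trimmed (a real VCV response carries many more elements and attributes), and `find_scv` is an illustrative name, not something in the branch. Only the element path and the attribute names are taken from the function above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a VCV response; not real ClinVar data.
SAMPLE_XML = """
<ClinVarResult-Set>
  <VariationArchive Accession="VCV000000001" Version="1">
    <InterpretedRecord>
      <ClinicalAssertionList>
        <ClinicalAssertion>
          <ClinVarAccession Accession="SCV000000001" Version="2"
                            SubmitterName="Example Lab"
                            DateCreated="2020-01-01" DateUpdated="2021-06-15"/>
        </ClinicalAssertion>
        <ClinicalAssertion>
          <ClinVarAccession Accession="SCV000000002" Version="1"
                            SubmitterName="Other Lab"
                            DateCreated="2019-03-02"/>
        </ClinicalAssertion>
      </ClinicalAssertionList>
    </InterpretedRecord>
  </VariationArchive>
</ClinVarResult-Set>
"""


def find_scv(xml: str, scv_id: str) -> dict:
    """Return a few attributes of the matching SCV, or {} if it is absent."""
    root = ET.fromstring(xml)
    assertions = root.find(
        "./VariationArchive/InterpretedRecord/ClinicalAssertionList"
    )
    if assertions is None:
        return {}
    for assertion in assertions:
        accession = assertion.find("./ClinVarAccession")
        if accession is not None and accession.get("Accession") == scv_id:
            return {
                "submitter_name": accession.get("SubmitterName", ""),
                # Fall back to the creation date when there is no update date.
                "date": accession.get("DateUpdated")
                or accession.get("DateCreated", ""),
                "scv_version": accession.get("Version", ""),
            }
    return {}


print(find_scv(SAMPLE_XML, "SCV000000001"))
print(find_scv(SAMPLE_XML, "SCV000000002"))
```

The second lookup exercises the `DateCreated` fallback, since that hypothetical submission has no `DateUpdated` attribute.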

Helper functions:

"""Define helper functions for getting evidence info from ClinVar."""

def get_id(vcv_or_scv: str) -> str:
    """Returns the ID for the given VCV or SCV.

    Args:
        vcv_or_scv: VCV ID (or SCV ID) and version separated by a
            period, e.g. VCV000014206.1 (this is a VCV) or possibly
            just the ID.
    """
    return _get_id_or_version(vcv_or_scv, wants="id")

def get_version(vcv_or_scv: str) -> str:
    """Return the version for the given VCV or SCV.

    Args:
        vcv_or_scv: VCV ID (or SCV ID) and version separated by a
            period, e.g. VCV000014206.1 (this is a VCV) or possibly
            just the ID.
    """
    return _get_id_or_version(vcv_or_scv, wants="version")

def _get_id_or_version(vcv_or_scv: str, wants: str) -> str:
    """Return the ID or version number for the given VCV or SCV.

    Args:
        vcv_or_scv: VCV ID (or SCV ID) and version separated by a
            period, e.g. VCV000014206.1 (this is a VCV) or possibly
            just the ID.
        wants: What the caller wants, i.e. ID or version.
    """
    id_and_version = vcv_or_scv.rsplit(".", 1)
    if len(id_and_version) == 2:
        vcv_or_scv_id = id_and_version[0]
        vcv_or_scv_version = id_and_version[1]
    else:
        vcv_or_scv_id = vcv_or_scv
        vcv_or_scv_version = ""
    if wants == "id":
        return vcv_or_scv_id
    if wants == "version":
        return vcv_or_scv_version
    return ""

YAML for route:

    - http:
        path: /clinvar-evidence/{vcv}/{scv}
        method: get
        cors: true
        authorizer: aws_iam
        documentation:
          summary: "Retrieve info the client-side code needs for scoring ClinVar SCV evidence."
          description: >-
            Curators might want to use a ClinVar SCV as their evidence
            instead of a PubMed article or PubMed book article. This
            route provides info related to a ClinVar SCV. We need a
            ClinVar VCV in order to get the SCV info, so there are two
            required parameters: (1) VCV and (2) SCV.
          methodResponses:
            -
              statusCode: "200"
              responseBody:
                description: "The SCV info for the given VCV/SCV combo."
            -
              statusCode: "404"
              responseBody:
                description: "Empty JSON object."
            -
              statusCode: "400"
              responseBody:
                description: "An error describing what went wrong during the request."
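For whoever picks this up, here is a rough sketch of how a handler behind this route might map outcomes onto the three documented status codes. This is not code from my branch. The function names, the `{"statusCode", "body"}` response shape (the usual Lambda proxy style), the single-argument fetcher, and the choice to report fetch failures as 400 are all assumptions. The fetcher and parser are passed in so the logic can be exercised without calling ClinVar.

```python
import json
from typing import Callable, Optional


def _response(status_code: int, body: dict) -> dict:
    """Assumed Lambda-proxy-style response: status code plus JSON body."""
    return {"statusCode": status_code, "body": json.dumps(body)}


def handle_clinvar_evidence(
    vcv: str,
    scv: str,
    fetch_xml: Callable[[str], Optional[str]],
    parse_info: Callable[[str, str], dict],
) -> dict:
    """Sketch of GET /clinvar-evidence/{vcv}/{scv}.

    fetch_xml and parse_info are hypothetical stand-ins for the fetch
    and from_xml functions above.
    """
    # 400: the path parameters do not look like a VCV and an SCV.
    if not vcv.upper().startswith("VCV") or not scv.upper().startswith("SCV"):
        return _response(400, {"error": "Expected a VCV ID and an SCV ID."})

    # 400: the upstream request failed (the fetcher returned None or "").
    xml = fetch_xml(vcv)
    if not xml:
        return _response(400, {"error": f"Could not fetch {vcv} from ClinVar."})

    # 404: the VCV was fetched but the SCV was not found in it, which the
    # parser signals by leaving every field empty.
    info = parse_info(xml, scv)
    if not any(info.values()):
        return _response(404, {})

    return _response(200, info)


# Usage with stubbed dependencies (no network access needed).
stub_fetch = lambda vcv: "<xml/>"
stub_parse = lambda xml, scv: {
    "submitter_name": "Example Lab",
    "date": "2021-06-15",
    "scv_version": "1",
}
print(handle_clinvar_evidence("VCV000014206.1", "SCV000035529.1", stub_fetch, stub_parse))
```

Injecting the two callables keeps the status-code logic testable on its own, which may help given that the ClinVar API is expected to change.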

bmpbowen commented 12 months ago

Updated mock-up following tools call feedback: https://docs.google.com/presentation/d/1JG1oPRt6IaEWeBDLdF3FoK1lvfl_bc5Q/edit#slide=id.g28c095220f5_0_3

bmpbowen commented 8 months ago

Drafted tooltip text explaining each of the following identifiers, including URLs to primary source:

PMID: PubMed ID (PMID) is an identifier for peer-reviewed publications catalogued in the NCBI PubMed database. To add a PMID to the GCI, enter the numerical portion only. For example, PMID:33242396 can be added by entering just 33242396.

VCV: 'VCV' refers to the accession calculated by ClinVar to aggregate information from all submitted records for classifications of the same variant, e.g. VCV001679524. The number after the period in a VCV is the version of that particular accession. For example, VCV001679524.3 refers to version 3 of VCV001679524. The VCV (including version number) can be found in the Variant Details section, subsection Identifiers, next to the ClinVar Variation ID.

If you submit a query to ClinVar based on a VCV accession, e.g. VCV001679524.3, you are directed to the page specific to that record. This page will also list all submissions related to that variant from different laboratories and research groups, each with their own identifier (SCV). To add a ClinVar variant entry to the GCI, first enter the full VCV including the version number (for example, VCV001679524.3, not just VCV001679524).

SCV: 'SCV' refers to the accession number assigned to a submitted record in ClinVar, e.g. SCV003931173.

If you submit a query to ClinVar based on that accession number, e.g. SCV003931173, you are directed automatically to the VCV page that includes that submitted record. The SCV accession number and version are displayed in one of the Submissions sections, either Submissions - Germline or Submissions - Somatic, as appropriate. To add a particular submitter's ClinVar variant record to the GCI, enter their full SCV, including the version number (for example, SCV003931173.1, not just SCV003931173).
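The input formats described in these tooltips could be checked with something like the sketch below. It is illustrative only and not the GCI's actual validation. The function name is made up, and the patterns assume that any number of ASCII digits is acceptable after the VCV or SCV prefix.

```python
import re

# Assumed patterns: digits only for a PMID; a prefix, digits, a period,
# and a version number for ClinVar accessions.
PMID_PATTERN = re.compile(r"[0-9]+")
VERSIONED_VCV_PATTERN = re.compile(r"VCV[0-9]+\.[0-9]+")
VERSIONED_SCV_PATTERN = re.compile(r"SCV[0-9]+\.[0-9]+")


def classify_reference(text: str) -> str:
    """Return "pmid", "vcv", "scv", or "invalid" for a curator-entered ID."""
    value = text.strip()
    # fullmatch() requires the whole string to match, so "PMID:33242396"
    # and unversioned accessions such as "VCV001679524" are rejected.
    if PMID_PATTERN.fullmatch(value):
        return "pmid"
    if VERSIONED_VCV_PATTERN.fullmatch(value):
        return "vcv"
    if VERSIONED_SCV_PATTERN.fullmatch(value):
        return "scv"
    return "invalid"


for example in ["33242396", "VCV001679524.3", "VCV001679524", "SCV003931173.1"]:
    print(example, "->", classify_reference(example))
```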

bmpbowen commented 8 months ago

A few things:

Screenshot 1

image
  1. For the line "Enter a Reference: (PMID or VCV)", the colon should come after the parentheses, not before, i.e. "Enter a Reference (PMID or VCV ID):"
  2. Can we update the error message to include guidance for non-PMID sources? For example, in this screenshot, an SCV is mistakenly entered first instead of a VCV, but the error is about PMIDs. Can this instead say "PMIDs should only contain numbers. Versioned VCV IDs (VCV#####.#) must be provided for ClinVar references"?

Screenshot 2

image
  1. Here, a VCV error appears when a user tries to search for a VCV without a version number. Can we edit the error message to instead say "ClinVar reference should be a versioned VCV ID (VCV#####.#)"?

Screenshot 3

image
  1. The grey text uses different spacing after the colons.
  2. The grey text suggests that "PMID:" and "ClinVar VCV ID:" be included in the search term for the reference, but that would result in errors (see screenshot 4 below). Either the grey text should reflect the supported search format, or the search should be updated to allow for the inclusion of these names in the search box.

Screenshot 4

image

Tools call feedback:

  1. Sharon - Can we automatically add variants from the ClinVar ID already? To avoid people accidentally entering the wrong one. Or, Erin's suggestion: if a variant from a ClinVar SCV is being scored, check that variant ID added under individual matches the variation name in the ClinVar accession
  2. Erin - On the Preview Evidence Scored section, can we make it more obvious in the preview what is a PMID and what is a ClinVar accession? Perhaps put "PMID" in bold in front of the PMID, and "ClinVar" in bold in front of the SCV?
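
One way Erin's matching check in item 1 might work is sketched below. It assumes that the numeric part of a VCV accession is the zero-padded ClinVar Variation ID (for example, VCV001679524 and variation/1679524 from earlier in this thread), and that the variant added under the individual carries its ClinVar Variation ID. Both function names are hypothetical.

```python
def variation_id_from_vcv(vcv: str) -> int:
    """Extract the numeric Variation ID from a VCV, ignoring any version.

    Assumes the digits after "VCV" are the zero-padded Variation ID.
    """
    accession = vcv.strip().upper().split(".", 1)[0]
    digits = accession[3:]
    if not accession.startswith("VCV") or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"Not a VCV accession: {vcv!r}")
    return int(digits)


def variant_matches_vcv(clinvar_variation_id: str, vcv: str) -> bool:
    """Check that a variant's Variation ID agrees with the reference VCV."""
    return int(clinvar_variation_id) == variation_id_from_vcv(vcv)


# The variant at .../clinvar/variation/1679524/ against its versioned VCV.
print(variant_matches_vcv("1679524", "VCV001679524.3"))
# A different variant scored against the same ClinVar reference.
print(variant_matches_vcv("14206", "VCV001679524.3"))
```

A mismatch could then be surfaced as a warning rather than a hard error, depending on what the curators prefer.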

bmpbowen commented 8 months ago

@gcheung-SF just tested the text changes on the test site and they look great. Thanks for the quick turnaround! Let me know when you want me to test the Preview Evidence Scored page.