EBIvariation / CMAT

ClinVar Mapping and Annotation Toolkit
Apache License 2.0

Investigate how PubMed references are being processed #166

Closed tskir closed 3 years ago

tskir commented 3 years ago

Reported by @AsierGonzalez via Slack


Hi Kirill, I’d like to ask you where the literature references you include in the evidence strings under evidence.variant2disease.provenance_type.literature and literature come from. I would assume that they are extracted from the ClinVar XML but we have found a case (RCV000018694) where the ClinVar website and dbSNP list one publication but there are four in the evidence string (see the OT website).

{
  "references": [
    {
      "lit_id": "http://europepmc.org/abstract/MED/17886299"
    },
    {
      "lit_id": "http://europepmc.org/abstract/MED/20301468"
    },
    {
      "lit_id": "http://europepmc.org/abstract/MED/20301676"
    },
    {
      "lit_id": "http://europepmc.org/abstract/MED/21078917"
    }
  ]
}

As a side note, in the future we should get rid of the .literature field, as it’s a duplicate of evidence.variant2disease.provenance_type.literature which we don’t use and which just adds to the file size.

tskir commented 3 years ago

Hi @ireneisdoomed, @DSuveges, I've looked into this question, which was originally asked by Asier. As I mentioned before, ClinVar stores three types of literature references: disease specific, variant specific, and evidence support (“observed in”). I have investigated each of these types separately.

Evidence support (“observed in”) references

In the example Asier provided, the one reference displayed on the ClinVar website is the evidence support publication: 17886299 “Molecular consequences of dominant Bethlem myopathy collagen VI mutations”. This paper is about an observation of this specific variant (among others) in a specific disease.

Disease specific references

The three other references in that record, which are not displayed on the website but are stored in the XML, are disease specific. They are either reviews which summarise the knowledge on the disease, or clinical practice guidelines. In this example, the three publications are:

| PubMed ID | Type | Title |
|---|---|---|
| 20301676 | Review | Collagen Type VI-Related Disorders |
| 21078917 | Practice guideline | Consensus statement on standard of care for congenital muscular dystrophies |
| 20301468 | Review | Congenital Muscular Dystrophy Overview |

Variant specific references

This record contains no references of this type, but I collected several examples from other records. These also appear to be large-scale reviews and recommendations, but they focus on genetics rather than on disease classes:

| PubMed ID | Title |
|---|---|
| 21042222 | Recommendations from the EGAPP Working Group: genomic profiling to assess cardiovascular risk to improve cardiovascular health |
| 27841880 | The genomic landscape of balanced cytogenetic abnormalities associated with human congenital anomalies |
| 29805044 | Risks and Recommendations in Prenatally Detected De Novo Balanced Chromosomal Rearrangements from Assessment of Long-Term Outcomes |
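To make the three reference types concrete, here is a rough Python sketch that groups PubMed IDs by the section of a record they appear in. The sample XML is a simplified, hypothetical stand-in, and the element names (`ObservedIn`, `MeasureSet`, `TraitSet`, `Citation`, `ID`) are assumptions modelled on the ClinVar RCV XML layout rather than a verified schema; this is not the code used in the pipeline.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical record. Element names are assumptions modelled
# on the ClinVar RCV XML layout; the real records are much larger.
SAMPLE = """
<ReferenceClinVarAssertion>
  <ObservedIn>
    <ObservedData>
      <Citation><ID Source="PubMed">17886299</ID></Citation>
    </ObservedData>
  </ObservedIn>
  <MeasureSet>
    <Measure/>
  </MeasureSet>
  <TraitSet>
    <Trait>
      <Citation><ID Source="PubMed">20301676</ID></Citation>
      <Citation><ID Source="PubMed">21078917</ID></Citation>
      <Citation><ID Source="PubMed">20301468</ID></Citation>
    </Trait>
  </TraitSet>
</ReferenceClinVarAssertion>
"""

# Assumed mapping from reference type to the XML section that holds it.
SECTIONS = {
    "evidence_support": "ObservedIn",
    "variant_specific": "MeasureSet",
    "disease_specific": "TraitSet",
}


def pubmed_ids_by_type(xml_text):
    """Return {reference type: [PubMed IDs]} for a single record."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for ref_type, tag in SECTIONS.items():
        ids = []
        for section in root.iter(tag):
            # Collect every PubMed ID nested anywhere inside this section.
            for id_elem in section.iter("ID"):
                if id_elem.get("Source") == "PubMed" and id_elem.text:
                    ids.append(id_elem.text.strip())
        result[ref_type] = ids
    return result


refs = pubmed_ids_by_type(SAMPLE)
for ref_type, ids in refs.items():
    print(ref_type, ids)
```

For the sample record, only the evidence support section and the disease specific section contain PubMed IDs, which mirrors the RCV000018694 case discussed here: one "observed in" publication and three disease-level ones.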

Operation of pipeline v2.0.0+

Our pipeline includes only evidence support (“observed in”) literature references in the evidence strings. If you would like to see this changed, please let me know.
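As an illustration of that behaviour, the following Python sketch builds a `references` block in the same shape as the JSON quoted earlier in this thread, keeping only one reference type. The function name `build_references`, the `included_types` parameter, and the dictionary keys are hypothetical and are not taken from the pipeline code; the Europe PMC URL prefix is copied from the evidence string example.

```python
import json

# URL prefix as it appears in the evidence string quoted in this issue.
EUROPEPMC_PREFIX = "http://europepmc.org/abstract/MED/"


def build_references(ids_by_type, included_types=("evidence_support",)):
    """Build a 'references' block from the selected reference types only.

    Hypothetical helper: ids_by_type maps a reference type name to a list
    of PubMed IDs. Duplicates are dropped while preserving order.
    """
    seen = set()
    references = []
    for ref_type in included_types:
        for pmid in ids_by_type.get(ref_type, []):
            if pmid not in seen:
                seen.add(pmid)
                references.append({"lit_id": EUROPEPMC_PREFIX + pmid})
    return {"references": references}


# PubMed IDs from the RCV000018694 example, grouped by reference type.
example = {
    "evidence_support": ["17886299"],
    "disease_specific": ["20301676", "21078917", "20301468"],
    "variant_specific": [],
}

result = build_references(example)
print(json.dumps(result, indent=2))
```

With the default `included_types`, only 17886299 ends up in the output. Passing a wider tuple, for example `("evidence_support", "disease_specific")`, would yield all four references that the original question asked about.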