RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Should we ignore the nodes that are deprecated in kg2c build process? #1424

Open chunyuma opened 3 years ago

chunyuma commented 3 years ago

In KG2.6.1c (http://kg2canonicalized.rtx.ai:7474/browser/), which @amykglen just built, I found that nodes labelled as deprecated in KG2.6.1 are still present. There are a total of 42,431 deprecated nodes in KG2.6.1. Should we ignore these nodes in the KG2c build process?

match (n) where n.deprecated='True' return count(distinct n.id)
count(distinct n.id)
--
42431

Here is one example. In KG2.6.1, GO:0075020 is labelled as deprecated:

{
  "iri": "http://purl.obolibrary.org/obo/GO_0075020",
  "synonym": [
    "Ca++ or calmodulin-mediated activation of appressorium formation",
    "Ca2+ or calmodulin-mediated activation of appressorium formation"
  ],
  "category_label": "biological_process",
  "deprecated": "True",
  "name": "obsolete calcium or calmodulin-mediated activation of appressorium formation",
  "description": "Any process that modulates the frequency, rate or extent of symbiont calcium or calmodulin-mediated signal transduction during appressorium formation on or near its host organism. The host is defined as the larger of the organisms involved in a symbiotic interaction. [GOC:pamgo_curators]; OBSOLETE. Any process that modulates the frequency, rate or extent of symbiont calcium or calmodulin-mediated signal transduction during appressorium formation on or near its host organism. The host is defined as the larger of the organisms involved in a symbiotic interaction. // COMMENTS: This term was obsoleted because it represents a GO-CAM model.; UMLS Semantic Type: UMLS_STY:T038",
  "provided_by": "umls_source:GO",
  "id": "GO:0075020",
  "category": "biolink:BiologicalProcess",
  "update_date": "20210201"
}

But it still exists in KG2.6.1c:

{
  "iri": "http://purl.obolibrary.org/obo/GO_0075020",
  "expanded_categories": [
    "biolink:BiologicalEntity",
    "biolink:BiologicalProcess",
    "biolink:BiologicalProcessOrActivity",
    "biolink:NamedThing"
  ],
  "name": "obsolete calcium or calmodulin-mediated activation of appressorium formation",
  "description": "Any process that modulates the frequency, rate or extent of symbiont calcium or calmodulin-mediated signal transduction during appressorium formation on or near its host organism. The host is defined as the larger of the organisms involved in a symbiotic interaction. [GOC:pamgo_curators]; OBSOLETE. Any process that modulates the frequency, rate or extent of symbiont calcium or calmodulin-mediated signal transduction during appressorium formation on or near its host organism. The host is defined as the larger of the organisms involved in a symbiotic interaction. // COMMENTS: This term was obsoleted because it represents a GO-CAM model.; UMLS Semantic Type: UMLS_STY:T038",
  "equivalent_curies": [
    "GO:0075020"
  ],
  "id": "GO:0075020",
  "category": "biolink:BiologicalProcess",
  "all_names": [
    "obsolete calcium or calmodulin-mediated activation of appressorium formation"
  ],
  "all_categories": [
    "biolink:BiologicalProcess"

amykglen commented 3 years ago

hmm.. interesting idea. I think I'm in favor of keeping them in KG2c - even though they're deprecated, there are still edges that use them. so we would lose those edges if we didn't include them.

looks like there are about 130,000 edges in KG2.6.1 that use a deprecated node:

match (n)-[e]-() where n.deprecated='True' return count(distinct e)

returns 131,303
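
To make the trade-off concrete, here is a minimal Python sketch of what dropping deprecated nodes would do to the edge set. It is not the actual KG2c build code: the function name `drop_deprecated`, the `subject`/`object` edge keys, and the toy identifiers are all assumptions made for illustration. The only detail taken from the thread is that the flag is stored as the string "True".

```python
def drop_deprecated(nodes, edges):
    """Split a graph into what survives and what is lost if deprecated nodes are removed."""
    # The flag is compared as the string "True", matching the node JSON in this thread.
    deprecated_ids = {n["id"] for n in nodes if n.get("deprecated") == "True"}
    kept_nodes = [n for n in nodes if n["id"] not in deprecated_ids]
    kept_edges, lost_edges = [], []
    for edge in edges:
        # An edge is lost if either of its endpoints is deprecated.
        if edge["subject"] in deprecated_ids or edge["object"] in deprecated_ids:
            lost_edges.append(edge)
        else:
            kept_edges.append(edge)
    return kept_nodes, kept_edges, lost_edges


# Made-up toy graph (placeholder identifiers, not real KG2 content).
toy_nodes = [
    {"id": "EX:1", "deprecated": "True"},
    {"id": "EX:2", "deprecated": "False"},
    {"id": "EX:3"},
]
toy_edges = [
    {"subject": "EX:1", "object": "EX:2"},
    {"subject": "EX:2", "object": "EX:3"},
]
kept_nodes, kept_edges, lost_edges = drop_deprecated(toy_nodes, toy_edges)
print(len(kept_nodes), len(kept_edges), len(lost_edges))
```

Returning the lost edges separately, rather than silently discarding them, would make it easy to report how many edges a build drops.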

chunyuma commented 3 years ago

@amykglen, although some edges are connected to them in KG2.6.1, I found that about half of the deprecated nodes (21,671 of 42,431) have neither a name nor a description, so I doubt the reliability of these edges.

match (n) where n.deprecated='True' and n.name is NULL and n.description is NULL return count(distinct n.id)

count(distinct n.id)
--
21671

Sample of the matching nodes:

n.id           n.deprecated  n.name  n.description
"CHEBI:26169"  "True"        null    null
"CHEBI:26165"  "True"        null    null
"CHEBI:26166"  "True"        null    null
"CHEBI:26168"  "True"        null    null
"CHEBI:26161"  "True"        null    null
"CHEBI:26162"  "True"        null    null
"CHEBI:26163"  "True"        null    null
"CHEBI:26164"  "True"        null    null
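
The same check can be expressed outside Neo4j. The sketch below separates deprecated nodes that still carry a name or description from those that carry neither, which is the distinction the query above draws. The helper names are hypothetical, the node dicts are assumed to mirror the properties shown in this thread, and nothing here reflects how the KG2c build actually handles these nodes.

```python
def lacks_name_and_description(node):
    """True when both 'name' and 'description' are missing or null."""
    return node.get("name") is None and node.get("description") is None


def partition_deprecated(nodes):
    """Return (informative, empty) lists of deprecated nodes; other nodes are skipped."""
    informative, empty = [], []
    for node in nodes:
        if node.get("deprecated") != "True":
            continue  # not deprecated, so not part of either group
        if lacks_name_and_description(node):
            empty.append(node)
        else:
            informative.append(node)
    return informative, empty


# Two nodes from this thread; the description text is abbreviated here.
sample = [
    {"id": "CHEBI:26169", "deprecated": "True", "name": None, "description": None},
    {
        "id": "GO:0075020",
        "deprecated": "True",
        "name": "obsolete calcium or calmodulin-mediated activation of appressorium formation",
        "description": "(abbreviated)",
    },
]
informative, empty = partition_deprecated(sample)
print([n["id"] for n in informative], [n["id"] for n in empty])
```
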
finnagin commented 2 years ago

@amykglen is this still relevant?