NCATSTranslator / reasoner-validator

Validation of Translator OpenAPI (TRAPI) messages both to TRAPI and Biolink Model standards. See https://ncatstranslator.github.io/reasoner-validator/

Can abstract class category error be made into a warning instead? #88

Closed ecwood closed 1 year ago

ecwood commented 1 year ago

Expander Agent ran into a problem because some of the nodes in RTX-KG2 are categorized with an abstract class (see https://github.com/RTXteam/RTX/issues/2046 and https://github.com/RTXteam/RTX-KG2/issues/286 for more information). While we would love to change these, in many cases there is no clear alternative. Unfortunately, this raises an error in the validator. It would be really helpful if the validator raised a warning instead of an error when this occurs.

Does this seem reasonable @sierra-moxon? I am also tagging @saramsey and @acevedol since they are also working on this issue for RTX-KG2.

sierra-moxon commented 1 year ago

Yes I agree; at least for September, these should be the "orange" level. What do you think @RichardBruskiewich?

edeutsch commented 1 year ago

Note that "orange" on the ARAX UI (if that's what you meant?) is still ERROR. The "red X" is the new critical error class, like JSON schema failures and other really bad things.

I think @ecwood proposes that it be a WARNING (which affords a green checkmark if there are no ERRORs)

I support this.

Also, do we want this feature for ALL abstract classes or just BiologicalEntity??
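
The severity levels described in this comment can be sketched as follows. This is only an illustration, not ARAX UI or reasoner-validator code: the names `Severity` and `ui_status` are hypothetical, and it assumes the worst severity present decides what is displayed.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical severity levels, named after the ones discussed in this thread."""
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


def ui_status(severities):
    """Map the severities reported for a result to a display status.

    Assumption: the worst severity present decides the status, and
    warnings alone still earn a green checkmark.
    """
    worst = max((s.value for s in severities), default=0)
    if worst >= Severity.CRITICAL.value:
        return "red X"
    if worst >= Severity.ERROR.value:
        return "orange"
    return "green checkmark"


# A result carrying only warnings still passes.
print(ui_status([Severity.WARNING]))
```

Under this reading, demoting the abstract-category message from ERROR to WARNING is what changes the displayed status from orange to a green checkmark.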

ecwood commented 1 year ago

Also, do we want this feature for ALL abstract classes or just BiologicalEntity??

As far as I can tell, per https://github.com/RTXteam/RTX-KG2/issues/286#issuecomment-1622135791, there are only three abstract classes (at least that RTX-KG2 uses as node categories): biolink:BiologicalEntity, biolink:InformationContentEntity, and biolink:OrganismalEntity.

I am not sure how relevant those two other classes are.

I think @ecwood proposes that it be a WARNING (which affords a green checkmark if there are no ERRORs)

Yes, this is what I mean.

sierra-moxon commented 1 year ago

My misunderstanding on the color scheme (to me, green == go == pass), but I think it's OK to leave this as a non-failure for September, because it does not lead to a triple that is blatantly false. Seeing biolink:BiologicalEntity might make someone pause because it is so broad (and I think on Friday we will discuss how to handle categories that are too broad in TAQA, as led by Jason Flannick), but my intuition is that the user will be able to justify the broader category without losing faith in the results, and I think the issues around SEMMEDDB are more pressing.

edeutsch commented 1 year ago

great, thanks! @RichardBruskiewich has agreed to work on it!

RichardBruskiewich commented 1 year ago

resolved by release 3.7.0

RichardBruskiewich commented 1 year ago

Also, do we want this feature for ALL abstract classes or just BiologicalEntity??

As far as I can tell, per RTXteam/RTX-KG2#286 (comment), there are only three abstract classes (at least that RTX-KG2 uses as node categories): biolink:BiologicalEntity, biolink:InformationContentEntity, and biolink:OrganismalEntity.

I am not sure how relevant those two other classes are.

I think @ecwood proposes that it be a WARNING (which affords a green checkmark if there are no ERRORs)

Yes, this is what I mean.

@edeutsch and @sierra-moxon, should I also add biolink:InformationContentEntity and biolink:OrganismalEntity to the list of exceptions? I guess that could be a 3.7.1 patch release?

edeutsch commented 1 year ago

I don't think we got clarity on these other classes. I think we only have clarity for BiologicalEntity

sierra-moxon commented 1 year ago

Can you give me some examples of nodes tagged with biolink:OrganismalEntity? We have both biolink:IndividualOrganism and biolink:PopulationOfIndividualOrganisms in the model that are child classes of biolink:OrganismalEntity and are not abstract. Do either of those work?

I picked a few key children of biolink:InformationContentEntity that might work as replacements below. Do you have examples of the nodes and their sources where you're using biolink:InformationContentEntity that wouldn't fit in one of these? (note: I didn't pull all the descendants of biolink:InformationContentEntity, and the items below are snippets of their full class definitions -- e.g. without mappings) :

  study variable:
    is_a: information content entity
    description: a variable that is used as a measure in the investigation of a study

  dataset:
    description: >-
      an item that refers to a collection of data from a data source.
    is_a: information content entity

  evidence type:
    is_a: information content entity
    aliases: ['evidence code']
    description: >-
      Class of evidence that supports an association

  publication:
    is_a: information content entity
    description: >-
      Any ‘published’ piece of information. Publications are considered broadly 
      to include any document or document part made available in print or on the 
      web - which may include scientific journal issues, individual articles, and 
      books - as well as things like pre-prints, white papers, patents, drug 
      labels, web pages, protocol documents,  and even a part of a publication if 
      of significant knowledge scope (e.g. a figure, figure legend, or section 
      highlighted by NLP). 
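
Looking up concrete replacements for an abstract category, as suggested above, can be sketched like this. It is a sketch only: the `CLASSES` table is a hand-written, abridged stand-in for the snippets above (not the full Biolink Model), and `concrete_children` is a hypothetical helper.

```python
# Hypothetical, abridged class table based on the snippets quoted above.
# Only `is_a` and an assumed `abstract` flag are kept.
CLASSES = {
    "information content entity": {"abstract": True},
    "study variable": {"is_a": "information content entity"},
    "dataset": {"is_a": "information content entity"},
    "evidence type": {"is_a": "information content entity"},
    "publication": {"is_a": "information content entity"},
}


def concrete_children(parent, classes=CLASSES):
    """Return the non-abstract direct children of `parent`, sorted by name."""
    return sorted(
        name
        for name, definition in classes.items()
        if definition.get("is_a") == parent and not definition.get("abstract", False)
    )


print(concrete_children("information content entity"))
```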

sierra-moxon commented 1 year ago

In thinking about the decision more broadly to tag 'abstract' classes used as node categories as errors or warnings, I would vote on the side of "flagging" these as warnings and using those warnings to make tickets in Biolink for further development of the model. Helpful for those tickets are examples (or even suggestions and/or PRs) that I can review with you to figure out which more specific Biolink classes are appropriate and/or where adding classes to the model would help.

Neither of these additional abstract classes is "wrong" for a user to see (in my opinion).

edeutsch commented 1 year ago

Great, thanks, let's go with this then.

edeutsch commented 1 year ago

Regarding InformationContentEntity, here's what is in KG2:

n.id | n.name | n.description
-- | -- | --
"FMA:85802" | "FMA attribute entity" | "UMLS Semantic Type: STY:T170"
"HP:0032443" | "Past medical history" | "In a medical encounter, the physician generally will interview the patient about his or her current problem, and may perform additional testing. The past medical history (PMH) in contrast records information about the patient's medical, personal and family history that might be relevant to the presenting illness or to provide optimal clinical management. The PMH generally includes (if relevant) other major illnesses, hospitalizations, surgeries, injuries, allergies, gynecologic and obstetric history, family history, personal history including occupational history, alcohol and drug use, etc. []; In a medical encounter, the physician generally will interview the patient about his or her current problem, and may perform additional testing. The past medical history (PMH) in contrast records information about the patient's medical, personal and family history that might be relevant to the presenting illness or to provide optimal clinical management. The PMH generally includes (if relevant) other major illnesses, hospitalizations, surgeries, injuries, allergies, gynecologic and obstetric history, family history, personal history including occupational history, alcohol and drug use, etc.; UMLS Semantic Type: STY:T033"
"NCIT:C20189" | "Property or Attribute" | "A distinguishing quality or prominent aspect of a person, object, action, process, or substance.; UMLS Semantic Type: STY:T077"
"STY:T077" | "Conceptual Entity" | null
"STY:T078" | "Idea or Concept" | null
"STY:T079" | "Temporal Concept" | null
"STY:T080" | "Qualitative Concept" | null
"STY:T081" | "Quantitative Concept" | null
"STY:T082" | "Spatial Concept" | null
"STY:T089" | "Regulation or Law" | null
"STY:T102" | "Group Attribute" | null
"STY:T169" | "Functional Concept" | null
"STY:T171" | "Language" | null
"STY:T185" | "Classification" | null
"owl:topObjectProperty" | null | null

ecwood commented 1 year ago

There's only one biolink:OrganismalEntity in RTX-KG2.8.3. I am honestly not sure how it even got that label:

{
  "iri": "http://id.nlm.nih.gov/mesh/D005007",
  "synonym": [
    "Primitive Societies",
    "Primitive Society",
    "Societies, Primitive",
    "Society, Primitive"
  ],
  "category_label": "organismal_entity",
  "deprecated": "True",
  "name": "Ethnology",
  "description": "The comparative and theoretical study of culture, often synonymous with cultural anthropology.; UMLS Semantic Type: STY:T098; UMLS Semantic Type: STY:T090",
  "provided_by": "['infores:mesh']",
  "id": "MESH:D005007",
  "category": "biolink:OrganismalEntity",
  "update_date": "2015"
}

I don't think it will fit into either of the categories you proposed, but I don't think it fits into biolink:OrganismalEntity either. We will have to track down why it is getting this label.

Also, if you noticed the deprecation, that is due to a documented bug (https://github.com/RTXteam/RTX-KG2/issues/315) in RTX-KG2.8.3 and isn't part of this issue.

For biolink:InformationContentEntity, there are 99325 total nodes with that category, so I took every 1000th one and put it in a table, for a more representative sample:

Name Provided By
Embryonic day 53 ['infores:umls']
Miwokan ['infores:hl7-umls']
final fee ['infores:hl7-umls']
Cortisol.free/Cortisone.free ['infores:loinc-umls']
L-serine/Creatinine | Urine | Chemistry - non-challenge ['infores:umls']
Spermatozoa.pyriform head/100 spermatozoa ['infores:umls']
Glucose^post dose glucose ['infores:umls']
Effective regurgitant orifice area during diastole | Aortic valve | DICOM Simplified Adult Echo Report concepts ['infores:loinc-umls']
DateRange ['infores:loinc-umls']
Spermatozoa.tapering head/100 spermatozoa | Semen | Fertility testing ['infores:loinc-umls']
Insulin^13th specimen post XXX challenge ['infores:umls']
Views for thrombosis^W radionuclide IV ['infores:loinc-umls']
5 days post dose dexamethasone ['infores:loinc-umls']
Glucose^10M post 0.5 g/kg glucose IV ['infores:loinc-umls']
12 hours (qualifier value) ['infores:umls']
Pyruvate^post exercise ['infores:loinc-umls']
Aspergillus flavus Ab.IgE.RAST class ['infores:umls']
Dental treatment anatomical site access ['infores:loinc-umls']
R' wave duration.lead V4 ['infores:loinc-umls']
10 minutes post venistasis ['infores:umls']
Eosinophils/leukocytes | Pleural fluid | Hematology and Cell counts ['infores:umls']
Betula populifolia Ab.IgE/IgE.total | Serum | Allergy ['infores:umls']
Parakeet serum Ab.IgE.RAST class ['infores:loinc-umls']
Cortisol^15M post 1 ug/kg CRH IV ['infores:umls']
Gastrin^2.5H post 0.2 U/kg secretin ['infores:umls']
Mytilus edulis Ab.IgE/IgE.total | Serum | Allergy ['infores:loinc-umls']
Cells.G0+G1 phase/100 cells | XXX | Molecular pathology ['infores:umls']
Cortisol^pre or post XXX challenge ['infores:umls']
Linoleate/Creatinine | Urine | Chemistry - non-challenge ['infores:umls']
Somatotropin^23rd specimen post XXX challenge ['infores:umls']
Alpha-Phenyl-2-Piperidine acetate/Creatinine | Urine | Drug toxicology ['infores:umls']
Chrysanthemum cinerariifolium Ab.IgE.RAST class ['infores:loinc-umls']
Octopus vulgaris Ab.IgE.RAST class ['infores:loinc-umls']
C peptide^20M post XXX challenge ['infores:loinc-umls']
Pregabalin cutoff ['infores:umls']
Cells.CDA/100 cells ['infores:loinc-umls']
Amobarbital/Creatinine ['infores:loinc-umls']
Somatotropin^10H post XXX challenge ['infores:umls']
Triiodothyronine.free^7th specimen post XXX challenge ['infores:loinc-umls']
Corticotropin^1H post dose glucose|Pt|Plas ['infores:loinc-umls']
Blasts/100 leukocytes | Cerebral spinal fluid | Hematology and Cell counts ['infores:loinc-umls']
Gated cells.total ['infores:umls']
Insufficient (qualifier) ['infores:umls']
Lymphocytes.kappa/100 lymphocytes | Body fluid | Cell markers ['infores:umls']
Cells.CD3+CD16+/100 cells | Tissue and Smears | Cell markers ['infores:loinc-umls']
Morus alba Ab.IgE/IgE.total ['infores:loinc-umls']
1 hour post dose glucose ['infores:loinc-umls']
Dermatophagoides sp Ab.IgE.RAST class | Serum | Allergy ['infores:loinc-umls']
Views^W contrast retrograde via urethra ['infores:umls']
30 minutes post dose ornithine alpha-ketoglutarate ['infores:loinc-umls']
Views 2 and PA ['infores:loinc-umls']
Neural tube defect risk cutoff ['infores:loinc-umls']
7:00-7:30pm ['infores:loinc-umls']
Uranium.depleted/Creatinine | Urine | Drug toxicology ['infores:loinc-umls']
Surgical aspects ['infores:umls']
TLV-Biological Limit Value ['infores:umls']
6: 31658181-31656314 ['infores:umls']
D Antigen Unit per Milliliter ['infores:ncit']
Subject Received Steroids Within One Month Prior to Diagnosis of Disease Phase ['infores:ncit']
Friday ['infores:ncit']
1: 40036668-40034495 ['infores:umls']
Hour Times Milligram per Milliliter ['infores:umls']
Control of Drug Substance: Justification of Specification ['infores:ncit']
Experiment Start Date ['infores:umls']
Device Carcinogenic Testing Evaluation Method ['infores:umls']
Collected Time Duration ['infores:umls']
Hour Squared Times Picomole Per Liter ['infores:umls']
Milliliter per Kilogram ['infores:ncit']
Megarad ['infores:umls']
Defined Imaging Enhancement Rate Value ['infores:umls']
Walloon Language ['infores:ncit']
Apoptotic Index ['infores:ncit']
French Catheter Gauge ['infores:umls']
X: 122719583-122773357 ['infores:umls']
Subject Off Trial Following Assignment to Protocol Treatment Arm ['infores:ncit']
Isometric Muscle Strength, External Rotation ['infores:ncit']
Date of New Tumor Event ['infores:umls']
Day Times Millimole Per Liter Per Milligram Per Gram Per Day ['infores:umls']
Order ['infores:ncit']
Radiation Was Administered as an Additional Treatment for a New Tumor Event ['infores:ncit']
Exposure Dose ['infores:ncit']
A Medium Amount of Time ['infores:ncit']
Personal Values ['infores:psy-umls']
Self Psychology ['infores:psy-umls']
processed array data file ['infores:efo']
4-hydroxyphenylacetate measurement ['infores:efo']
X-12717 measurement ['infores:efo']
alpha-N-acetylgalactosaminide alpha-2,6-sialyltransferase 5 measurement ['infores:efo']
heparan sulfate glucosamine 3-O-sulfotransferase 4 measurement ['infores:efo']
secreted frizzled-related protein 1 measurement ['infores:efo']
allopregnanolone sulfate measurement ['infores:efo']
inositol measurement ['infores:efo']
blood molybdenum measurement ['infores:efo']
specifies value of ['infores:foodon']
methylation reaction ['infores:ino']
transmission electron microscopy ['infores:mi']
desumoylase assay ['infores:mi']
Kind of quantity - Equilibrium ['infores:umls-metathesaurus']
Wintun language ['infores:umls-metathesaurus']
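
The sampling used for the table above (every 1000th node of one category) can be sketched as follows. This is a sketch under stated assumptions: the actual query run against RTX-KG2 is not shown in this thread, `sample_every` is a hypothetical helper, and the `nodes` list with its `infores:example` source is made-up stand-in data.

```python
from itertools import islice


def sample_every(nodes, step=1000):
    """Return every `step`-th item of an iterable (the 1000th, 2000th, ...)."""
    return list(islice(nodes, step - 1, None, step))


# Made-up stand-in for the 99325 nodes in the category.
nodes = [
    {"name": f"node {i}", "provided_by": ["infores:example"]}
    for i in range(1, 99326)
]

sample = sample_every(nodes)
print(len(sample))
```

A fixed stride like this is cheap and reproducible, but it follows whatever order the nodes come back in, so it is only as representative as that ordering.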

RichardBruskiewich commented 1 year ago

In thinking about the decision more broadly to tag 'abstract' classes used as node categories as errors or warnings, I would vote on the side of "flagging" these as warnings and using those warnings to make tickets in Biolink for further development of the model. Helpful for those tickets are examples (or even suggestions and/or PRs) that I can review with you to figure out which more specific Biolink classes are appropriate and/or where adding classes to the model would help.

Neither of these additional abstract classes is "wrong" for a user to see (in my opinion).

Just to clarify, @sierra-moxon and @edeutsch (cc: @ecwood): is the decision here just to add biolink:InformationContentEntity and biolink:OrganismalEntity to the list of allowed abstract classes, or rather to report any abstract class usage as a warning instead of an error?

RichardBruskiewich commented 1 year ago

...For biolink:InformationContentEntity, there are 99325 total nodes with that category, so I took every 1000th one and put it in a table, for a more representative sample:...

Concerning the list given above by @ecwood, I think that @sierra-moxon's general recommendation to tag abstract class usage as warnings, then post suitable Biolink Model curation issues to resolve them, makes good sense.

For example, even the first entry in the table:

Name Provided By
Embryonic day 53 ['infores:umls']

already suggests to me the following more precise Biolink category:

  life stage:
    is_a: organismal entity
    description: >-
      A stage of development or growth of an organism,
      including post-natal adult stages

I'm sure if we go through the table, some of the table entries potentially have more precise category assignments than biolink:InformationContentEntity.

This will clearly be, though, an ambitious knowledge curation exercise!

RichardBruskiewich commented 1 year ago

To sum up, @ecwood, @sierra-moxon and @edeutsch, I do see a significant rationale to add biolink:InformationContentEntity to the list of abstract exceptions.

Perhaps biolink:OrganismalEntity does not need to be added, since @ecwood has already identified the one sole RTX entry to be fixed.

As for any other abstract categories not yet listed here, perhaps we don't add them unless they pop up as significant pervasive errors in some context?

edeutsch commented 1 year ago

agreed

RichardBruskiewich commented 1 year ago

Assumed resolved by reasoner-validator release 3.7.2

saramsey commented 1 year ago

Hi @ecwood I will respond about OrganismalEntity in an RTX-KG2 issue.

saramsey commented 1 year ago

Concerning the list given above by @ecwood, I think that @sierra-moxon's general recommendation to tag abstract class usage as warnings, then post suitable Biolink Model curation issues to resolve them, makes good sense.

Great! Thank you Richard!

RichardBruskiewich commented 1 year ago

Hi @saramsey, basically, the code has been patched to let biolink:BiologicalEntity and biolink:InformationContentEntity pass with warnings, but not biolink:OrganismalEntity, since that seemed to be just a single fringe case. Eric D. should have already deployed the code to the ARAX UI. You can also run the latest reasoner-validator release directly against your KP locally.
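
The outcome described here can be sketched as a small allow-list check. This is not the reasoner-validator implementation: `ALLOWED_ABSTRACT_CATEGORIES` and `abstract_category_severity` are hypothetical names, and the caller is assumed to already know whether a category is abstract in the Biolink Model.

```python
# Assumption: the two abstract categories this thread agreed to tolerate.
ALLOWED_ABSTRACT_CATEGORIES = {
    "biolink:BiologicalEntity",
    "biolink:InformationContentEntity",
}


def abstract_category_severity(category, is_abstract):
    """Return None, "warning" or "error" for a node category.

    Non-abstract categories raise nothing; allow-listed abstract
    categories are downgraded to a warning; any other abstract
    category is still reported as an error.
    """
    if not is_abstract:
        return None
    if category in ALLOWED_ABSTRACT_CATEGORIES:
        return "warning"
    return "error"


print(abstract_category_severity("biolink:BiologicalEntity", True))
print(abstract_category_severity("biolink:OrganismalEntity", True))
```

Keeping the exceptions in one explicit set matches the approach agreed above: further abstract categories are added only if they turn up as pervasive errors.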