Capturing supporting pubs referenced as free text (vs a CURIE or URL)

mbrush commented 1 year ago

Please review the "Executive Summary" of the TRAPI Spec for representing supporting publications for context.

As part of this spec, we need to decide how to handle the rare cases where a source provides a free text reference to a pub (instead of a CURIE or URL).

Vlado provided some examples of free-text pub references they get from HMDB:


1. Thematic Review Series: Glycerolipids. Phosphatidylserine and phosphatidylethanolamine in mammalian cells: two metabolically related aminophospholipids

2. Tinto WF, Reynolds WF, Seaforth CE, Mohammed S, Maxwell A. New bitter saponins from the bark of Colubrina elliptica: 1H and 13C assignments by 2D NMR spectroscopy. Magnetic resonance in chemistry 1993;31(9):859-864. [Structure]

3. Toranosuke Saito, 'Nuclear substituted salicylic acids and their salts.' U.S. Patent US5049685, issued November, 1979.: http://www.google.ca/patents/US5049685

nlharris commented 1 year ago

(Separating out the options so you can 👍 the one you prefer)

Option 1:

Lump these strings as reported by the source inside the same Attribute object with CURIEs and URLs.


{
  "attribute_type_id": "biolink:publications",           
  "value": [
           "PMID:31737390", 
           "PMID:6815562", 
           "http://info.gov.hk/gia/general/201011/02/P201011020204.htm", 
           "Thematic Review Series: Glycerolipids. Phosphatidylserine and phosphatidylethanolamine in mammalian cells: two metabolically related aminophospholipids",
           "Toranosuke Saito, Takashi Ishibashi, Tomoharu Shiozaki, Tetsuo Shiraishi, 'Developer for pressure-sensitive recording sheets, aqueous dispersion of the developer and method for preparing the developer.' U.S. Patent US5118443, issued September, 1986.: http://www.google.ca/patents/US5118443" 
            ],                                  
  "value_type_id": "string",    
  "attribute_source": "infores:hmdb"
}

This, however, makes it hard for the UI to parse them apart so that it can display URL/CURIE refs differently from free-text refs.
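To make the parsing burden concrete, here is a rough sketch of the kind of guess-and-check a consumer would need under Option 1. The function name `classify_pub_reference` and both regular expressions are assumptions for illustration only; they are not part of TRAPI or Biolink, and a real CURIE check would likely validate prefixes against a registry.

```python
import re

# Assumption: a URL is an http(s) string with no whitespace.
URL_RE = re.compile(r"https?://\S+")
# Assumption: a CURIE is "prefix:reference" with no whitespace anywhere.
CURIE_RE = re.compile(r"[A-Za-z][A-Za-z0-9_.-]*:\S+")


def classify_pub_reference(ref: str) -> str:
    """Guess whether a publication reference is a URL, a CURIE, or free text."""
    text = ref.strip()
    # URLs are tested first because "http://..." also looks like prefix:reference.
    if URL_RE.fullmatch(text):
        return "url"
    if CURIE_RE.fullmatch(text):
        return "curie"
    # Anything containing whitespace (titles, full citations) falls through here.
    return "free_text"
```

Note that a free-text citation ending in a URL (like the patent example) is still classified as free text, because the whole string has to match.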

nlharris commented 1 year ago

Option 2:

A simple solution would be to have KPs separate out non-URL/CURIE pub references into a separate Attribute object (that is also keyed on the biolink:publications edge property):


{
  "attribute_type_id": "biolink:publications",            
  "value": [
           "PMID:31737390",
           "PMID:6815562",     
           "http://info.gov.hk/gia/general/201011/02/P201011020204.htm"
           ],
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:hmdb"
}, 
{
  "attribute_type_id": "biolink:publications",           
  "value": [
            "Thematic Review Series: Glycerolipids. Phosphatidylserine and phosphatidylethanolamine in mammalian cells: two metabolically related aminophospholipids",  
            "Toranosuke Saito, Takashi Ishibashi, Tomoharu Shiozaki, Tetsuo Shiraishi, 'Developer for pressure-sensitive recording sheets, aqueous dispersion of the developer and method for preparing the developer.' U.S. Patent US5118443, issued September, 1986.: http://www.google.ca/patents/US5118443"
           ],                                  
  "value_type_id": "string",    
  "attribute_source": "infores:hmdb"
}

The challenge here is that it requires each KP to refactor its ingest code to identify and separately package these non-URL/CURIE references into a different Attribute object.
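As a rough idea of the size of that refactor, the split could look something like the sketch below. The helper name `split_publications` and the naive regex are assumptions for illustration; actual KP ingest code will differ, and the value_type_id used for the free-text group is simply the "string" shown in the example above.

```python
import re

# Assumption: anything without whitespace that is an http(s) URL or looks like
# prefix:reference is treated as a URI/CURIE; everything else is free text.
URI_OR_CURIE_RE = re.compile(r"https?://\S+|[A-Za-z][A-Za-z0-9_.-]*:\S+")


def split_publications(values, source):
    """Package publication references into up to two Attribute objects."""
    structured, free_text = [], []
    for value in values:
        if URI_OR_CURIE_RE.fullmatch(value.strip()):
            structured.append(value)
        else:
            free_text.append(value)

    attributes = []
    if structured:
        attributes.append({
            "attribute_type_id": "biolink:publications",
            "value": structured,
            "value_type_id": "biolink:Uriorcurie",
            "attribute_source": source,
        })
    if free_text:
        attributes.append({
            "attribute_type_id": "biolink:publications",
            "value": free_text,
            "value_type_id": "string",
            "attribute_source": source,
        })
    return attributes
```

An empty group produces no Attribute object, so sources that only ever supply CURIEs or URLs would emit exactly what they emit today.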

In the long term, we will rely on a solution where KPs or some upstream service separates them in advance - as we don’t want the UI to have to do this kind of parsing.

In the short term, this is not something KPs currently have an appetite to do, and something that may be hard to have in place in time for the September release (unless this is decreed as a priority).

edeutsch commented 1 year ago

In addition to the voting, I suggest that a value_type_id of "string" is not useful because it is obvious from the data that it is a string and no additional information is conveyed. If the elements of the value are not URIs or CURIEs, then a value_type_id of biolink:free_text_citation or EDAM-DATA:0970 seems more descriptive and useful. https://www.ebi.ac.uk/ols4/ontologies/edam/classes/http%253A%252F%252Fedamontology.org%252Fdata_0970?lang=en

mbrush commented 1 year ago

@edeutsch re:

a value_type_id of biolink:free_text_citation or EDAM-DATA:0970 seems more descriptive and useful.

Doesn't tagging the value as a 'Citation' here amount to specifying the semantic type of the value in the value_type_id field?
You have been advocating the use of this field only to indicate the more technical/formal data type (e.g. CURIE, int, boolean, etc.). Maybe I am having trouble seeing where you are drawing the line between technical/formal data type (allowed) vs semantic data type (not allowed)?

If you agree that it would not be right to say 'Citation' here, what would you recommend instead?

edeutsch commented 1 year ago

I am not certain I understand the question, but I think I am advocating:

{
  "attribute_type_id": "biolink:publications",            
  "value": [
           "PMID:31737390",
           "PMID:6815562",     
           "http://info.gov.hk/gia/general/201011/02/P201011020204.htm"
           ],
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:hmdb"
}, 
{
  "attribute_type_id": "biolink:publications",           
  "value": [
            "Thematic Review Series: Glycerolipids. Phosphatidylserine and phosphatidylethanolamine in mammalian cells: two metabolically related aminophospholipids",  
            "Toranosuke Saito, Takashi Ishibashi, Tomoharu Shiozaki, Tetsuo Shiraishi, 'Developer for pressure-sensitive recording sheets, aqueous dispersion of the developer and method for preparing the developer.' U.S. Patent US5118443, issued September, 1986.: http://www.google.ca/patents/US5118443"
           ],                                  
  "value_type_id": "biolink:free_text_citation",    
  "attribute_source": "infores:hmdb"
}

If we don't or can't have biolink:free_text_citation, then I was offering EDAM-DATA:0970 as an established concept. However, I suppose we would use EDAM-DATA:0970 with the proviso that we understand that EDAM-DATA:0970 is a free text citation as opposed to a biolink:Uriorcurie. (Yes, EDAM-DATA:0970 is just "Citation", so I view biolink:free_text_citation as better, but it doesn't exist yet.)

So I am advocating that when attribute_type_id=biolink:publications, we use value_type_id to disambiguate whether the value (which is a string or array of strings) is a URI or CURIE or free text citation (each of which is physically stored as a string). This is to help the reader know how it should interpret that string. The reader's behavior should be somewhat different if the string is a URI vs. a CURIE vs. a free text citation. The reader could write a function to guess and check. But why not tell the reader, because the writer ought to know?
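For illustration, a reader that trusts value_type_id might behave roughly like the sketch below. The function name `render_publications`, the output shape, and the `PREFIX_URLS` map are all assumptions made up for this example, and biolink:free_text_citation is the proposed (not yet existing) type discussed above.

```python
# Assumption: an illustrative prefix-to-URL map; a real reader would use a
# proper prefix registry rather than a hard-coded dict.
PREFIX_URLS = {"PMID": "https://pubmed.ncbi.nlm.nih.gov/"}


def render_publications(attribute):
    """Turn a biolink:publications Attribute into display items with optional links."""
    type_id = attribute.get("value_type_id")
    values = attribute["value"]
    if isinstance(values, str):  # the value may be a single string or a list
        values = [values]

    rendered = []
    for value in values:
        href = None
        if type_id == "biolink:Uriorcurie":
            if value.startswith(("http://", "https://")):
                href = value  # a URL is already a usable link
            else:
                prefix, _, local_id = value.partition(":")
                base = PREFIX_URLS.get(prefix)
                href = base + local_id if base else None
        # Any other type (e.g. the proposed biolink:free_text_citation)
        # is shown as plain text with no link.
        rendered.append({"label": value, "href": href})
    return rendered
```

The point is that the only per-string decision left to the reader is URL vs. CURIE inside a group already declared as biolink:Uriorcurie; it never has to guess whether a string is a citation.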

Does that answer the question or did I miss it?

edeutsch commented 1 year ago

Addressed in #422. Closing.

NCATSTranslator / ReasonerAPI

Capturing supporting pubs referenced as free text (vs a CURIE or URL) #429
