NCATSTranslator / ReasonerAPI

NCATS Biomedical Translator Reasoners Standard API
Standard/conventions for representing supporting publications #398

Closed mbrush closed 1 year ago

mbrush commented 1 year ago

Testing efforts have identified variability w.r.t. how publications and other documents are represented as support for Edges and Study Result objects in TRAPI messages. The Biolink Model provides two edge properties (supporting_document and its child publications, which are the topic of the Biolink ticket here), as well as a Publication class.

Here, I would like to finalize and document conventions for how these elements are applied in Attributes in TRAPI messages to structure this type of information.

Questions/Issues:

  1. When to use these different properties? (Ignore for now, as we are voting here on whether to have two properties at all.)

  2. How to represent the value of these properties?
    a. Current convention is to use established document identifiers such as PMIDs, PMC ids, DOIs, etc. Should we set an order of preference here, or a requirement to use one over another (e.g. if a pub has a PMC id and a PMID, always use the PMID)? Do we have a resource that maps between different types of identifiers for documents/publications?

  3. In cases where there are multiple documents/pubs supporting a given Statement or Study Result, should these be captured as a list/array of document identifiers in the Attribute.value field, or one at a time in separate Attribute objects?
    a. It was decided earlier that separate attributes were best, as this would allow additional metadata about each publication to be captured in other Attribute fields (e.g. url, description/title, etc). But it wasn't clear whether this was a requirement or a recommendation.
    b. In practice, most KPs are providing lists of PMIDs in a single Attribute, especially in cases where there are very many supporting pubs (can be tens to hundreds in some cases). But the format/syntax used to represent such lists is variable (e.g. a single string vs a formal array of object references? If a string, comma vs pipe delimiters?)

  4. Where/how should we capture additional metadata about a publication (e.g. title, year, url, MeSH terms, journal, authors, etc)?
    a. Some of the use cases for applying such additional publication info are outlined in the ticket here. Most relate to supporting O&O for ordering results, and deciding which pubs to show first for a given result.
    b. Some of this info could be captured using Attribute fields such as value_url or description (see json example below), but this doesn't work if we allow multiple pubs as a list in a single Attribute object.
    c. But TMKP already provides a publication metadata API that provides the types of info listed here, so representing any of this in the publications Attribute seems duplicative.
    d. @bill-baumgartner I assume here that Publication/Documents are represented as instances of Biolink:Publication, which has a collection of node properties for describing things like title, authors, etc?

{
  "attribute_type_id": "biolink:publications",
  "value": "PMID:32961480",
  "value_type_id": "biolink:Publication",
  "value_url": "https://pubmed.ncbi.nlm.nih.gov/32961480/",
  "description": "Review article, 'Discovery of novel liver X receptor inverse agonists as …'",
  "attribute_source": "infores:molecular-data-provider"
}
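
Question 2 asks whether one identifier type should be preferred when a publication has several. Purely as an illustration of what such a rule could look like in code, here is a minimal Python sketch. The preference order (PMID, then PMC, then DOI) is an assumption taken from the example in the question, not a decided convention, and the function name is hypothetical.

```python
# Hypothetical preference order; the thread has not settled on one.
PREFERENCE = ("PMID", "PMC", "DOI")


def preferred_identifier(curies, preference=PREFERENCE):
    """Pick one CURIE for a single publication according to a prefix preference.

    `curies` holds alternative identifiers for the SAME publication.
    Prefixes not in `preference` rank last; ties keep input order.
    """
    def rank(curie):
        prefix = curie.split(":", 1)[0].upper()
        if prefix in preference:
            return preference.index(prefix)
        return len(preference)

    return min(curies, key=rank)


print(preferred_identifier(["PMC:6815562", "PMID:31737390"]))
```

This only chooses among identifiers that are already known to refer to the same document; it does not answer the separate question of where a mapping between identifier types would come from.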
mbrush commented 1 year ago

Proposal:

Given the state of affairs described above (i.e. the current preference of KPs to capture lists of pub ids in a single Attribute object, and the availability of TMKP's publication metadata endpoint to provision publication info on demand), the simplest path forward may be to:

{
  "attribute_type_id": "biolink:publications",    
  "value":   ["PMID:31737390", "PMC:6815562"] ,                                  
  "value_type_id": "biolink:Publication",    
  "attribute_source": "infores:text-mining-provider-targeted"
}
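
Until producers converge on the array form above, a consumer may still receive pipe- or comma-delimited strings. A minimal Python sketch of normalizing all of these to a list follows. The function name is hypothetical, and it assumes the values are identifiers (CURIEs or ids), not free-text citations, which can legitimately contain commas.

```python
import re


def normalize_publication_value(value):
    """Return a list of identifier strings.

    Accepts either a list of identifiers or a single string that uses
    pipe or comma delimiters. Surrounding whitespace and empty entries
    are dropped. Not suitable for free-text citations.
    """
    if isinstance(value, str):
        parts = re.split(r"[|,]", value)
    else:
        parts = list(value)
    return [part.strip() for part in parts if part.strip()]


print(normalize_publication_value("PMID:31737390, PMC:6815562"))
print(normalize_publication_value("PMID:31737390|PMC:6815562"))
print(normalize_publication_value(["PMID:31737390", "PMC:6815562"]))
```

All three calls yield the same two-element list, which is the main argument for emitting the array form in the first place: it removes the need for this kind of guessing on the consumer side.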
  "value":  ["PMID:31737390", "PMC:6815562"]   #  formal array of Publication object references
    vs.
  "value":  "PMID:31737390|PMC:6815562"         #  single string, pipe delimiters
  "value":  "PMID:31737390, PMC:6815562"        #  single string, comma delimiters
mbrush commented 1 year ago

Related issue: how to represent Clinical Trials that support an Edge/Statement. At present, some KPs are using the publications slot to capture NCT clinical trial ids - but these ids represent trials, not publications:

[image] (Note that this KP is actually using the biolink:Publication class as the attribute_type_id instead of the publications slot.)

We need to decide if we want to allow this (which means considering something like NCT00494715 to be a document or publication describing a clinical trial, rather than the clinical trial itself), or define a different model/pattern to link an edge to a supporting clinical trial.

To me, these NCT records are descriptions of a clinical study/trial that was performed and the results that were generated - and the most accurate and useful representation would be to use a property like supporting_studies here, and treat the NCT record as an instance of a Study instead of a Publication. This would also allow us to directly capture information about the study like its dates, status, size, methods, design, site, etc. We would just need to think about how it would work to represent Studies as nodes in a KG, and flesh out our modeling of 'Study' in Biolink a bit. Here, we could work with the Multiomics Team - who is working on a Clinical Trials KP where I think this info would live. An Edge supported by the results of a clinical trial would reference the identifier/CURIE for this study as a supporting_study, and more info about this study could be retrieved/looked up from this Clinical Trials KP, just like the paradigm for Pubs with the Publication Metadata API.

I think it would be quite analogous to the publications modeling pattern we are putting in place, but with a supporting_studies property pointing at a Study instance, instead of a publications property pointing to a Publication instance. In both cases, we can represent the Pub or Study as a node in a graph in which we can capture its characteristics. The TMKP Team's Publication Metadata KP is working on this for publications, and I believe the Multiomics Team is working on a Clinical Trials KP that would do the same for trials . . . more info about this study could be retrieved/looked up from this Clinical Trials KP, just like the paradigm for Pubs with the Publication Metadata API.

That said, an argument could be made that NCT records are descriptions of clinical studies and their results, just like most journal articles are descriptions of studies that were performed and the results they generated. But treating an NCT record like a publication makes it harder to capture the rich details of the trial itself - and if this is important, we should represent them as Studies, not Publications.

edeutsch commented 1 year ago

Is it the mere existence of the clinical trial that is providing the evidence or is it the published outcome/report of the clinical trial that is providing the evidence? I would imagine the latter? Perhaps it is the final report (I assume there always is one? I don't actually know) from the trial that should be cited as a publication? Perhaps these identifiers identify studies, yes, but most crucially identify the published outcome?

mbrush commented 1 year ago

There are also considerations about the form/syntax of the value for a CT.gov record. Currently referenced as a string representing the identifier, e.g. NCT00222573. Should we require a CURIE or URL form of this be captured? Bioregistry.io already defines one we could use, and an expansion. See https://bioregistry.io/registry/clinicaltrials.

Assuming this means we can require CURIE/URL-based representation of clinical trials, consider the concrete proposals below for how we reference supporting clinical trials. Examples below show supporting clinical trials alongside publications referenced as CURIEs/URLs vs free text, per the decision to split these into separate Attribute objects here.

Option 1: Consider clinical trial record ids to be Publications, and capture them using the publications edge property.

# Attribute containing pubs referenced by CURIE or URL (which here include NCT records)
{
  "attribute_type_id": "biolink:publications",            
  "value": [
           "PMID:31737390",
           "PMID:6815562",     
           "http://info.gov.hk/gia/general/201011/02/P201011020204.htm",
           "clinicaltrials:NCT00222573",
           "clinicaltrials:NCT00503152"
           ],
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:hmdb"
}, 

# Attribute containing pubs referenced as free text
{
  "attribute_type_id": "biolink:publications",           
  "value": [
            "Thematic Review Series: Glycerolipids. Phosphatidylserine and phosphatidylethanolamine in mammalian cells: two metabolically related aminophospholipids",  
            "Toranosuke Saito, Takashi Ishibashi, Tomoharu Shiozaki, Tetsuo Shiraishi, 'Developer for pressure-sensitive recording sheets, aqueous dispersion of the developer and method for preparing the developer.' U.S. Patent US5118443, issued September, 1986.: http://www.google.ca/patents/US5118443"
           ],                                  
  "value_type_id": "string",    
  "attribute_source": "infores:hmdb"
}

Option 2: Consider clinical trial record ids to represent the study/trial itself, and capture using a supporting_studies edge property.

# Attribute containing pubs referenced by CURIE or URL (which here do not include clinical trial records)
{
  "attribute_type_id": "biolink:publications",            
  "value": [
           "PMID:31737390",
           "PMID:6815562",     
           "http://info.gov.hk/gia/general/201011/02/P201011020204.htm"
           ],
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:hmdb"
}, 

# Attribute containing pubs referenced as free text
{
  "attribute_type_id": "biolink:publications",           
  "value": [
            "Thematic Review Series: Glycerolipids. Phosphatidylserine and phosphatidylethanolamine in mammalian cells: two metabolically related aminophospholipids",  
            "Toranosuke Saito, Takashi Ishibashi, Tomoharu Shiozaki, Tetsuo Shiraishi, 'Developer for pressure-sensitive recording sheets, aqueous dispersion of the developer and method for preparing the developer.' U.S. Patent US5118443, issued September, 1986.: http://www.google.ca/patents/US5118443"
           ],                                  
  "value_type_id": "string",    
  "attribute_source": "infores:hmdb"
},

# Attribute containing supporting clinical trials
{
  "attribute_type_id": "biolink:supporting_studies",           
  "value": [
           "clinicaltrials:NCT00222573",
           "clinicaltrials:NCT00503152"
           ],                                  
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:hmdb"
}
edeutsch commented 1 year ago

I don't have a strong opinion, but I think I'm liking option 1. It depends a little on my abstract question above to which there was no answer provided:

Is it the mere existence of the clinical trial that is providing the evidence or is it the published outcome/report of the clinical trial that is providing the evidence? I would imagine the latter? Perhaps it is the final report (I assume there always is one? I don't actually know) from the trial that should be cited as a publication? Perhaps these identifiers identify studies, yes, but most crucially identify the published outcome? Again, I don't know myself. Can someone shed light on this?

mbrush commented 1 year ago

May 22 Update:

Clinical trial records in clinicaltrials.gov are often referenced as support for edges reporting a drug to treat a condition, e.g. "NCT00222573". In recent discussions we settled on a specific CURIE syntax with which to reference clinical trials as reported in registries like clinicaltrials.gov, in the Attribute.value field:

"value": "clinicaltrials:NCT00222573"

We also decided that any URLs that allow users to link directly to the ct.gov site to explore these clinical trial records should be captured in the Attribute.value_url field:

"value": "clinicaltrials:NCT00222573",
"value_url": "https://clinicaltrials.gov/search?id=%22NCT03074773%22"

We have yet to settle on what edge property will be the attribute_type_id for the Attribute object. We could choose to treat these NCT ids as publications and use the existing biolink:publications edge property, or we could use a more specific edge property like biolink:supporting_studies that distinguishes trials from pubs (and lets us point to an instance of a performed study/trial whose results support the 'treats' edge).

Concrete examples of these competing proposals are presented below - showing how supporting trials would be represented alongside supporting publications for both approaches:

Option 1: Use the publications edge property and capture trials in same Attribute as pubs.

# Attribute containing pubs referenced by CURIE or URL (which here include NCT records)
{
  "attribute_type_id": "biolink:publications",            
  "value": [
           "PMID:31737390",          # a publication reporting the results of one supporting trial
           "PMID:6815562",           # a publication reporting the results of another supporting trial
           "clinicaltrials:NCT00222573",
           "clinicaltrials:NCT00503152",
           "clinicaltrials:NCT00634963"
           ],
  "value_type_id": "biolink:Uriorcurie",
  "value_url":  "https://clinicaltrials.gov/search?id=%22NCT02658760%22OR%22NCT02679560%22OR%22NCT05084573%22",
  "attribute_source": "infores:chembl"
}, 

Option 2: Use a supporting_studies edge property and create a separate Attribute object.

# Attribute containing pubs referenced by CURIE or URL (which here do not include clinical trial records)
{
  "attribute_type_id": "biolink:publications",
  "value": [
           "PMID:31737390",          # a publication reporting the results of one supporting trial
           "PMID:6815562"            # a publication reporting the results of another supporting trial
           ],
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:chembl"
}, 

# Attribute containing supporting clinical trials
{
  "attribute_type_id": "biolink:supporting_studies",           
  "value": [
           "clinicaltrials:NCT02658760",
           "clinicaltrials:NCT02679560",
           "clinicaltrials:NCT05084573"
           ],
  "value_type_id": "biolink:Uriorcurie",
  "value_url":  "https://clinicaltrials.gov/search?id=%22NCT02658760%22OR%22NCT02679560%22OR%22NCT05084573%22",
  "attribute_source": "infores:chembl"
}

In making a decision, consider how the Clinical Trials KP being developed by the Multiomics team will represent these clinical trial records in their data (as information entities (pubs / records), or as research activities (studies/trials)) - and aim to apply consistent semantics across these representations.

Also, consider that using a separate supporting_studies property makes it easier for the UI to find / count supporting pubs vs trials - esp if we start using other designators for studies besides NCT ids.
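
To make the difference between the two options concrete, here is a minimal Python sketch of how a consumer could turn an Option 1 style Attribute (trials mixed in with pubs) into the Option 2 shape. It assumes trials are always written with the clinicaltrials: CURIE prefix, and it uses biolink:supporting_studies only as the property name proposed in this thread. The function name is hypothetical.

```python
TRIAL_PREFIX = "clinicaltrials:"


def split_publications_and_trials(attribute):
    """Split one mixed publications Attribute into pubs and trials Attributes.

    Returns a list with up to two Attribute dicts: one typed
    biolink:publications and one typed biolink:supporting_studies
    (the proposed property). Empty groups are omitted. Any value_url
    on the input is not carried over, since it may no longer match.
    """
    values = attribute["value"]
    if isinstance(values, str):
        values = [values]

    trials = [v for v in values if v.startswith(TRIAL_PREFIX)]
    pubs = [v for v in values if not v.startswith(TRIAL_PREFIX)]

    shared = {
        "value_type_id": attribute.get("value_type_id", "biolink:Uriorcurie"),
        "attribute_source": attribute.get("attribute_source"),
    }
    result = []
    if pubs:
        result.append(
            {"attribute_type_id": "biolink:publications", "value": pubs, **shared}
        )
    if trials:
        result.append(
            {"attribute_type_id": "biolink:supporting_studies", "value": trials, **shared}
        )
    return result


mixed = {
    "attribute_type_id": "biolink:publications",
    "value": ["PMID:31737390", "PMID:6815562", "clinicaltrials:NCT00222573"],
    "value_type_id": "biolink:Uriorcurie",
    "attribute_source": "infores:chembl",
}
for attr in split_publications_and_trials(mixed):
    print(attr["attribute_type_id"], len(attr["value"]))
```

Under Option 1 every consumer that wants separate counts of pubs and trials has to do this prefix check itself, and the check breaks as soon as studies are referenced with designators other than NCT ids. Under Option 2 the producer states the distinction once.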

gglusman commented 1 year ago

Note the 'value_url' as shown ( https://clinicaltrials.gov/search?id=%22NCT03074773%22 ) doesn't work. The correct syntax for the search seems to be: https://clinicaltrials.gov/ct2/results?term=NCT03074773

To get to the trial directly, for the current version of the system, the URL would be: https://clinicaltrials.gov/ct2/show/NCT03074773

Looks like they're developing a new version, and under it, the URL seems to be: https://beta.clinicaltrials.gov/study/NCT03074773

...and the search syntax: https://beta.clinicaltrials.gov/search?id=NCT03074773
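
For reference, a minimal Python sketch of building the value and value_url fields from a bare NCT id, using the direct trial page pattern above as the default. The function name is hypothetical, and the base URL is a parameter because the site's URL scheme appears to be changing.

```python
import re

# NCT ids are the letters "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"^NCT\d{8}$")


def trial_attribute_fields(nct_id, base_url="https://clinicaltrials.gov/ct2/show/"):
    """Build the Attribute value (CURIE) and value_url for one trial id."""
    if not NCT_PATTERN.match(nct_id):
        raise ValueError(f"not a bare NCT identifier: {nct_id!r}")
    return {
        "value": f"clinicaltrials:{nct_id}",
        "value_url": f"{base_url}{nct_id}",
    }


print(trial_attribute_fields("NCT03074773"))
print(trial_attribute_fields("NCT03074773", base_url="https://beta.clinicaltrials.gov/study/"))
```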

mbrush commented 1 year ago

Thanks Gustavo - I used the search syntax in my examples only because this is the format of the links we get from sources like chembl as provided by MolePro. Ideally we would want to point people directly to individual trial pages. And good to know about the new version in development. I'd like to find out more about this from you or Kamileh.

gglusman commented 1 year ago

To clarify, the new version appears to be just for their UI. You get prompted to try it when accessing the old one.

mbrush commented 1 year ago

Outcome of 5-23-23 EPC Call - general preference for 'supporting_study' as the long-term solution, but we may have to wait to implement until after September, as KPs may not be able to regenerate data to be compliant with this specification before then. In the meantime, the modeling team will get the required edge property into Biolink, and draft an initial specification for representing supporting clinical trials / studies in TRAPI.

mbrush commented 1 year ago

Closing with the creation of the supporting_publications_specification - but as noted in the spec, we need to return to this and modify the specification for how to reference supporting clinical trials. I created #447 as a separate ticket for this issue.