NCATS-Tangerine / translator-knowledge-beacon

NCATS Translator Knowledge Beacon Application Programming Interface plus Sample code
MIT License

Proposal for API version 1.3.0 #62

Open lhannest opened 5 years ago

lhannest commented 5 years ago

Beacon API proposals:

  1. Change the evidence date to a JSON object {"year" : 2015, "month" : 4, "day" : 23} rather than a string. If we do this, there is no possibility of different beacons formatting dates differently, and applications can get this information without having to parse date strings.
  2. Add some sort of qualification flag/status for statements that have been inferred through either deduction or heuristic. What is the best way to display this?
  3. Optional sub-graph ID? Some knowledge sources (like NDEx) are really a collection of independent knowledge graphs. In that case we might want to be able to choose which of those sub-graphs to query.
  4. Replace the statement source and target filters with subject and object filters. See https://github.com/NCATS-Tangerine/translator-knowledge-beacon/issues/61, https://github.com/NCATS-Tangerine/translator-knowledge-beacon/issues/60
  5. Set appropriate minimum and default values for size and offset. It makes the logic easier if we don't have to handle null values, and it prevents an accidental dump of the whole knowledge source. We may wish to set a max size as well, something like 10,000?
  6. Metadata endpoints should report the total number of nodes (concepts) and edges (statements)
  7. Concept and statement details endpoints should take a list of identifiers and give you a list of detail entities.
  8. Bring back synonyms on concepts endpoint. This way we can display gene symbols as well as protein names.
  9. Replace/complement random access pagination with a next page token. NDEx, for example, only supports pagination over networks and not over the nodes and edges in those networks. The NDEx beacon's next page token could represent: {network=5, offset=62, size=800}. Thus allowing each beacon to implement pagination with as many parameters as it needs. https://github.com/NCATS-Tangerine/translator-knowledge-beacon/issues/59. Along with the next page token we can return, whenever possible, the total number of records for that query.
  10. Remove fields from metadata endpoints: /categories remove uri, local_id, local_uri, add local_category. /predicates remove id, uri, local_id, local_uri, local_relation.
  11. Remove separate details endpoints. Instead have the response fields be configurable. User can pass in a list of fields, and those fields will show up in the response.
  12. Add an endpoint that returns metadata about the beacon (rather than the knowledge graph), like its name, its github page, a wiki or jupyter notebook explaining how to use it (maybe all beacons can share one), who to contact about it, and a link to the knowledge source it wraps.
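
The opaque next-page token in proposal 9 could be sketched as base64-encoded JSON (a hypothetical encoding, not part of any spec), letting each beacon pack whatever pagination state it needs into a single string:

```python
import base64
import json

def encode_page_token(state: dict) -> str:
    """Pack beacon-specific pagination state into an opaque string."""
    return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()

def decode_page_token(token: str) -> dict:
    """Recover the pagination state from a token returned by the client."""
    return json.loads(base64.urlsafe_b64decode(token.encode()))

# The NDEx example from proposal 9: pagination state that needs more
# than a plain offset/size pair.
token = encode_page_token({"network": 5, "offset": 62, "size": 800})
state = decode_page_token(token)
```

Because the token is opaque to clients, a beacon can later change what it encodes without breaking the API contract.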

Aggregator API proposals:

  1. Add publication metadata to evidence, including the publication's title, abstract, authors, journal, volume, issue, page numbers, page count, reference count, and language. See https://github.com/NCATS-Tangerine/translator-knowledge-beacon/issues/56.
  2. Use the publication metadata to implement a statement score, and display that score in the main statements endpoint. An initial scoring mechanism could be something like this:

    score = 0
    for sentence in abstract.split('.'):
        score += sentence.count(subject_name) * sentence.count(predicate_name) * sentence.count(object_name)
    score = sigmoid(score)

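Sketched as runnable Python (the sigmoid definition and the sample abstract are illustrative, not part of the proposal):

```python
import math

def sigmoid(x: float) -> float:
    """Squash the raw co-occurrence count into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

def statement_score(abstract: str, subject_name: str,
                    predicate_name: str, object_name: str) -> float:
    """Score a statement by counting sentence-level co-occurrence of its
    subject, predicate, and object names in a publication abstract."""
    score = 0
    for sentence in abstract.split('.'):
        score += (sentence.count(subject_name)
                  * sentence.count(predicate_name)
                  * sentence.count(object_name))
    return sigmoid(score)

# A sentence mentioning all three terms contributes to the score;
# a sentence missing any of them contributes nothing.
abstract = "BRCA1 interacts with BARD1. BRCA1 is a tumour suppressor."
s = statement_score(abstract, "BRCA1", "interacts", "BARD1")
```

A score of sigmoid(0) = 0.5 would then mean "no sentence mentions all three names together", which a client could treat as the neutral baseline.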
cmungall commented 5 years ago

Re: different formatting of dates. We have ISO-8601, seems overkill to normalize dates to an object?

RichardBruskiewich commented 5 years ago

Roger that. I noted that the earlier examples in the Translator Knowledge Beacon API swagger spec are obviously assuming ISO-8601, so I guess we were originally on the right track.

@lhannest I guess we simply follow the ISO-8601 yyyy-mm-dd format for the date field, and assume that the client will handle a JSON date string with this value accordingly.
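Under that convention, a client can recover the structured fields proposed in item 1 directly from the ISO-8601 string, e.g. with Python's standard library:

```python
from datetime import date

# Clients can turn the ISO-8601 date string into structured fields
# themselves, which is exactly what the proposed
# {"year", "month", "day"} object would have carried.
d = date.fromisoformat("2015-04-23")
as_object = {"year": d.year, "month": d.month, "day": d.day}
```

This keeps the wire format a plain string while losing none of the structured information.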

RichardBruskiewich commented 5 years ago

Oops.. reopening to allow comment on the other items... forgot that this was a general 1.3.0 API proposal issue!

micheldumontier commented 5 years ago

Agreed. Use the ISO standard for dates.

srensi commented 5 years ago

Would it be possible to either (a) add "confidence" field to statements return or (b) move qualifiers field from evidence to statements (preferred)?

lhannest commented 5 years ago

@srensi I think that's a good idea. I imagine in most cases we will need to qualify statements only because we've drawn an inference, and not because the knowledge source is reporting low confidence. If that's so maybe an optional "inferred_by" field that holds a description of the process by which this statement was inferred would be best. The presence of the field can then be taken to indicate lower confidence.
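
A minimal sketch of how a client might use such an optional field (the "inferred_by" name and the statement shape here are hypothetical, not part of the current spec):

```python
def is_inferred(statement: dict) -> bool:
    """Presence of the (hypothetical) optional 'inferred_by' field marks a
    statement as derived by some process rather than directly asserted."""
    return "inferred_by" in statement

asserted = {"subject": "A", "predicate": "related_to", "object": "B"}
inferred = dict(asserted, inferred_by="transitive closure over part_of")

# A client display could down-rank inferred statements below
# directly asserted ones (False sorts before True).
ranked = sorted([inferred, asserted], key=is_inferred)
```

The field doubles as provenance: its value describes the inference process, while its mere presence signals lower confidence.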

RichardBruskiewich commented 5 years ago

Hi @srensi,

I've been pondering your and Lance's commentary, and wondering to myself what the core use case is here?

@cmungall might wish to clarify the usage of "qualifier" in the Biolink model (after which the current beacon statement outputs are modelled).

The latest API (which has statement details rather than a bare evidence endpoint) may document a list of citations, each of which can be any internet-resolvable (CURIE- or URI-identified) resource, tagged by ECO evidence type (http://purl.obolibrary.org/obo/eco.owl).

I don't know how one translates this into a simple statement confidence measure. I suppose one could contemplate adding some kind of (optional) score to the citation (akin to a confidence measure generated by the citation source, which could be a software program, e.g. reasoner, etc.). Even with that, do we decide to somehow propagate this confidence score upwards from details into the main initial basic statement result, as some kind of score?

srensi commented 5 years ago

@RichardBruskiewich

I think your use case comment is quite pertinent. The main thing I was thinking about was filtering edges by confidence score. But honestly it's unclear whether this is something anyone actually wants/needs to do, and so probably not worth doing any work.

If there were going to be something, then maybe number of supporting citations? Something like @lhannest suggested with "inferred" or even "validated" (meaning a human has inspected evidence and verified the association) might be interesting. But I find myself agreeing with you in that it's unclear whether this is required by any use case worth supporting.

RichardBruskiewich commented 5 years ago

8. Bring back synonyms on concepts endpoint. This way we can display gene symbols as well as protein names.

The concept details call still reports synonyms and exact matches.

I understand that here you mean that the list of concepts should return additional classes of related names which are "user friendly", e.g. HGNC human gene symbols rather than Uniprot (protein) accession id's. But, in effect, gene concepts and protein concepts, although tightly coupled, are distinct conceptual entities within the "Central Dogma" of biology. We should probably keep them distinct in the knowledge graphs returned.

Of course, if the concept returned is a protein, then the Uniprot identifier may be its real identifier, and the protein concept may have an associated statement "encoded_by ".

Then, in that case, it is the user client business logic and interface which should forge the connection. A similar case came up in the past with respect to "taxon". We were going to embed it in the API output but then decided to keep it as an associated statement about the concept being discussed.

@cmungall any thoughts on this?

RichardBruskiewich commented 5 years ago

#5. We should set appropriate minimum and default values for size and offset. It makes the logic easier if we don't have to handle null values, and it prevents an accidental dump of the whole knowledge source. We may wish to set a max size as well, something like 10,000?

For practical reasons, I totally agree. That said, there are a number of confounding concerns:

1) Different beacons may have different thresholds of pain for returning results. A sensible limit for one may be too onerous for another.

2) The Knowledge Beacon Aggregator actually hard codes such limits in its calls to beacons, in the hundreds of entries (not 10,000... that feels a bit high...)

3) We previously removed formal paging from the beacons, since it seemed to be more of a client-side concern. At the Portland hackathon, though, we reintroduced the notion of a "cursor" for back end data retrieval. I wonder if this is enough.

4) If there is a concern for a given Knowledge Source that a user might "ask for the whole database", then perhaps each knowledge source can limit its output to some sensible cap. Would having a way of signaling to the client "I have more for you, if you want it..." suffice? Harold Solbrig (?) suggested that a kind of HATEOAS 'next' link be returned if more data was indeed still available, as an alternate mechanism to manage the data flow from beacon to client.
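
A sketch of that last point, assuming an illustrative per-beacon cap and a HATEOAS-style 'next' link (the cap value and field names are hypothetical, not spec):

```python
MAX_PAGE_SIZE = 500  # assumed per-beacon cap, not a spec value

def page_response(results: list, offset: int, size: int,
                  base_url: str) -> dict:
    """Return one capped page, plus a 'next' link when more data remains."""
    size = min(size, MAX_PAGE_SIZE)
    page = results[offset:offset + size]
    response = {"results": page}
    if offset + size < len(results):
        # Signal "I have more for you, if you want it..."
        response["next"] = f"{base_url}?offset={offset + size}&size={size}"
    return response

resp = page_response(list(range(1200)), offset=0, size=500,
                     base_url="/concepts")
```

The absence of a 'next' key tells the client it has reached the end, without the beacon ever having to dump its whole knowledge source.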

RichardBruskiewich commented 5 years ago

#3. Optional sub-graph ID? Some knowledge sources (like NDEx) are really a collection of independent knowledge graphs. In that case we might want to be able to choose which of those sub-graphs to query.

This interesting idea came up once before, right out of the NDEX group itself. It certainly does have some merit in the light of NDEX, and could be relevant to other huge knowledge repositories of interest thus making knowledge retrieval somewhat more tractable from such sources.

It does, however, raise the question: how does one discover the relevant sub-graphs in the first place? NDEx has its "network" API endpoints to do this. Also, how does this idea apply to other types of knowledge sources? Do we assume a "default" namespace equal to the whole knowledge graph of these other knowledge sources, but simply not have such a default for NDEx?

RichardBruskiewich commented 5 years ago

@srensi Thanks for your feedback on the "confidence" use case issue.

I do see some value in getting a handle on the reliability of statement assertions (a.k.a. knowledge "edges"). The notion of tracking a "count" of citations is actually a sensible idea.

In fact, a count of citations was readily at hand in our original Knowledge.Bio system, which was based on a Neo4j database before we joined NCATS and began the Knowledge Beacon distributed network of knowledge retrieval. In that legacy system, every statement was a reified edge-node pointing to its subject, object, predicate and an associated "evidence" node, which was itself an n-ary reification of all the related citation nodes, but locally cached the count of the number of citations plus the "evidence code". This graph model allowed us to quickly retrieve the count of citations for display alongside each statement in the Knowledge.Bio "statements" results list.

It is, of course, a derived quantity: the count of all the citation CURIEs in the array of evidence sent back from the "statement details" ( /statement/{statementId}) endpoint. But perhaps, although not part of the Translator Knowledge Graph data model, the published edges (statements) could cache a count of their evidence as a convenient (albeit shallow) measure of statement support.
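
Deriving that count client-side from a details-style response could look like this (the field names are illustrative, not the exact Beacon schema):

```python
def evidence_count(statement_details: dict) -> int:
    """Shallow support measure: count the citation entries in the
    evidence array of a statement-details-style response."""
    return len(statement_details.get("evidence", []))

# Illustrative statement-details payload with two ECO-tagged citations.
details = {
    "id": "kb:stmt-001",
    "evidence": [
        {"id": "PMID:12345", "evidence_code": "ECO:0000305"},
        {"id": "PMID:67890", "evidence_code": "ECO:0000501"},
    ],
}
```

If the beacon cached this number on each published statement, clients could display it without a per-statement details call.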

srensi commented 5 years ago
  1. I actually kind of like the details endpoints. Why the change?
lhannest commented 5 years ago
  1. I actually kind of like the details endpoints. Why the change?

I figured users would like to be able to get all the fields they might need in a single request. That may save them time if they're trying to download a large portion of the beacon's data.

But now that I think of it, we shouldn't remove the details endpoint. We would still want to be able to retrieve records by their ID's. And I'm starting to doubt that whatever speed improvement this might bring is worth the added complexity.

RichardBruskiewich commented 5 years ago

Maybe I can chime in here and refresh a bit of background on the philosophical origin of the beacon API.

First, the Beacon API arose out of the Knowledge.Bio web application that preceded Ben Good's and my (STAR Team) involvement with NCATS. The underlying design pattern was a simple workflow:

1) Identify a concept of interest by minimal metadata.
2) (Optionally) get details about that concept.
3) Get statements containing a selected concept as subject or object of the (predicate) relationship.
4) Get evidential support for a given statement (e.g. a list of PubMed citations).

At the time we started implementing this workflow as an MVC stack, we reckoned that in step 1), we should return a paged (possibly filtered) list of concepts with just enough information - concept name, category, maybe a brief description - allowing the user to identify exactly which concept was of interest to them. We wanted to avoid swamping the web-server bandwidth with all of the metadata associated with all the concepts matched. We figured that users would simply access such extended metadata one concept at a time (i.e. Step 2). We were not, at the time, thinking in terms of a database dump to a Jupyter Python batch client.

Similarly, for statements (i.e. predicate relationship "edges" with subject and object concept "nodes"), we thought it sensible in step 3) to simply return a paged, possibly filtered list of statements which had a single chosen concept as either "subject" or "object". Since each statement could be associated with a significant list of supporting evidence citations (i.e. pubmed entries), again, we didn't want to swamp the bandwidth by sending them all back with the original list of statements. Rather, we left retrieval of such supporting details to a statement-by-statement "evidence" (now "statement/details", Step 4) call.

Of course, the above workflow has evolved beyond this basic vision and is now beacon API web services driven. But still, if one knows the origin design intent and operating constraints of this workflow, then the division of labor between basic concept and statement calls, versus the concept/statement details calls becomes a bit more understandable (and, we can sensibly debate if and how we may wish to modify it).
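
The four-step workflow might translate into client calls roughly like these (the paths and query parameters are illustrative, not the exact Beacon spec):

```python
def workflow_requests(keyword: str, concept_id: str,
                      statement_id: str) -> list:
    """Hypothetical request sequence for the four-step workflow."""
    return [
        f"/concepts?keywords={keyword}",        # 1) identify concepts
        f"/concepts/{concept_id}",              # 2) details for one concept
        f"/statements?s={concept_id}",          # 3) statements about it
        f"/statements/{statement_id}/details",  # 4) evidence for one statement
    ]

steps = workflow_requests("BRCA1", "HGNC:1100", "kb:stmt-001")
```

The bandwidth argument in the comment above corresponds to steps 1 and 3 returning lightweight lists, with the heavier payloads deferred to the one-at-a-time calls in steps 2 and 4.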

srensi commented 5 years ago

Yes, I really like this workflow. In fact, I wish there were a way to incorporate something like this into the Reasoner spec. But it does create a tradeoff between the workload on the API and downstream complexity.

To give a concrete example...

One of the things that is leading to timeouts and crashing my Neo4j instance for my reasoner (issue raised by Chris on the 11/27 call) is pulling all the evidence for all of the edges for a large number of reasoning pathways. I use a cache to eliminate repetition where edges are shared between paths, but that only goes so far, and it still leaves the issue of passing large amounts of data.

I would prefer to (1) return paths with statement IDs (e.g. pointers to evidence), then (2) have someone hit the details endpoint of my beacon on an as-needed basis, with caching on the client side. But this introduces extra complexity for the downstream consumer, who may be far removed from the point of the initial reasoner query. They need to know (a) that they need to query my details endpoint, and (b) how to do it. How much do we trust them to get it right? Is there some way to "foolproof" this workflow?
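
The client-side caching in step (2) could be sketched with a memoized fetcher (the network call is stubbed out here; a real client would issue an HTTP GET against the details endpoint):

```python
from functools import lru_cache

# Records each cache miss, standing in for a real network round trip.
CALLS = []

@lru_cache(maxsize=1024)
def fetch_statement_details(statement_id: str) -> dict:
    """Fetch (and memoize) details for one statement by stable ID."""
    CALLS.append(statement_id)
    # Stub payload; a real client would GET the beacon's details endpoint.
    return {"id": statement_id, "evidence": []}

# Edges shared between reasoning paths hit the cache, not the beacon.
for sid in ["kb:1", "kb:2", "kb:1", "kb:1"]:
    fetch_statement_details(sid)
```

This only works if statement IDs are stable and resolvable across queries, which is exactly the CURIE-stability requirement raised in the reply below this comment.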

RichardBruskiewich commented 5 years ago

Thanks again @srensi (@cmungall cc:'d here)

Yes, getting a minimal specification of the subgraph - a list of concept and statement id's with perhaps just a few labels (e.g. node names and categories plus relationship predicate names) is bandwidth efficient.

I'm assuming here that downstream consumers of the NCATS system will access the system through clients abiding by an NCATS published interface: Reasoner API, beacon API related or otherwise.

The most important criterion is that all of these concept and statement IDs should be stable, resolvable CURIEs, independent of any underlying query which discovers them.

We also have to assume, though, a stable endpoint exists somewhere to offer "details" retrieval using those identifiers. If your Reasoner API was the original source of those identifiers to the consumer, then yes, perhaps the Reasoner API needs to specify something like a /details access.

In the beacon world we have been building in NCATS, this interface could be the individual beacons, but I'd suggest that, for ease of access, it could also be a stable "Knowledge Beacon Aggregator" (KBA) web service that hosts /concepts/details and /statements/details endpoints across a whole registry of known beacons (so, in principle, a client doesn't have to guess where to look, in the beacon world at least, for the details they need).

Whether or not KBA is eventually subsumed behind a common "Reasoner" endpoint or not is an NCATS architectural decision to be made.