NCATS-Tangerine / translator-knowledge-beacon

NCATS Translator Knowledge Beacon Application Programming Interface plus Sample code
MIT License

Random access pagination or iteration? #59

Open lhannest opened 5 years ago

lhannest commented 5 years ago

@cmungall @RichardBruskiewich @vdancik Must we support random access pagination (getting a page of a given offset and size)? Would it be troubling to support only iteration (getting the next page of a given size)?

There may not be a bijective mapping from the knowledge source's records to the statements we want to extract. Sometimes I might infer a single statement from multiple records, or multiple statements from a single record. And sometimes I'm not able to apply filters when getting records from the knowledge source, so I must throw away records that don't match the filters. This is easy if we don't need to support random access, and pretty challenging otherwise. With NDEx we've been caching all results and then returning pages from the cache, but that seems like a pretty impractical solution.
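A minimal sketch of the problem being described, with illustrative names (`matches_filter` and `extract` are hypothetical, not beacon API calls): when filtering and expansion happen on the beacon side, statement offsets no longer line up with record offsets.

```python
# Hypothetical sketch: why offset/size paging is hard when the mapping from
# source records to beacon statements is not one-to-one.

def statements(records, matches_filter, extract):
    """Yield statements lazily. Some records yield zero statements
    (filtered out); others expand into several."""
    for record in records:
        if not matches_filter(record):
            continue  # every discarded record shifts all later statement offsets
        yield from extract(record)  # one record may produce many statements

# With this shape, statement offset N has no fixed relationship to any record
# offset, so serving "offset=N, size=K" means either re-scanning from the
# start on every request or caching earlier results.
```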

vdancik commented 5 years ago

We chose pagination (offset + size) because it allows servers to fulfill requests without needing to keep the "state" of the requests internally.

lhannest commented 5 years ago

Maybe I'm not explaining it very well, but I'm talking about a case where the server cannot fulfill the request without keeping some kind of information about previous requests: there is no function mapping an offset and size of beacon records to an offset and size of knowledge source records.

NDEx is an example of this (you get pages of networks, and the size of those networks is variable), and so far I think Rhea's SPARQL endpoint is too, though maybe that's just because I'm not very familiar with SPARQL.

Another solution is to allow beacon responses to be larger or smaller than the requested page size. Maybe the requested page size could be treated as a suggestion rather than a requirement.

lhannest commented 5 years ago

Instead of an iterator key, the server could pass back a next-page token. This would be the best of both worlds: the server wouldn't have to keep track of each client, but it would also have the freedom to use more than just size and offset to get pages. For NDEx the token could encode a network offset and an offset within that network.
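A sketch of what such an opaque token could look like for the NDEx case, assuming the server is free to choose the encoding (the base64-wrapped JSON and the field names here are illustrative, not part of the beacon spec):

```python
import base64
import json

# Hypothetical next-page token: encodes which network the previous page
# stopped in and the position inside it, so the server stays stateless.

def encode_token(network_offset, inner_offset):
    payload = {"net": network_offset, "pos": inner_offset}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_token(token):
    payload = json.loads(base64.urlsafe_b64decode(token.encode()))
    return payload["net"], payload["pos"]
```

The client treats the token as opaque and simply hands it back; only the server needs to understand what is inside.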

RichardBruskiewich commented 5 years ago

@cmungall had the right idea in talking about the notion of a database "cursor". At the end of the day, it is all about simply retrieving all the relevant knowledge in this wild west of relatively boundless knowledge harvesting (graph processing can be NP-hard!). Streaming rather than random access satisfies this urge.

@lhannest's idea of returning a "next page" token containing "server specific" state is fine (@hsolbrig had a similar thought at the hackathon, albeit expressed slightly differently, more like a HATEOAS "more data" URL).

That said, wouldn't one need to somehow account for all the parameters of the original query, not just some naked index into the data, like the (NDEx) network offset and offset within that network? In other words, what constitutes the total "state" of a given query that informs the server to "continue" the work of retrieving a specific chunk of data?

Unless one encodes all such information into the return values, this still smells a bit like web server "session" state management. Maybe it doesn't make sense to have the beacon server completely abdicate cursor-management responsibility... maybe it still has to have some provision to keep track of an "ongoing" query.
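One way to square this circle, sketched here under the assumption that query parameters are small enough to embed (all names are illustrative): put the *total* query state into the token, and have the server verify on resume that the resubmitted parameters match, so no session state is needed and mismatched tokens are caught.

```python
import base64
import json

# Hypothetical token carrying the entire query state (filters + cursor),
# not just a naked index, so the server keeps no per-client session.

def make_token(query_params, cursor):
    state = {"query": query_params, "cursor": cursor}
    return base64.urlsafe_b64encode(
        json.dumps(state, sort_keys=True).encode()).decode()

def resume(token, query_params):
    state = json.loads(base64.urlsafe_b64decode(token.encode()))
    if state["query"] != query_params:
        # The client changed the query but reused an old token.
        raise ValueError("token does not match this query")
    return state["cursor"]
```

A production variant would likely also sign or encrypt the token so clients cannot tamper with it, but the stateless principle is the same.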

Even so, as @lhannest suggests, I suspect that it suffices to keep beacons "sequential streaming" rather than "random access paging", and let client software worry about presenting a cleanly behaved paging world to the end users.

For example, KBA, as a representative "client" of beacons, already "harvests" statements into its local Neo4j graph cache, thus making the data set better behaved with respect to offset/size paging. However, the ordering of the data is then based on Neo4j ordering, not on the original knowledge sources.