Make some Beacon-Aggregator API end points asynchronous queries

RichardBruskiewich commented 6 years ago

Query latency is a fact of life, as are periodic internet failures. In both cases, the beacon-aggregator performance degrades seriously.

One strategy for dealing with this is to make the problematic queries asynchronous and to introduce query management into clients.

The targets of such a design pattern would be API endpoints which take a significant amount of time to execute, namely the data retrieval endpoints that poll over multiple beacons. Metadata endpoints don't generally have this constraint.

What we might consider here is to implement the following query workflow in the beacon aggregator:

1. POST the query parameters, returning a Query ID along with a query status
2. Use the Query ID in an iterative polling GET call that returns the query status
3. When the query status says "data is ready", GET the data
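
To make the three steps concrete, here is a minimal Python sketch of the client-side loop. It runs against an in-memory stand-in rather than the real aggregator, so everything named here (`FakeAggregator`, `post_query`, `get_status`, `get_data`, the `queryId` and `status` fields, the `keywords` parameter) is a hypothetical placeholder and not part of the existing API.

```python
import itertools
import time


class FakeAggregator:
    """In-memory stand-in for the aggregator (hypothetical, not the real API)."""

    def __init__(self, polls_until_ready=2):
        self._ids = itertools.count(1)
        self._queries = {}
        self._polls_until_ready = polls_until_ready

    def post_query(self, params):
        # Step 1: register the query and return a Query ID plus an initial status.
        query_id = str(next(self._ids))
        self._queries[query_id] = {"params": params, "polls": 0}
        return {"queryId": query_id, "status": "processing"}

    def get_status(self, query_id):
        # Step 2: report whether the data is ready; here it becomes ready
        # after a fixed number of polls, to simulate beacon latency.
        query = self._queries[query_id]
        query["polls"] += 1
        ready = query["polls"] >= self._polls_until_ready
        return {"queryId": query_id, "status": "ready" if ready else "processing"}

    def get_data(self, query_id):
        # Step 3: return the (fake) results for the query.
        params = self._queries[query_id]["params"]
        return [{"hit": f"result for {params['keywords']}"}]


def run_query(aggregator, params, poll_interval=0.0, max_polls=10):
    """POST the query, poll its status, then GET the data once it is ready."""
    query_id = aggregator.post_query(params)["queryId"]
    for _ in range(max_polls):
        if aggregator.get_status(query_id)["status"] == "ready":
            return aggregator.get_data(query_id)
        time.sleep(poll_interval)
    raise TimeoutError(f"query {query_id} not ready after {max_polls} polls")
```

A real client would replace the three methods with HTTP calls and would probably want a non-zero `poll_interval`; the `max_polls` cap is there so that a beacon that never answers cannot block the client forever.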

One idea for the step 2 query status is to return it as a JSON array of status records, one per beacon queried, each with an object property 'beaconId', an HTTP status code and an integer count.

The beaconId is the beacon-aggregator index number of the beacon.
The HTTP status code is more or less as expected:
- If the HTTP status is '200', then the data from the given beacon is deemed 'Ready'.
- An HTTP status of '102' would indicate that the server has received and is processing the request, but that no response is available yet. Although this is a WebDAV-specific code, the meaning is quite analogous in our system.
- Other HTTP status codes could be returned to report other beacon HTTP errors or statuses.
If the status code is 200, then the integer count would be the non-negative number of query hits for the specified beacon (it could also be zero).
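
As an illustration of what such a status response might look like and how a client could read it, here is a small Python sketch. Only 'beaconId' is named in the proposal; the `status` and `count` property names, and the sample values, are assumptions made up for this example.

```python
import json

READY = 200       # data from this beacon is available
PROCESSING = 102  # request received, no response yet


def summarize_status(status_json):
    """Split a status array into ready, pending and failed beacons."""
    records = json.loads(status_json)
    # For ready beacons, keep the hit count (which may be zero).
    ready = {r["beaconId"]: r.get("count", 0)
             for r in records if r["status"] == READY}
    pending = [r["beaconId"] for r in records if r["status"] == PROCESSING]
    # Anything else is treated as a beacon-level error or status report.
    failed = {r["beaconId"]: r["status"]
              for r in records if r["status"] not in (READY, PROCESSING)}
    return ready, pending, failed


# Hypothetical step 2 response covering three beacons.
example = json.dumps([
    {"beaconId": 1, "status": 200, "count": 42},
    {"beaconId": 2, "status": 102},
    {"beaconId": 3, "status": 504},
])

ready, pending, failed = summarize_status(example)
```

With this sample input, beacon 1 is ready with 42 hits, beacon 2 is still being processed, and beacon 3 reports a gateway timeout.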

cmungall commented 6 years ago

I like it

Would it be possible to ask the aggregator for results-so-far? This way a client can start showing useful info, but still inform the user it may be incomplete

RichardBruskiewich commented 6 years ago

Thanks Chris.

I suppose step 2) above is somewhat independent of step 3), in that clients can use step 2) results to feed query status information back to the user. If the user executes step 3), the default would be to return any available results, presumably indexed by beacon id. If a given beacon's results are not yet available, that beacon id would either not show up in the list or show up with an empty result.

A slight refinement of this would be to allow the step 3) API call to be constrained by a subset of beacon ids (such as we have now in our endpoints). A well designed client might then request available results only once (to economize on bandwidth), or access only new results, as step 2) reports that they have become available.
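
One way a client could do that bookkeeping is sketched below in Python. The `status` property name and the sample poll results are assumptions for illustration; the idea is just to remember which beacon ids have already been fetched and to ask step 3) only for the ones that have newly turned 200.

```python
def newly_ready(status_records, already_fetched):
    """Return beacon ids that are ready (status 200) but not yet fetched."""
    return sorted(
        r["beaconId"]
        for r in status_records
        if r["status"] == 200 and r["beaconId"] not in already_fetched
    )


# Two successive (hypothetical) step 2 polls for the same query.
first_poll = [
    {"beaconId": 1, "status": 200, "count": 5},
    {"beaconId": 2, "status": 102},
]
second_poll = [
    {"beaconId": 1, "status": 200, "count": 5},
    {"beaconId": 2, "status": 200, "count": 12},
]

fetched = set()
requests_made = []
for poll in (first_poll, second_poll):
    beacons = newly_ready(poll, fetched)
    if beacons:
        # Here a real client would issue the step 3 GET, constrained
        # to just these beacon ids.
        requests_made.append(beacons)
        fetched.update(beacons)
```

Each beacon's results are requested exactly once: beacon 1 after the first poll and beacon 2 after the second.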

It might also be helpful for clients to keep track of the step 1) POST parameters. When they see the step 2) query hit counts, they could then iterate to refine the query without GET'ing the data if the original count is very high, and only retrieve the data once the count is tractable, implying a more precise set of hits.

NCATS-Tangerine / beacon-aggregator

Make some Beacon-Aggregator API end points asynchronous queries #33