Open pmorrill opened 5 years ago
Currently, the data request API entrypoint only supports querying on a single collection. This is in contrast to the `list_collections` entrypoint, which will return collection stats on multiple collections at once (ie: for a given filter set).

Presumably, once a user has seen the stats on a set of collections, the R-client can loop through them and run the `get_data` call on each in turn (that is probably how it works now - I admit I have not tried this). This approach has the advantage / disadvantage that if the user specifies a 'default' field list, s/he might get back different fields for the different collections, since the base default field set might not be the same for them all. It also has the disadvantage that each of these is treated as a separate query in the API servlet, leading to multiple DataRequests records being created - one for each collection code.

The server / database supports an extended model where the same query filters are used for multiple collections - this is how the web request form system operates. We could switch to using that model for the API, now that we are storing the DataRequests records in the database, for permanence.

Presumably, this would require some changes in the R-client (as well as the API servlet). The R-client could operate in one of two ways:

1. Loop through the set of collections, but use the same requestId for them all (the requestId is currently only used for pagination, so some adjustment would be required). In the servlet, this would allow me to add new DataRequestCollections to the specified 'formId' (aka requestId) as the queries are run. The advantage (I think) is that the bmdeVersion could be different for distinct collections... though in the R-client this might be a headache to manage.

2. Alter the query so that it includes all collection codes as a vector in the initial `get_data` call. The pagination would then return records for all collections specified. This then means that the bmdeVersion specified would be the same for all of them.
A follow-up comment. The API now supports submissions of a formId that was generated by the web-based data-request mechanism. These DataRequests will often be associated with multiple collections. In this case, the API only supports the 2nd model listed above, since it does not accept a collection code (or codes) as part of that query. Instead, it pulls the collection codes from the database (they were part of the original request), and as long as they have all been approved, it uses them all. This is precisely what option #2 above would be doing.
As I think about this, I'd prefer option #1 for all requestId-associated requests. The R-client would have a list of collections associated with a query and loop through them in turn (one at a time), using the same requestId for all of them and providing the collection code on each query. (It could also provide the bmdeVersion and 'fields' parameters.)
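A rough sketch of that loop on the client side might look like the following (purely illustrative: `srv_get_data()` stands in for whatever wrapper the R-client uses to hit the `get_data` entrypoint, and is not a real package function):

```r
# Illustrative sketch of option #1: one requestId shared across the loop.
download_collections <- function(request_id, collections, fields = NULL) {
  pieces <- lapply(collections, function(coll) {
    srv_get_data(requestId  = request_id,  # same requestId reused on every pass
                 collection = coll,        # one collection per query
                 fields     = fields)      # bmdeVersion could be passed the same way
  })
  dplyr::bind_rows(pieces)  # fields absent from a collection become NA columns
}
```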
Note that the set of collection codes would come from either an interactive `list_collections` call (for newly generated API queries) OR from a `list_requests` call (for web-generated or old API-generated queries).
It sounds more complicated than it is, I think. Maybe we need a phone call to settle this.
With respect to the R client, just a couple of points:
a) Yes, it loops through collections, currently with a new request id per collection.
b) I'm not sure there's a real benefit in supporting different fields for different collections in the same call. The data all end up in the same data frame, so missing fields for some collections will be created and populated with NA. If we don't want to support different field sets in the same call, I can always have R calculate the union of the field sets requested and send that as a single custom field list with the request. It just depends on what you prefer server side.
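If we go the union route, the R side of it is tiny (collection codes and field names below are invented for illustration):

```r
# Field sets requested per collection (illustrative names)
field_sets <- list(
  ABC = c("species_id", "survey_year", "latitude"),
  DEF = c("species_id", "latitude", "longitude")
)
# Collapse them into the single custom field list sent with the request
fields <- Reduce(union, field_sets)
#> "species_id" "survey_year" "latitude" "longitude"
```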
That's reasonable. I don't have a feel for the importance of allowing different bmdeVersion parameters for the different collections in a query. It sounds like a bit of a nightmare, frankly. We will let Denis weigh in on that.
The important question is whether we can continue to use your looping approach (which is probably a tidy way to do this) and support using the same requestId for each of the loops.
In effect, I would hand you back a requestId on every query, and as long as you are not changing any of the filter attributes (except collection name), you would just continue to use it.
If we do that, I think we might want a call to `clean_request_id` (or some such), so that I can perform garbage clean-out when you are done with the requestId. This is because I am caching the DataRequest object in static memory and only updating the database when the object is no longer in use. If the client used that call, it would signal that I can remove that object from static memory (it would persist in the database).
BTW: I think that a model where the `get_data` call accepts a collections parameter that is a vector is also possible, if that sounds more reasonable.
It should be relatively straightforward to make an initial call to get the request id outside the loop. But if the API accepts a vector of collection ids, this becomes even simpler.
Either way, I can then finalize the function with a call to `clean_request_id`. However, there should probably be a timeout or something, in case the function is interrupted before it gets a chance to make this call (it's pretty easy for users to interrupt R functions).
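On the R side, `on.exit()` would cover most interruption cases, since exit handlers also run when the user interrupts or an error is thrown; a server-side timeout would then only be needed for sessions that die outright. A minimal sketch, with `clean_request_id()` standing in for the hypothetical cleanup endpoint discussed above:

```r
nc_data_dl_sketch <- function(request_id, collections) {
  # Registered up front, so the cleanup call fires even if the user
  # hits Ctrl-C halfway through the download loop
  on.exit(try(clean_request_id(request_id), silent = TRUE), add = TRUE)
  for (coll in collections) {
    # ... page through the records for this collection ...
  }
}
```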
I am not sure I am following everything here.
I had always intended the API queries to function like the web ones. The user first generates the data request object by defining filters, the API returns a list of matching collections and counts, and the user then identifies which of those they want to download, making this a 2-step process. The user would presumably then pass the requestID value and a vector of collections to a get_data function in R. I am not sure I want to offer the convenience of allowing users to combine both the query and the data pull in a single call. The risk here is that they may be trying to download tens of millions of records without realizing it.
Internally, I envisioned that the R script would then iterate through each collection one by one and page through all the records matching the filters of that query, using the same requestID for each collection. In cases where `bmdeVersion = default`, then yes, I expected that the data returned by the API would differ among collections.
With respect to the first comment:
Right now users can put in a request to see the data counts (`nc_count()`), but they can also skip this step and see the counts and download the data all at once (`nc_data_dl()`). The failsafe here is that the first thing they see is a list of records to be downloaded, giving them a chance to abort the download if it's bigger than expected.
If you want another failsafe, I could put in an interactive query if the number of records is, say, above 50,000, asking the user if they wish to continue or abort.
However, I wouldn't recommend making the user collect the request id by hand and then supply it to another function. It's clunky, and it makes it much harder to have reusable scripts, which is one of the main benefits of an R package (this is where R packages differ from interactive online forms).
We can still prevent accidental downloads in other, more script-friendly ways.
Perhaps it would be useful to take a look at Danielle's workflow with the package and see how that works for her productivity and how it matches with what you envisioned?
I don't think we are far from that behaviour. The requestId is generated only on the first get_data call, which the R-client sets up after the user has made an initial query to count records. I am not sure if the R-client code enforces that order. Presumably you have tested the R-client more than I have.
What we are missing is the looping behaviour that we are discussing above, but it should be relatively easy to set this up. I just have to validate the collection codes to ensure they were part of the original request, or at least that the user has access to them.
Steffi: I am working through ways to organize this and am coming around to the following (which probably accords better with Denis' assumptions anyway):

- We start generating the requestId on the `list_collections` call.
- The server would then require a valid requestId on the `get_data` call.
- The requestId could come from a previous `list_collections` call, or from a web request form. In either case, you would only have to provide the requestId and the name of the collection you want to query. That collection would have to be from the set that was included in the original `list_collections` call (or, in the case of web requests, it would be validated against the 'approved collections' list for that request, as stored in the database).
Does that sound feasible on your side? Does it sound like it provides what we want? On my side, the Java code is somewhat / substantially simplified if we go this way.
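From the client's perspective, that proposal reduces to something like this (sketch only; `list_collections()` and `get_data()` here denote the raw API calls, and `my_filters`/`my_token` are placeholders):

```r
# Step 1: the server mints the requestId on the list_collections call
res <- list_collections(filters = my_filters, token = my_token)

# Step 2: every get_data call carries that requestId and may only name
# a collection that was part of the original result set
for (coll in res$collections) {
  page <- get_data(requestId = res$requestId, collection = coll)
  # ... accumulate pages ...
}
```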
Internally, the `nc_data_dl` function calls `list_collections` then `get_data`, so requiring that the request id come from `list_collections` would be very easy to deal with in the R package.
I think there are at least 2 different user approaches here. One, for users who aren’t exactly sure what they are getting, and the other for analysts like Danielle who are using this to access known data.
I agree that the latter group should be able to do this in a non interactive way.
Perhaps the approach we need is a prompt that appears by default when the user is trying to download more than X records in a single query (1 million across all collections?). But there could be an R parameter that allows disabling that warning for more advanced users. That parameter could be offered in the warning, which should happen right at the start, before any data is pulled.

> Warning: your request includes a large number of records (5.5 million from 21 datasets). Do you want to proceed? To remove this warning, you can use the `warning = FALSE` option.
Something like that.
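In R, that check could be as small as the following (the threshold and argument names are placeholders, not a final interface):

```r
confirm_download <- function(n_records, n_collections,
                             warn = TRUE, limit = 1e6) {
  if (warn && interactive() && n_records > limit) {
    ans <- readline(sprintf(
      "Your request includes %s records from %s datasets. Proceed? (y/n) ",
      format(n_records, big.mark = ","), n_collections))
    if (!tolower(ans) %in% c("y", "yes")) stop("Download aborted by user")
  }
}
```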
Our analytical code currently typically pulls 1 species from 1 collection at a time, runs the script, and moves on to the next species and/or collection. It will likely be more efficient (less prone to disconnections) to build a local database first, and then loop by species from the local db.
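Assuming the downloaded records arrive as an ordinary data frame (`downloaded` below is a placeholder), that workflow could be realized with `DBI` and `RSQLite` along these lines (table and column names are illustrative):

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "naturecounts_local.sqlite")
dbWriteTable(con, "records", downloaded, append = TRUE)

# Later analyses loop by species against the local copy, not the API
species <- dbGetQuery(con, "SELECT DISTINCT species_id FROM records")$species_id
for (sp in species) {
  dat <- dbGetQuery(con, "SELECT * FROM records WHERE species_id = ?",
                    params = list(sp))
  # ... run the analysis for this species ...
}
dbDisconnect(con)
```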
Something that I am not clear on: when do we require authentication (ie: a token present in the request)? In the original API spec, the `list_collections` and `get_data` calls were marked as authentication optional. But I believe we want to log all `get_data` calls, meaning that we need an authenticated user for that call, so we need a token.
I would like to move to that: authentication required for `get_data` queries. That leaves the question of the `list_collections` query. This is the initial query made to explore the record counts for a given set of filter attributes. Should it also require authentication?
Possibly the R client already enforces an authentication-required policy, but I would like to do the same thing in the servlet. That is, require a token for the `get_data` call, and possibly for the `list_collections` call. Steffi, is this enforced in the client?
> Internally, the `nc_data_dl` function calls `list_collections` then `get_data`, so requiring that the request id come from `list_collections` would be very easy to deal with in the R package.
I am going to set this up today, so that the `list_collections` call will return a requestId that can then be used in the `get_data` call. This will change the `get_data` call slightly, in that it will:

- now require a requestId
- ignore filter attributes apart from the collection attribute

It will continue to support the other existing parameters (the `bmdeVersion` and `fields` options, and paging).
For the moment, I would like to stick with allowing only a single collection per `get_data` request (rather than a vector). As I said above, it is a bit hard to support `bmdeVersion = default` otherwise.
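Put together, a revised `get_data` request would carry roughly this shape (a sketch; the exact parameter names are whatever the servlet already uses, and the values here are invented):

```r
body <- list(
  requestId   = "a1b2c3",     # now mandatory; minted by list_collections
  collection  = "ABC",        # a single code, validated against the request
  bmdeVersion = "default",    # still supported, per collection
  fields      = NULL          # still supported
)
```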
That all sounds great.
Yes, let's keep the single collection requirement on `get_data`.
It keeps things simpler in several respects (permissions, paging, field list, etc.).
I don't think I intended the `get_data` calls to be authentication optional, so I would agree about that.
I would tend to leave the `list_collections` call optional. This would allow people to explore data even without a login. We are still not saving those in the database until they actually call `get_data`, so there are no current requirements to have a user ID at that stage.
Yes - I think letting them explore without a token is ok. But I think that in that case (anonymous `list_collections`) I will NOT return a requestId, and not cache the DataRequest, since they cannot proceed to `get_data` anyway.
I suppose the R-client can prompt them to sign in, and then it could just repeat the `list_collections` call, providing the token, and then they'd be good to go.
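On the client side that could be a simple retry (sketch; `nc_sign_in()` is a hypothetical helper for obtaining a token, and `list_collections()` again stands for the raw API call):

```r
res <- list_collections(filters = my_filters)      # anonymous: no requestId returned
if (is.null(res$requestId)) {
  token <- nc_sign_in()                            # hypothetical sign-in step
  res   <- list_collections(filters = my_filters,  # repeat with the token
                            token   = token)
}
```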
Good point. Sounds fine.
As you say, this could potentially be handled by the R client.
@denislepage That's a perfect compromise; when I get back I'll add an interactive check that can be turned off with an argument.
@pmorrill With respect to authorization, right now the R client requires a token for `get_data` and optionally for `list_collections`. The R function `nc_count` uses a mix of `list_collections` and `list_permissions` to return the counts associated with all collections, or just those associated with a particular token, depending on the function arguments. I wrote this function very early on and can't remember the output of `list_collections` right now. Perhaps it doesn't actually need a token? Or perhaps it could be (or has been?) modified so that if a token is supplied it returns only the collections the user has permission for?
The conclusion of the discussion in issue #13 was that downloading observations (`get_data`) should require a token, but it should be optional for getting counts (though the R client will enforce the need for a token when using `list_collections` to get a request id for downloading).
If this makes no sense, I'm happy to give you a call so we can sort it out.
As for the changes to `get_data`, they sound good to me, but will break the R client. I'm traveling now and won't be able to fix it until I'm back in my office (next week). However, I'm pretty sure that Danielle is the only one using the package, and so it probably won't matter (@denislepage?)
No one else is currently using the package.
Haha, took me a while to write that response so you got ahead of me, it all sounds good to me though.
> As for the changes to `get_data`, they sound good to me, but will break the R client. I'm traveling now and won't be able to fix it until I'm back in my office (next week). However, I'm pretty sure that Danielle is the only one using the package, and so it probably won't matter (@denislepage?)
Not a problem. I will update the documentation and also send you a short list of changes that would affect you directly. Most will be minimal, and won't lead to major re-writes.
I may check some bits in before you are back, just to help stay organized, but we do not need to deploy to sandbox.