We need logging of the DataRequests generated by the R-client

pmorrill commented 5 years ago

I am separating this issue out as it is buried in another thread.

My intention is to start writing the DataRequest records to the Dendroica tables, when they are first generated by the R-client. At least that way you can perform queries to monitor usage. I will write the standard record format as it now exists in Dendroica.data_requests.

However, I have added 3 fields to the DataRequest object, that will not be reflected in the filter attributes: filterStartDayOfYr, filterEndDayOfYr, filterLocalityType.

If you add these to the Dendroica.data_requests table, I can doctor the DataRequest.java file to save and retrieve them. Is that how we want to handle this?

denislepage commented 5 years ago

You beat me to it. I have been meaning to revive this thread too.

I can put support for the additional fields. I also wanted to add one to track the origin of the request (web vs. api). I intend to only generate ZIP files for web requests, so I will need to modify the AKNQueue accordingly. Should be relatively simple.

I was initially thinking we’d probably convert the day of year fields as day/month for the database, and revert it when loading, but maybe it’s easier to just have them in their own fields. Looking at the SQL filters, the DOY fields will be used in predominance over day and month if both are present. I think that’d fine.

It would be useful to avoid updating the database table for every API call. I also think that we should only track the download calls, not the counts.

I made some changes as follow:

there’s an origin field to differentiate between web (default) and api requests. When creating an API request object, the method setOriginAPI() should be called (a constructor could also be used).
if the request is loaded from the database with a requestId, and the request origin is “web”, then the request origin should be changed to “mixed” with setOriginMixed().
the getSQLToInsert method now includes start/end doy, locality type and request origin.
I created a temporary insert procedure what the added fields (prInsertDataRequests2) and applied it to the getSQLToInsert method. We can revert to the main one (prInsertDataRequests) once the sandbox and live servers are in sync.
the setValue method now sets the 4 new fields from the database
The DataRequestsCollections class already has recordCount field, which we should ensure is being populated with the initial count method. There’s also a method called setDownloadedDt(Date), which should be used when the download method is called on that specific collection.
The finalize() method in DataRequests now check if the object is from the API (api or mixed origin), and check if any of the collections have been accessed (based on the downloadedDt field), to decide whether to save the object in the database. For now, I am writing all collections, even those that have not been accessed. The object will only save if the constructor with a projInfo has been used, or the projInfo and requestId have been manually set later (see below).

Outstanding issues (#1 for Paul, the rest mostly for me):

[ ] Because the empty constructor is currently used for DataRequests, the projInfo and the requestId will be both null, preventing the object from saving. We could either start using the constructor with ProjInfo, or manually set these values whenever it is determined that the object will be in need of saving.

datareq.setProjInfo(projInfo); datareq.setRequestId(Util.getNewId(getProjInfo(), "bm_data_requests"));

The only downside of using the projInfo constructor every time is that it will generate lots of request ID’s that never get saved. Probably not a major problem, but maybe the manual approach is better.

[ ] Modify the AKN queue to only create zip files with web and mixed requests
[ ] Make the request Id visible in the email notifications and on the web page.
[ ] Make a reference to the R package where appropriate (when things are ready).
[ ] Revert to the prInsertDataRequests proc.

pmorrill commented 5 years ago

One thing I noticed, is that the Dendroica field 'request_origin' can be null. But your enum does not resolve in that case, throwing an exception (line 2088 of DataRequest). Should there be a default of 'web' on that field?

denislepage commented 5 years ago

Yes, web should be the default.

pmorrill commented 5 years ago

I am concerned about using the 'finalize' function to trigger a database insert of new DataRequests. In my own system, this is not reliably triggered. Garbage Collection can be a little unpredictable, and even delayed considerably.

While it is possible to call finalize() explicitly, it is not recommended.

So I would suggest we move that functionality out of finalize and into it's own function, so it is easier to call on demand.

denislepage commented 5 years ago

Sounds good.

Here’s another idea that avoids the fickleness of the finalize method.

Initial request is created: DataRequests object is instantiated, with the list of possible collections matching the filters. Update method is not yet called to save to database. No point saving requests that are not actually downloaded at least in part.
When first page of individual collection is accessed, sets the downloaded_dt for that collection in the bean (to be safe) and then checks if the whole request object has already been saved or not:

a. if not (this is the first collection), run the update method on DataRequests

b. if yes, only update the downloaded_dt of the specific collection to match that of the collection in the bean.

You can call the exists() method to determine if the object has been saved. A new instance created within an existing ID (and well as instances created without a VALID request ID) has the field exist set to false. When you call the update method, the field is changed to true.

It’d look something like this:

DataRequestCollection drc = request.getCollection(collectionCode); if (drc != null && start == 0) { if (!request.exists()) request.update(); else try (SQLPreparedStmt ps = projInfo.getDb().getSQLPreparedStmt(projInfo, “update requests_collections set downloaded_dt = ? where request_id = ? and collection = ?”);) { drc.setDownloadedDt(new Date()); ps.setParameter(1, drc.getDownloadedDt()); ps.setParameter(2, request.getRequestId()); ps.setParameter(3, drc.getCollectionCode()); ps.executeUpdate(); } }

pmorrill commented 5 years ago

I am about ready to add the code that supports full DataRequests db record creation for api generated queries. Here are some notes and few questions.

(Steffi: much of what follows it is server-side talk. But there will be a few small changes to the api spec that arise form it: I will describe those in a separate email or posting once these changes go into the sandbox

Record creation

When a new query is initiated by the api - assuming the attribute validation suceeeds - a new DataRequests object is created, using a new 'formId' generated by Util#getNewId. A DataRequestCollections object us also added, corresponding to the collection code requested. I set the record count on that DataRequestCollections, and the status as 9 (approved), then add it to the DataRequests object.

If the query succeeds and data is delivered, I write the RequedstLabel as 'API Request' and set the requestOrigin as 'api'. I also set the RequestDate. Then call the 'upsert' function to push the records to the database.

Request re-use

Every successful request will now be written to the database, which means that the user can later re-run any request, whether generated by the R-client or on the web forms. In either case, the user can provide a bmdeVersion parameter, and a 'fields' parameter (if bmdeVersion == 'custom'). And any request will also be paginated, using the lastRecord and numRecords parameters.

Anytime a request is run, I am calling the DataRequestsCollections#setDownloadedDt function, and running an upsert. This means that if the R-client splits a request over several pages, the upsert call will be run several times. (Note that I have not yet considered Denis' latest post above, with the alternative way to manage the upsert event.....I will do that)

If a request origin was set as 'web', it will be switched to 'mixed' as soon as it is run through the API.

The api entrypoint 'list_requests' will now respond with all types (web, api, or even mixed). I am including the requestOrigin value in the list returned to the R-client, and we may consider adding some filtering to the 'list_requests' api entrypoint. For example, we could allow filtering to include only web generated requests, etc.

Data Collections in a Request

Requests generated by the R-client only ever include a single collection code, and the DataRequestsCollections status is always set to 'approved' (we have already checked the user's access to the collection).

But, requests generated by the web forms may include multiple collections, and it is possible that a subset of those collections will not yet be approved when the query is run. This is handled by checking for approval on the set of collections, and only including the approved ones in the sql query.

Data Requests are also validated against the client userId, to prevent X from running a request approved for Y (or even generated by Y).

That's the basic system, which I will check in tomorrow if my testing is good this afternoon.

denislepage commented 5 years ago

Just to clarify, a single request can have multiple collections included, but we should probably only keep track of collections that are actually accessed for data. Why would we not include multiple collections in the same request?

If someone says “Show me all data records for American Robin”, but then only proceeds to download 4 of the collections, then only those 4 should be saved, but I would not generate 4 distinct requests objects for that. The request object is meant to define the properties of the query, which can span several datasets.

You then would need to create a new instance of DataRequests when the initial query is sent (the one that returns a count of records). Ideally, this isn’t saved, but cached in memory? I’m not sure we have resolved this yet. If not saved, this potentially limits the accessibility of the request ID we return (to however long we decide to keep them in memory). There may be value in saving the request details, but perhaps without including any of the collections? To discuss

When page #1 of a download call is made for any of those collections, a new DataRequestsCollections object is added, with the properties as you describe, and the DataRequests object is saved/updated to the database.

To follow up. I may be overlooking some challenges.

pmorrill commented 5 years ago

Our api spec only allows ONE collection per data request. That's how it was set up. We can make changes of course.

I do not save a query request when only a count of records is generated. The system only saves a request object to the database when actual data is downloaded, though it updates it on subsequent pages - which seems unnecessary.

DataRequests are not being cached in memory during pagination right now. The formId (requestId) is used to pull them from the database on each page-call. I will be considering ways to streamline that tomorrow / Wednesday. Caching in a static object works, but we need some sort of time-out and garbage collection.

None of this specific code is in Git right now, as I have not finished my testing.

BirdsCanada / NatureCountsAPI

We need logging of the DataRequests generated by the R-client #22