Redundant query filters

steffilazerte commented 5 years ago

I'm adding in validity and redundancy checks (redundancy meaning, e.g., not passing both country and statprov, because statprov is sufficient). Validity checks are pretty straightforward (codes must be in the metadata lists, years must be between 1900 and present, Julian dates within 1:366).
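For illustration, the validity checks described above could look roughly like the sketch below. It is written in Python rather than R purely as an example, and the function name `check_filters`, the filter names (`start_year`, `end_day`, etc.) and the `valid_codes` lookup are assumptions, not the client's actual arguments.

```python
from datetime import date

def check_filters(filters, valid_codes, today=None):
    """Return a list of problems found in `filters` (an empty list means valid).

    `valid_codes` maps a filter name to the set of codes allowed for it,
    standing in for the metadata lists (hypothetical structure).
    """
    today = today or date.today()
    problems = []

    # Codes must appear in the corresponding metadata list.
    for name, allowed in valid_codes.items():
        value = filters.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name}: unknown code {value!r}")

    # Years must be between 1900 and the current year.
    for name in ("start_year", "end_year"):
        value = filters.get(name)
        if value is not None and not 1900 <= value <= today.year:
            problems.append(f"{name}: {value} is outside 1900-{today.year}")

    # Days of year must fall within 1..366.
    for name in ("start_day", "end_day"):
        value = filters.get(name)
        if value is not None and not 1 <= value <= 366:
            problems.append(f"{name}: {value} is outside 1-366")

    return problems
```

Returning a list of problems rather than stopping at the first one lets the caller report everything wrong with a request in a single message.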

But when it comes to redundancy I thought I should perhaps check in with you two, in case there's something I've missed. So far I have these rules:

country > statprov > subnational2 are redundant, keep only smallest
minLat, maxLat, minLong, maxLong are not redundant with the above. I.e., they can mix with country/statprov/subnational2 because users might only supply one of the four options (just minLat for example).
utmSquare is redundant with all the above (so if utmSquare is supplied, minLat, maxLat, minLong, maxLong, country, statprov and subnational2 are all ignored)
siteCode, IBA and BCR are all redundant with all of the above (unless users might want to select, for example, a minimum latitude in a particular IBA?)
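As a rough sketch of how the four rules above fit together (Python used only for illustration; `drop_redundant` and the message wording are hypothetical, and the open question about combining a minimum latitude with an IBA is not handled):

```python
REGION_ORDER = ["country", "statprov", "subnational2"]  # coarse -> fine
BBOX = ["minLat", "maxLat", "minLong", "maxLong"]
SITE_LEVEL = ["siteCode", "IBA", "BCR"]

def drop_redundant(filters):
    """Return (cleaned_filters, messages) after applying the redundancy rules."""
    kept = {k: v for k, v in filters.items() if v is not None}
    messages = []

    def drop(names, because):
        # Remove each named filter that is present and record why.
        for name in names:
            if name in kept:
                del kept[name]
                messages.append(f"{because} provided, ignoring {name}")

    site = [n for n in SITE_LEVEL if n in kept]
    if site:
        # siteCode, IBA and BCR override every other geographic filter.
        drop(REGION_ORDER + BBOX + ["utmSquare"], site[0])
    elif "utmSquare" in kept:
        # utmSquare overrides the bounding box and the region codes.
        drop(REGION_ORDER + BBOX, "utmSquare")
    else:
        # Keep only the smallest region; bounding-box values are left alone.
        present = [n for n in REGION_ORDER if n in kept]
        if len(present) > 1:
            drop(present[:-1], present[-1])

    return kept, messages
```

For example, `drop_redundant({"country": "CA", "statprov": "MB", "minLat": 45})` keeps `statprov` and `minLat` and reports that `country` was ignored.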

Any changes or additions?

denislepage commented 5 years ago

Those sound fine. In theory, some of these combinations could be valid, but in the interest of keeping queries simpler and more responsive, it’s a reasonable tradeoff.

At least some warnings would be good when parameters are being ignored, e.g. “statprov code provided, ignoring country code”.

The instructions in the book should highlight that type of approach, indicating that the API is a coarse filter, and that they can apply finer-grained filtering once they have the data. For instance, a regular request we have is from people wanting to do a data extraction based on a shapefile. We could tell them how to extract the bounding box of the local shape in R, send that as min/max Lat/Lon, and then do the overlay locally in R.
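To make the two-step idea concrete, here is a minimal sketch: a coarse bounding-box request followed by a local overlay. It uses plain Python instead of R and a hand-written ray-casting test instead of a spatial library, and the study area and records are invented purely for illustration.

```python
def bounding_box(polygon):
    """polygon: list of (lon, lat) vertices. Returns the four bounding-box filters."""
    lons = [lon for lon, _ in polygon]
    lats = [lat for _, lat in polygon]
    return {"minLat": min(lats), "maxLat": max(lats),
            "minLong": min(lons), "maxLong": max(lons)}

def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many edges a ray running east from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical triangular study area and two records returned for its bounding box.
area = [(-100.0, 49.0), (-96.0, 49.0), (-98.0, 52.0)]
records = [{"id": 1, "lon": -98.0, "lat": 50.0},   # inside the triangle
           {"id": 2, "lon": -99.9, "lat": 51.9}]   # inside the box, outside the triangle

bbox = bounding_box(area)  # step 1: send these four values to the API
kept = [r for r in records
        if point_in_polygon(r["lon"], r["lat"], area)]  # step 2: local overlay
```

In practice the overlay would be done with a proper spatial package on the user's side; the point is only that the API sees four numbers and the exact shape never leaves the client.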

Thinking about this a bit more, and as I have hinted previously, I would also want to limit the number of filter concepts they are applying to no more than about 3 at the same time (I would count the 4 parameters of the bounding box as 1 concept for that purpose), and I think they should generally not be able to mix more than 1 concept of the same type.

It’s like a Chinese restaurant. They can pick up to any 3 among that list, and no more than 1 per category, or something like that:

· Species ID

· Geographic region (country, statprov, subnat2, bcr, iba, utm square, bounding box)

· Collection (collection code or project)

· Year (start/end year)

· Day of year (start/end doy). use that terminology vs. Julian date which starts at 0 and allows for fractions

· Site type (see below)

Am I forgetting any parameter?

To help the user, we could still do the sort of cleaning you suggested above (e.g. ignore country with statprov), and of the remaining list of parameters, run a validation against the rule for 1 per category / 3 category max. I know it is yet one more thing to force people to understand, but I also know that too many options will lead to time outs in many cases.
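A possible shape for that validation, sketched in Python for illustration only. The `CONCEPTS` table and `check_categories` are hypothetical, the parameter spellings are assumptions, and the four bounding-box values are collapsed into a single "bbox" concept as described above.

```python
# Map each parameter to (category, concept). Parameter names are assumed.
CONCEPTS = {
    "species": ("species", "species"),
    "country": ("region", "country"),
    "statprov": ("region", "statprov"),
    "subnational2": ("region", "subnational2"),
    "bcr": ("region", "bcr"),
    "iba": ("region", "iba"),
    "utmSquare": ("region", "utmSquare"),
    "minLat": ("region", "bbox"), "maxLat": ("region", "bbox"),
    "minLong": ("region", "bbox"), "maxLong": ("region", "bbox"),
    "collection": ("collection", "collection"),
    "project": ("collection", "project"),
    "startYear": ("year", "year"), "endYear": ("year", "year"),
    "startDay": ("day_of_year", "day_of_year"),
    "endDay": ("day_of_year", "day_of_year"),
    "siteType": ("site_type", "site_type"),
}

def check_categories(filters, max_categories=3):
    """Return a list of rule violations (an empty list means the request is fine)."""
    used = {}  # category -> set of concepts supplied in that category
    for name, value in filters.items():
        if value is None or name not in CONCEPTS:
            continue
        category, concept = CONCEPTS[name]
        used.setdefault(category, set()).add(concept)

    errors = []
    # No more than one concept per category.
    for category, concepts in used.items():
        if len(concepts) > 1:
            errors.append(f"{category}: pick only one of {sorted(concepts)}")
    # No more than `max_categories` categories in total.
    if len(used) > max_categories:
        errors.append(f"{len(used)} filter categories used, "
                      f"at most {max_categories} allowed")
    return errors
```

Running this after the redundancy cleanup means a request like country plus statprov is first reduced to statprov and only then counted as one geographic concept.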

A couple possible parameters I would like to add:

1) Site Type (for now, only supporting the value IBA, to identify whether the site falls within an IBA or not). This may be a standalone parameter that can be combined with other geographic things except IBA site, since we may want to allow, say, all IBA data within a BCR or province. I think that sites outside IBAs are saved as NULL, but I will confirm, as I also see “N/A” strings. The SQL filter would then be something like “iba_site IS NOT NULL”.

2) I will look into whether we can or should add a family parameter (e.g. all Anatidae). This would have to rely on an external table so I don’t know yet if that is of sufficient interest or feasible.

I think the “site type” is relatively simple and useful, so I’ll probably want to add that one.
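A small sketch of what that filter might translate to, using Python's built-in sqlite3 as a stand-in for the real database. Only the `iba_site` column name comes from the discussion above; the `observations` table, the sample codes and `site_type_clause` are made up, and excluding “N/A” strings is an assumption pending confirmation of how non-IBA sites are stored.

```python
import sqlite3

# In-memory stand-in for the real table (hypothetical name and sample values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (record_id INTEGER, iba_site TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [(1, "MB001"), (2, None), (3, "N/A"), (4, "ON017")],
)

def site_type_clause(site_type):
    """Translate the site type filter into a WHERE clause.

    None means no filtering; "IBA" keeps rows that fall within any IBA.
    """
    if site_type is None:
        return ""
    if site_type == "IBA":
        # Exclude both NULLs and literal 'N/A' strings (assumption, see above).
        return " WHERE iba_site IS NOT NULL AND iba_site <> 'N/A'"
    raise ValueError(f"unsupported site type: {site_type!r}")

rows = conn.execute(
    "SELECT record_id FROM observations"
    + site_type_clause("IBA")
    + " ORDER BY record_id"
).fetchall()
```

With the sample rows above, only records 1 and 4 survive the “IBA” filter.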

Let me know Steffi if you have any questions.

steffilazerte commented 5 years ago

Okay thanks for the clarification!

I have just a couple of confirmations, comments, and questions :)

1) Any time I change a user request, I'll make sure a message prints to that effect.

2) I think I'll revamp the function arguments to highlight these categories a bit more explicitly; that should make it easier for users to understand.

3) I've been using 'start_season' and 'end_season' rather than Julian day or day of year, to be explicit that this isn't bounding dates overall, but rather it bounds dates within a year. I'm happy to change it back to day of year, though, if you prefer.

4) One other set of parameters is bmdeVersion and fields: do they count? Or are they separate?

5) I don't think having a family parameter is necessary, as long as an extensive species list is okay (something Paul and I have been discussing in #11). I have added (locally, not pushed yet) an example showing users how to grab all species ids from a specific family and use that for the download.

6) I have added the example of filtering observations to a bounding box to the articles wishlist.

denislepage commented 5 years ago

Fields and bmdeVersion shouldn’t be counted in the same list, since they are not filtering parameters affecting rows.

Yes, I think I would use day_of_year. Season is a bit more nebulous.

The bounding box example would also include the next step of filtering the local data against the shapefile, ideally.

steffilazerte commented 5 years ago

Sounds good!

pmorrill commented 5 years ago

I plan to start adding server-side validity checks to filter attributes this afternoon. My intention is to stick to the obvious 'out-of-range' checks to start with, and leave the more subtle stuff to the R-client (especially since you have gotten ahead of me on this!)

So for example, I will sanity check the following:

longitude and latitudes
startYear and endYear
startDay and endDay (but see my previous post/question)

I already have some error traps for invalid bmdeVersion.
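Roughly, the out-of-range checks listed above might look like the following. This is a Python sketch for illustration, not the server code: `sanity_check` is a hypothetical name, and the ordering of startDay/endDay is deliberately left unchecked because the thread does not settle how that pair should be handled.

```python
# Allowed numeric range for each parameter.
RANGES = {
    "minLat": (-90, 90), "maxLat": (-90, 90),
    "minLong": (-180, 180), "maxLong": (-180, 180),
    "startDay": (1, 366), "endDay": (1, 366),
}

# Pairs where the first value should not exceed the second.
# The longitude pair assumes boxes do not cross the 180-degree meridian.
ORDERED_PAIRS = [("minLat", "maxLat"), ("minLong", "maxLong"),
                 ("startYear", "endYear")]

def sanity_check(params):
    """Return a list of error strings for out-of-range or inverted values."""
    errors = []
    for name, (low, high) in RANGES.items():
        value = params.get(name)
        if value is not None and not low <= value <= high:
            errors.append(f"{name}={value} is outside [{low}, {high}]")
    for first, second in ORDERED_PAIRS:
        a, b = params.get(first), params.get(second)
        if a is not None and b is not None and a > b:
            errors.append(f"{first}={a} is greater than {second}={b}")
    return errors
```

Keeping the server to checks like these leaves the subtler redundancy and category rules to the R client, as proposed.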

steffilazerte commented 5 years ago

I'm just in the process of double checking that I have all the filters in place with validity and redundancy checks.

Two filters that I'm a bit confused about are Site Type (siteType?) and Site Code (siteCode).

Site Type was discussed above, but isn't present in my index.html API cheat sheet. I assume the filter is siteType and that it takes an argument of either NULL or IBA, but could someone confirm that? Also, in the data, it is definitely filled with N/A strings, which might interfere with the function.

siteCode is in my index.html API cheat sheet, but wasn't discussed above, should it be considered a Geographic regional filter (i.e. fits in with country, statprov, subnat2, bcr, iba, utm square, and bounding box)? Or are we dropping it?

Incidentally, SiteCode is the one field in the downloaded data that looks like it should be a basic field (i.e. should be in snake_case) but is in fact in CamelCase.

denislepage commented 5 years ago

SiteCode. Yeah. ¯_(ツ)_/¯

I’m a bit hesitant about having fields with different names in the table and the API. An exception here would have to be something that Paul implements, since he is reading the table directly, not a view.

Site type: yes, the intent was to have no value (no filtering) or “IBA”. The latter would return all data within any IBA. I suspect Paul hasn’t implemented that option yet.

pmorrill commented 5 years ago

Site Type has not been implemented, correct. I don't recall seeing it in the initial spec, but I might have missed it. In any case, I will go ahead, yes?

Probably add a function to DataRequests object, to handle this.

API filter attribute will be siteType

pmorrill commented 5 years ago

If you build to bsc-base and deploy to the sandbox, this can now be tested. (Also includes fixes re: handling start and end days.)

denislepage commented 5 years ago

OK deployed.

BirdsCanada / NatureCountsAPI

Redundant query filters #12