BirdsCanada / NatureCountsAPI

NatureCountsAPI
0 stars 1 forks source link

Day of Year queries #13

Closed steffilazerte closed 5 years ago

steffilazerte commented 5 years ago

Right now the day of year queries work like this:

data >= startDay & data <= endDay

Which is perfect if you want to collect all data from the summer, or migration (i.e. startDay < endDay; 60 to 300). However it doesn't work if you wanted to collect only winter days (i.e. startDay > endDay; 300 to 60).

Would it be possible (and useful) to have the API do:

IF   startDay <= endDay    THEN    data >= startDay & data <= endDay
IF   startDay > endDay     THEN    data >= startDay | data <= endDay

If not, I can make this possible in the R client, but it would involve querying the data twice, once for the end of the year, once for the start, so I thought it might be easiest in the API.

pmorrill commented 5 years ago

Wouldn't we also have to set the startYear and endYear different to accomplish what you are describing above? This leads to another related question I meant to ask:

What does this mean (do we allow it?):

startYear:2011 endYear:2015 startDay:200 endDay:300

Does it mean we want all data from DOY 200 in 2011 until DOY 300 in 2015?

Or, does it mean we want data from DOY 200 to DOY 300 for each year between 2011 and 2015 inclusive?

It seems that we should stick with the first interpretation, or it becomes too complicated. In other words, all DOY ranges are considered to be continuous.

With that in mind, your query becomes:

startYear: 2015 endYear: 2016 startDay: 300 endDay: 60

And the interpretation is 'return all data from DOY 300 of 2015 until DOY 60 of 2016'

steffilazerte commented 5 years ago

I think that the second option (200 to 300 for each year) really adds useful functionality because it allows users to download seasonal data (this is the current implementation, I think). It's really easy to filter data to a specific date range (the first option), but filtering each year after downloading can be tricky and means you have to download a lot more data in the first place.

pmorrill commented 5 years ago

You are right: that is how it works now. The SQL query ends up performing a 'AND' on the two ranges, so for example:

where startYear >= 2011 AND endYear <= 2015 AND startDOY >= 200 AND endDOY <= 300

As you say, this will return the data set matching my 2nd scenerio above. In fact, the first scenerio is the harder of the the two to query.....

steffilazerte commented 5 years ago

Exactly, and this is great, the only thing I was wondering was whether it would be possible, if endDOY was < startDOY, to change that query, to:

where startYear >= 2011 AND endYear <= 2015 AND (startDOY >= 300 OR endDOY <= 200)

This would allow selections centered on New Years, as opposed to on midsummer.

But if that's overly tricky, I can figure it out in R.

pmorrill commented 5 years ago

I will look into that - the logic we are discussing is encapsulated in a class that Denis uses on his website filters too. I am not as familiar with the options there. But I'll check into it.

denislepage commented 5 years ago

Yes, the day of year and year parameters should be distinct.

My intent was to have this query:

Year between 2001 and 2005 DOY between 100 and 200

means that you get observations from 2001 to 2005, but only between the DOY 100 and 200 each year.

The query would look like:

SELECT * from data where (year between 2001 and 2005) AND (doy between 100 and 200)

To do continuous dates would require a different “date” parameter that has the 2 concepts combined, but I don’t think there’s enough of a need for that type of query (e.g. from 2001-05-01 to 2005-10-01, would include all of 2001 after May 1st, all of 2002-2004, and 2005 until Oct 1st).

The NC web site supports both types, but I think the API should only support what I called “seasonal” date filters, and not continuous ones.

denislepage commented 5 years ago

I generally would support the winter season query (end < start), though I’ll have to test how this performs on the data query side. OR statements may effectively create 2 separate steps in the query plan, but that would probably still be more efficient than having to make 2 separate calls.

steffilazerte commented 5 years ago

@pmorrill I know you're not working on BSC right now, but when you get a chance next week!

I'm not sure if this is related to the way DOY is specified, but I've been running into some odd DOY error messages.

For example, with the "RCBIOTABASE" collection...

I have no problem with these queries:

But these ones all of a sudden return errors:

But if I have both start year and end year, I'm get no error messages when I add DOY filters. It seems like the server doesn't like day of year without a start/end year, is that what's happening?

steffilazerte commented 5 years ago

Also, if we're going to allow filtering by day of year, it could also be useful to return day of year as a field in the data.

I can easily convert it in R for a data frame, but I'm struggling in my tutorials to show users how to do post filtering using day of year in a SQLite database. It would be much easier if it was already present as a field!

denislepage commented 5 years ago

We probably should just have a function in R.

We should look into using dplyr for those types of transformations.

steffilazerte commented 5 years ago

We can, it's very simple for data frames, just a bit trickier to go from survey_year, survey_month, survey_day to day of year all within a SQLite data base (i.e. trying not to collect any data).

But I can make a little utility function for that, it shouldn't be a problem.

denislepage commented 5 years ago

I think lubridate is your friend:

https://blog.exploratory.io/5-most-practically-useful-operations-when-working-with-date-and-time-in-r-9f9eb8a17465

https://stackoverflow.com/questions/34189367/find-day-of-year-with-the-lubridate-package-in-r

That might require collecting the data into a dataframe.

SQLite also has some date functions.

https://www.sqlite.org/lang_datefunc.html

Something like this would work:

select julianday(survey_year || "-" || substr("00"||survey_month, -2, 2) || "-" || substr("00"||survey_day, -2, 2))

e.g.

select julianday("2018" || "-" || substr("00"||"4", -2, 2) || "-" || substr("00"||"1", -2, 2))

Running something like that in SQLite would populate a new field, without having to load the data in R.

ALTER TABLE bmde_data ADD COLUMN doy INTEGER NULL;

UPDATE bmde_data set doy = julianday(survey_year || "-" || substr("00"||survey_month, -2, 2) || "-" || substr("00"||survey_day, -2, 2))

steffilazerte commented 5 years ago

Yup, that's broadly what I've got as the helper function (depending on whether you give it a data frame or a database connection).

denislepage commented 5 years ago

Nope. I was looking for a SQLite only solution, but whatever works.

pmorrill commented 5 years ago

I have no problem with these queries:

No filters: 12598 obs
Start Year 2000: 9372 obs
Start Year 2000 & start doy 120: 7706 obs

But these ones all of a sudden return errors:

Start Year 2000 & start doy 120 & end doy 300: Invalid endDay
start doy 120 & end doy 300: Invalid startDay, Invalid endDay

But if I have both start year and end year, I'm get no error messages when I add DOY filters. It seems like the server doesn't like day of year without a start/end year, is that what's happening?

Yes - the error trap is over-zealous in this case. I am adjusting it.

steffilazerte commented 5 years ago

Thanks for fixing the errors, it works fine for me now.

How likely is it that we'll get the option of selecting a winter range? Right now the API doesn't return any data if I try to use a startDay of 20 and an endDay of 300 (either is fine, though as would be expected).

Should implement this on the R side, or is it possible on the API side?

pmorrill commented 5 years ago

I will consider what is needed in the Java, but Denis also expressed concern about testing how efficient such a query would be.

denislepage commented 5 years ago

I did a bit of quick testing. Doing 2 separate queries is likely slower than doing an OR condition.

I would likely want to change the DataRequests.getSQLToFilter method to support this.

I can have a look.

pmorrill commented 5 years ago

Yes - DataRequests is where I would put it too. But if you want to create something in there, all the better!

denislepage commented 5 years ago

OK, I just pushed a change to the sql filters, which I haven’t tested yet.

Day of year is a new parameters that you added correct? If we decide to save the data request beans, we’ll have to decide how to deal with it. We could convert it to day/month values and do the reverse when loading, for instance. Or there may be value in adding it as a distinct parameter. I have to add a query type (web vs. api) at the very least.

pmorrill commented 5 years ago

I have added 3 parameters:

startDayOfYear endDayOfyear siteType (which is either null or 'IBA')

Just looking at your changes to accommodate the split-year queries

steffilazerte commented 5 years ago

I was just testing the DOY filters, and it looks like you've got the split-year queries going!

I may be jumping the gun here (are you still working on this?), but in case you have finished, I wanted to point out a small change: the logical operators should be switched.

For example,

(bscakn.dbo.DMY2DOY(survey_day, survey_month, survey_year) <= 300 OR bscakn.dbo.DMY2DOY(survey_day, survey_month, survey_year) >= 20)

should be

(bscakn.dbo.DMY2DOY(survey_day, survey_month, survey_year) >= 300 OR bscakn.dbo.DMY2DOY(survey_day, survey_month, survey_year) <= 20)

So that it grabs number from 300-366 or from 0-20.

denislepage commented 5 years ago

Yes, you are correct. I have posted the change just now.

steffilazerte commented 5 years ago

Great, works perfect on my end