IDEMSInternational / epicsawrap


Package functions #1

Open dannyparsons opened 1 year ago

dannyparsons commented 1 year ago

This is what I'm thinking as the main functions in this package. These functions would then have corresponding Python wrappers accessed by the PICSA App:

| Function | Parameters | Returns | Stub | Initial | Final |
| --- | --- | --- | --- | --- | --- |
| station_metadata | country | A table with columns: station_id, station_name, latitude, longitude, altitude, station_type, date_range(?), elements(?), _last_updated | :white_check_mark: | | |
| nearby_stations (not for initial version) | country, lat, lon | Get the list of “nearby stations” (data sources) from a given lat/lon with their data properties and type - station or satellite point, % infilled from satellite. One source is the default, which is recommended as the most accurate data for the given lat/lon. A table with columns: station_id, station_name, latitude, longitude, altitude, date_range(?), elements(?), columns for data properties, default (boolean). | | | |
| definitions | country, station_id | Get definitions used for summaries for a specific station. (Definitions will be provided and managed by the Met Service.) Returns a named list with one element per summary. | :white_check_mark: | | |
| annual_rainfall_summaries | country, station_id, summaries (optional) | A table containing all the annual rainfall summaries for PICSA, e.g. start of rain, total rainfall, number of rain days, end of season. One row per year and one column per summary. | :white_check_mark: | | |
| annual_temperature_summaries | country, station_id, summaries (optional) | Same as above, but for all the annual temperature summaries for PICSA, e.g. mean minimum/maximum temperature, extreme maximum temperature. | :white_check_mark: | | |
| monthly_temperature_summaries | country, station_id, summaries (optional) | Same as above, but for monthly temperature summaries. The table would have one row per month-year. | :white_check_mark: | | |
| season_start_probabilities | country, station_id, start_dates | A table containing the probabilities of the season starting on or before a set of particular dates. One row per start date, a column for the probabilities. | | | |
| crop_success_probabilities | country, station_id, planting_dates, water_requirements, crop_length, start_before_season (optional) | A table containing the probabilities of crop success for given planting maturity lengths, seasonal total rainfall requirements, and planting dates. | :white_check_mark: | | |

For all of the summary functions, the returned table would include a column for the station, so results could be produced for a single station or for multiple stations.
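To make the shape concrete, the R signatures might look something like the following (a sketch only; parameter names and defaults here are assumptions, not the final API):

```r
# Sketch only: possible R signatures for two of the functions above.
# Parameter names and defaults are assumptions, not the agreed API.
annual_rainfall_summaries <- function(country, station_id, summaries = NULL) {
  # would return a data frame: one row per station-year, one column per summary
}

season_start_probabilities <- function(country, station_id, start_dates) {
  # would return a data frame: one row per start date, plus a probability column
}
```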

@chrismclarke Does this make sense as a broad structure for these functions and the sort of thing you were expecting for the API endpoints? This is a first draft so happy to adapt the structure if there are things which would make it easier to handle in the PICSA App.

@lloyddewit How does this look to you? Any comments or feedback? I will still need to define the function parameters more specifically for your work on the Python wrapper.

chrismclarke commented 1 year ago

Thanks @dannyparsons, looks very reasonable to me. I think the main use cases I'm aware of are the crop probabilities and the rainfall/season/temp graphs, which are covered.

The only possible addition I could think of would be whether the wrapper also provides access to media/reports generated by the station (e.g. a forecast or bulletin), or whether these would be handled through a more central mechanism (there could still be an API endpoint, just one that links to a different service rather than this wrapper).

Then, as we start to get into implementation details, a couple of things that I'm sure you're already aware of:

**Date parameters** - how best to format? In the past we've used numeric values for the day number in a year, although that can get a bit confusing in cases where the year starts from July 1st or similar (a season spanning multiple years). My gut feeling would be to just allow a date object to be passed, which would then have any year and time data stripped off (as not relevant), but we can find a way to work with any system you come up with.

**Return objects** - a similar issue to the above: we will probably want to standardise as much as possible how to represent something like chart data. I'm guessing one column for the data values and another for metadata (e.g. column labels, definition used, etc.) might be nice, but again we can be flexible and plan as we go.

**Definitions** - there might be a question as to whether the functions accept a definition as one of the input parameters (with a default to apply) or not. I can't see the app ever wanting to give the user the ability to view under different definitions; it's more for your own utility if you ever wanted a dashboard where the climate team could view multiple charts under different definitions.

lloyddewit commented 1 year ago

@dannyparsons @chrismclarke Thank you for the detailed comments.

Parameters

@dannyparsons Please could you confirm the data type for each parameter? The more detail the better (e.g. valid ranges/values). My initial guesses/assumptions are:

Returns

@dannyparsons Each function returns a table apart from definitions(), which returns a 'named list'. Please would it be possible for each function to return a table (so that we could handle each function in a standardized way)? I assume that each table will be returned as an R data frame. If possible, could the data frames contain only simple data types (strings, doubles, integers)? Date types are possible but add complexity (see below).

> **Date parameters** - how best to format? In the past we've used numeric values for the day number in a year, although that can get a bit confusing in cases where the year starts from July 1st or similar (a season spanning multiple years). My gut feeling would be to just allow a date object to be passed, which would then have any year and time data stripped off (as not relevant), but we can find a way to work with any system you come up with.

@dannyparsons @chrismclarke In OpenCDMS, date types caused problems during the conversion between Python and R (e.g. the conversion automatically applies time zone conversions causing dates to sometimes shift by one day). I found workarounds but it increased complexity/risk. I would strongly prefer to use an integer day number for function parameters. Would it also be possible to represent dates in the returned data frames as integers?

Thanks!

chrismclarke commented 1 year ago

@lloyddewit - I think dates will always cause some complexity (particularly for the reason you said relating to how timezones are applied). There will definitely have to be a conversion from date to day number at some point, so really the question is where that conversion should lie.

I think relying on the app to convert the date to a number is probably more prone to error, as it has less awareness of the methods and of how to account for things like leap years. I'd say a reasonable middle ground could be to rely on a string representation of a date in yyyy-mm-dd format, or even just mm-dd format, and then handle the rest at the wrapper level (and avoid full date objects, because, yes, depending on the ISO format these could differ across various clients, servers, etc.).
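As an illustration of that middle ground (a sketch only; the helper name and the fixed leap-year reference year are assumptions, not anything agreed):

```r
# Convert an "mm-dd" string to a day-of-year integer at the wrapper level.
# Pinning a leap year (2000) means "02-29" and every later date are handled
# consistently; this is an illustrative assumption, not the agreed convention.
mmdd_to_doy <- function(mmdd, year = 2000) {
  as.integer(format(as.Date(paste0(year, "-", mmdd)), "%j"))
}

mmdd_to_doy("07-01")  # 183
```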

lloyddewit commented 1 year ago

@chrismclarke I agree with your comment above. If the year is not required, then I'd prefer not to include it in the parameters.

I propose:

Would this be OK from your side?

lloyddewit commented 1 year ago

@chrismclarke Another question. How would the app like to receive warnings and error messages (either forwarded from the R functions, or generated by the wrapper itself)? Do you have any preferences?

lloyddewit commented 1 year ago

@dannyparsons It's difficult for me to continue work on the wrapper without an R function that I can use to test. As discussed, it would be great if an initial stub could be implemented for one of the functions above. Alternatively the following R test function would be very useful:

This would be a good test for the wrapper. We could also create an endpoint for the wrapper function and test the full roundtrip between app, wrapper and R.

Please would it be possible to implement such a function?

dannyparsons commented 1 year ago

Thanks for both the comments.

**Forecasts** I had forgotten about the seasonal forecast information, which is likely going to be PDF documents. I agree, it should definitely have an API endpoint. I don't see why this couldn't come via this R package too. Maybe it would be more efficient if it came via another service directly, but it might be simpler to keep it in R. I'll discuss this with Chris M and Stephen.

**Dates** I agree it's best for R to return date-like values instead of day numbers, which could be misinterpreted. To keep these R functions more generally usable in other contexts, I would prefer Chris's suggestion to have dates returned as yyyy-mm-dd, with an option to return this as a string to avoid the conversion issues in Python. Equally, we could also strip off the year in R, but I feel it's safer for testing and checking to keep the year in the R variable and let the app strip the year, if that's OK.

**Return objects** Yes, I would like to have a consistent format for all the functions which return data for charts, and we can include extra columns to help display if needed. For the metadata, I see this as being returned in definitions so that you can get the exact definition used for each summary, probably as a set of parameters (and a description).

**Definitions** The functions will have parameters for providing custom definitions. For the app, my idea is that we have configuration files on the server, alongside the data, which specify the definitions to be used for the summaries. The Met Services would be responsible for maintaining the configuration files. The R wrappers will then pull the definitions to use from the configuration files, so that we have sensible definitions for the app. So yes, the app will not need to provide any definitions (but there will be the ability to override/supply definitions for other uses).
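For illustration, a sketch of the kind of structure such a configuration file could hold, built and round-tripped with jsonlite here just to keep it runnable (the field names and the use of jsonlite are assumptions; the real schema would be defined with the Met Services):

```r
library(jsonlite)

# Hypothetical definitions structure; field names are illustrative, not the real schema
definitions <- list(
  start_rains = list(threshold = 0.85, start_day = 1, end_day = 366),
  annual_rain = list(na_rm = FALSE)
)

# In practice this would live in a configuration file on the server alongside the data
json <- jsonlite::toJSON(definitions, auto_unbox = TRUE, pretty = TRUE)
jsonlite::fromJSON(json)
```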

@lloyddewit I wanted to outline the general structure first before writing functions, but my next step now is to start creating the stubs in this package so that you can work on the Python side. I'll aim to have one stub with the parameters clearly defined and returns a data frame (maybe a dummy one for now) so that you can start testing.

For the definitions function, I'll have a think about a named list vs a data frame. I would have thought lists are quite easy to convert and manage in the R/Python interface, but I can think about this when I have a clearer idea of what exactly needs to be returned. We could maybe have an option to return a list or a data frame, but I also want to check what will be easy for @chrismclarke to work with in the app.

dannyparsons commented 1 year ago

@lloyddewit I now have a function here https://github.com/IDEMSInternational/epicsawrap/blob/main/R/annual_rainfall_summaries.R

It returns a dummy data frame, but the station_id will be in the output so you can see whether that's passing through correctly; the second parameter is currently ignored. There may be other optional parameters that we add to this function later, but it should be a good basis for the initial prototype.
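For reference, the stub is roughly of this shape (an illustrative sketch only; the actual code is in the linked file, and the year range, column names and the placeholder values in the example call are assumptions):

```r
annual_rainfall_summaries <- function(country, station_id, summaries = NULL) {
  # Dummy data frame: station_id is passed straight through so the wrapper can
  # check round-tripping; `summaries` is ignored for now.
  data.frame(
    station_id        = station_id,
    year              = 1961:2020,
    seasonal_rainfall = stats::rnorm(60, mean = 400, sd = 70)
  )
}

annual_rainfall_summaries("zm", "test_station")
```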

lloyddewit commented 1 year ago

@dannyparsons Thank you for the test function. I added it to the Python wrapper and it seems to return the correct data frame. Currently, the function returns a random value in the rainfall column (seasonal_rainfall = stats::rnorm(60, 400, 70)). I am writing automated tests for the Python wrapper, and these only work for deterministic functions. Please would it be possible for the function to always return the same values (e.g. rainfall could be the year divided by 3)? Thanks!

dannyparsons commented 1 year ago

Sure, done in the latest commit https://github.com/IDEMSInternational/epicsawrap/commit/3f08a5830b038ae1fd50b6d5f7c51412e7974202

chrismclarke commented 1 year ago

> @chrismclarke Another question. How would the app like to receive warnings and error messages (either forwarded from the R functions, or generated by the wrapper itself)? Do you have any preferences?

The most important part will be the error response code. The main ones would likely be:

- 400 - invalid parameters/syntax in the request. If possible, forward the message that specifies the missing/invalid parameter (FastAPI should be able to do this out of the box).
- 500 - server crash/unknown/other errors - any other error from the codebase, forwarded.

As users won't be accessing the API directly, there's no need to try to make anything particularly human-readable; it is useful if it can include some form of the original message for debugging/tracing purposes.

https://loadium.com/blog/the-list-of-http-response-status-codes?gclid=Cj0KCQiAutyfBhCMARIsAMgcRJQC0RDJa_k5xJ6W9KB-gTubSTF3pmO-iilWZNtK0QTbYfpR9ilW4S0aAuJbEALw_wcB
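On the R side, one way this could surface (a sketch under the assumption that the R functions signal bad input with stop() and the wrapper maps those conditions to a 400, with anything else becoming a 500):

```r
# Hypothetical validation helper: the condition message would be forwarded in the
# 400 response body for debugging/tracing; any other R error becomes a 500.
check_station_id <- function(station_id) {
  if (!is.character(station_id) || length(station_id) < 1) {
    stop("Invalid parameter: `station_id` must be a non-empty character vector",
         call. = FALSE)
  }
  invisible(station_id)
}
```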

chrismclarke commented 1 year ago

Thanks Danny, all of your response sounds very sensible to me. For the question you raised:

> We could maybe have an option to return a list or a data frame, but I also want to check what will be easy for @chrismclarke to work with in the app.

The main thing will be some sort of JSON representation. I expect using whatever tooling's toJSON method should be fine (and it keeps the door open if later it also has a fromJSON or similar). I expect that I'll implement one final reshaping/tidying layer of scripts on my side anyway, so that you can focus on keeping as much fidelity as possible with the data structures your own scripts use (the wrappers don't really need to have explicit awareness of the app, but they should focus on being externally accessible in a non-R/Python environment, i.e. via web requests).
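For example, if jsonlite ends up being the tooling used on the R side (an assumption, not a decision), a returned data frame serialises row-wise quite naturally:

```r
library(jsonlite)

df <- data.frame(
  station_id        = "test_station",
  year              = 2020:2021,
  seasonal_rainfall = c(673.4, 712.1)
)

# One JSON object per row, which maps cleanly onto chart data in the app
jsonlite::toJSON(df, dataframe = "rows", pretty = TRUE)
```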

dannyparsons commented 1 year ago

@lloyddewit I've edited the table above to include a country parameter for each function, which I had forgotten.

I'm now working on having stubs of each of the functions so you can continue with the Python wrappers.

lilyclements commented 1 year ago

I have created an additional function, total_temperature_summaries, but this will not be called from Python. It is just a generalised version of both annual_temperature_summaries and monthly_temperature_summaries. Hence, annual_temperature_summaries and monthly_temperature_summaries are wrappers around total_temperature_summaries.

If we don't want to take this approach, I am happy to remove total_temperature_summaries and just have the code separated into annual_temperature_summaries and monthly_temperature_summaries.
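A sketch of that pattern (argument names are assumptions; the real functions live in the package):

```r
total_temperature_summaries <- function(country, station_id, summaries = NULL,
                                        to = c("annual", "monthly")) {
  to <- match.arg(to)
  # shared implementation: fetch the data, apply the definitions, then summarise
  # by year or by year-month depending on `to`
}

annual_temperature_summaries <- function(country, station_id, summaries = NULL) {
  total_temperature_summaries(country, station_id, summaries, to = "annual")
}

monthly_temperature_summaries <- function(country, station_id, summaries = NULL) {
  total_temperature_summaries(country, station_id, summaries, to = "monthly")
}
```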

lloyddewit commented 1 year ago

@lilyclements This approach sounds fine to me. I understand that in R, it's possible to specify a function as internal to the package (private) so that it is not visible/usable to external package users. Could we specify total_temperature_summaries() as internal?
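(A sketch of how that could look, assuming the package uses roxygen2 to generate its NAMESPACE; leaving off the @export tag keeps the function un-exported, and @keywords internal hides its help page from the index:)

```r
#' Total temperature summaries
#'
#' Shared implementation behind the annual and monthly wrappers.
#'
#' @keywords internal
total_temperature_summaries <- function(country, station_id, summaries = NULL,
                                        to = c("annual", "monthly")) {
  # No @export tag above, so roxygen2 does not add this to NAMESPACE and it is
  # only reachable inside the package (or via epicsawrap:::total_temperature_summaries)
}
```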

lilyclements commented 1 year ago

For season_start_probabilities, I note that start_dates is a parameter. I would have expected this parameter to be read in from the JSON definitions file. What value does this parameter add? The same applies to the additional parameters in crop_success_probabilities.

chrismclarke commented 1 year ago

Naively, I might guess that we expect people will eventually be able to provide their own definitions for testing different cases, although for the initial app, working with a single definition is also fine.

lilyclements commented 1 year ago

@chrismclarke @lloyddewit I wanted to clarify a couple of points after a chat today with @volloholic @dannyparsons

**Reading in Parameters vs Definitions** We read in the rpicsa parameters from the "definitions" file for annual_rainfall_summaries, annual_temperature_summaries, and monthly_temperature_summaries. The brief at the top of this page said to read the parameters into the epicsawrap function for season_start_probabilities and crop_success_probabilities.

So far I have been reading them all in through the JSON definitions file. I have therefore just added the parameters as a secondary option - if the parameters are NULL we read from the definitions file, otherwise the supplied parameters are used (see PR #17, #18).
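In other words, something like this (a sketch; the definitions() accessor and field names are assumptions about the shape, not the exact code in the PRs):

```r
season_start_probabilities <- function(country, station_id, start_dates = NULL) {
  if (is.null(start_dates)) {
    # Fall back to the JSON definitions file when no parameter is supplied
    defs <- definitions(country, station_id)
    start_dates <- defs$season_start_probabilities$start_dates
  }
  # ... compute the probability of the season starting on or before each start date
}
```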

We are thinking it would make sense, later, to introduce this to the other functions (annual_rainfall_summaries, annual_temperature_summaries, monthly_temperature_summaries). What are your thoughts on this? It will not be a priority for the deadline, but your opinion would be useful.

**Reading in Variable Names from the Data** I will add a function that globally checks the variables in the Climsoft data and "assigns" which one is the date, DOY, year, etc. This assumes that the variable names are the same across all stations, i.e. the date variable is always called "date", say, for every station. This approach was decided, for now, with @dannyparsons and @volloholic. Do either of you have an opinion on this, or any objection that would go against this approach (at least for now)?

**A bonus question** One thing I have just noticed: do you want planting_date and start_dates to be date variables? I have been assuming DOY, but this is straightforward to fix. We can have it work for both. (If so, I'll add a check in the function for dates: if they are dates but the SOR/EOR is a DOY, then we transform one, or vice versa.)
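The check could be as simple as the following (a sketch; the helper name is an assumption):

```r
# Accept either a Date or an integer day-of-year for planting_date / start_dates
to_doy <- function(x) {
  if (inherits(x, "Date")) as.integer(format(x, "%j")) else as.integer(x)
}

to_doy(as.Date("2000-07-01"))  # 183
to_doy(183)                    # 183
```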

chrismclarke commented 1 year ago

Afraid I can't add much to this, as I think we're talking about the interaction between Python and R scripts (and possibly Climsoft), which is all quite far beyond the API that the app will interact with.

All I can say from the app side is that, at some point in the future, it might be nice to be able to provide definitions with the request so that, for example, climate partners could explore possible graphs with different definitions, but that's only a nice-to-have and I'm happy with whatever works for now for the app.

lloyddewit commented 1 year ago

@lilyclements Thank you for the questions.

Reading in Parameters vs Definitions

I think this decision should be made by @volloholic, @dannyparsons and yourself. I am happy to wrap whatever you decide. For me, the important thing is that you provide test examples that cover all the function options. I will then run each test example in RStudio and ensure that the wrapper function returns the same result.

Reading in Variable Names from the Data

Sounds OK to me, I can't think of any problems this would cause for the wrapper.

> do you want planting_date and start_dates to be date variables? I have been assuming DOY, but this is straightforward to fix. We can have it work for both.

DOY is much better from the wrapper perspective, because it is an integer and easy to translate. Date types are problematic because, in the past, Python has tried to automatically convert dates to the local timezone (and if it assumes that the time is midnight, that can change the day, e.g. 01/01/2000 becomes 31/12/1999).