IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Climatic data and tsibbles #5639

Open rdstern opened 4 years ago

rdstern commented 4 years ago

We make quite a lot in R-Instat of using tidyverse. We are not the only ones! There is now a tidymodels package that could be important for our own modelling menu.

But the main purpose of this issue is that the simplicity in our climatic menu depends on defining data frames as climatic data. I think they are tsibbles!?

If so, then I suggest it could be important (and valuable) for us to be fitting in with what R is doing now. It is clear that Rob Hyndman is a key person in time series modelling, and our climatic data are all time series. He is the writer of the R time series task view and also the author of a new key (free and only online) textbook. That's interesting in itself! But even more interesting is that he is the main author and also maintainer of the standard package on time series analysis, called forecast. This package is in 5 task views and many other packages depend on it, and it is pretty specialised. And it is now being replaced by his new system.

For example I was assuming - and still do - that we need to define variables as time series (ts), even though they don't cope that well with daily data (zoo objects do better, apparently). But we have multiple time series objects, because we measure multiple elements (e.g. rainfall and tmax and tmin). Those are defined in the forecast package (now defunct, I think) as msts (multi-seasonal time series).

We have just those for our hourly data, where the day and the year define 2 seasonal patterns. I think they may all be allowed for in tsibbles?

I suggest this might become automatic in Define climatic data. Also our Prepare > Define menu might become more general. It might allow us to define data frames as tibbles when they are tidy and as tsibbles when they are time series?

I think we will still need to be able to define ts (and zoo) objects too. And we should check our Open from library. I don't think it can yet read a time series object? I'll check for some examples.

dannyparsons commented 4 years ago

A few observations from quickly looking at the tsibble package:

The general concept is very nice. It would be nice to follow a standard like this. It does a good job at simplifying the R code, although that's not our top priority. The main danger is that we tie ourselves into this system and then realise it is too restrictive for our needs.

We should try this in R on a variety of the datasets that we use, and want to use, in R-Instat, to better understand whether it's fit for our uses.

dannyparsons commented 4 years ago

Other questions:

rdstern commented 4 years ago

Good spotting. I have been impressed by the extent of the changes in this new "system" compared to his previous package (forecast). His new book is also open and only online - also interesting. I am not sure he (they) have considered climatic data. I wonder if it is worth writing to him (them)?

There is the factor issue, the shifted year, daily data having the option of 366 days, then dekad, pentad and weekly data. That's the list so far.
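To make the time units in that list concrete, here is a minimal sketch in plain Python (not tsibble or R-Instat code). It assumes the common conventions of 3 dekads per month (36 per year), 6 pentads per month (72 per year), with the last period of each month taking the remaining days, and a fixed 366-day scale on which 1 March is always day 61. The function names are hypothetical, and the conventions should be checked against what R-Instat actually does.

```python
from datetime import date


def dekad_of_year(d: date) -> int:
    """Dekad number 1..36: three per month, the third takes the remaining days."""
    return 3 * (d.month - 1) + min((d.day - 1) // 10, 2) + 1


def pentad_of_year(d: date) -> int:
    """Pentad number 1..72: six per month, the sixth takes the remaining days."""
    return 6 * (d.month - 1) + min((d.day - 1) // 5, 5) + 1


def day_of_year_366(d: date) -> int:
    """Day number on a fixed 366-day scale (assumed convention).

    The date is mapped onto a leap year, so 29 February is always day 60
    and 1 March is always day 61, whether or not d.year is a leap year.
    """
    return date(2000, d.month, d.day).timetuple().tm_yday
```

None of these units is a regular calendar interval, which is why it is worth checking how a tsibble index would treat them.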

dannyparsons commented 4 years ago

If there is functionality we need which we think is also sensible generally then we can suggest it as issues to the package and see what the response is.

lilyclements commented 4 years ago

I had a little look at running as_tsibble through the calculator system in R-Instat.

I read time series data into R-Instat and created different time-index formats through the "Use Date" dialog to try out as the index parameter in the as_tsibble function. There were no problems that I could see; however, perhaps there is more I should be looking at. This was just to create a time series object; I did not actually use the object in any time series functions. Is it worth exploring this?

Other questions:

  • Can it cope well with our idea of a shifted year?

How can I explore a shifted year in R-Instat? When creating different time-index formats in the "Use Date" dialog for the index parameter in as_tsibble, I shifted the starting month to, say, June, and could not see any problems. However, perhaps there is more to explore than that when shifting a month.
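As a reference point for that exploration, here is a minimal sketch in plain Python of what a shifted year could mean for a single date. It assumes the shifted year is labelled by the calendar year in which it starts, which may not match R-Instat's own labelling; the function names and the start_month parameter are hypothetical.

```python
from datetime import date


def shifted_year(d: date, start_month: int) -> int:
    """Label of the shifted year containing d (assumed: the year it starts in)."""
    return d.year if d.month >= start_month else d.year - 1


def shifted_day_of_year(d: date, start_month: int) -> int:
    """Day number counted from the first day of the shifted year (1-based)."""
    start = date(shifted_year(d, start_month), start_month, 1)
    return (d - start).days + 1


# With the year starting in June, 31 May 2020 still belongs to the 2019 year.
print(shifted_year(date(2020, 5, 31), start_month=6))
print(shifted_day_of_year(date(2020, 6, 1), start_month=6))
```

A useful test in R-Instat would be whether year-based summaries and the day-of-year variable still follow this kind of grouping after the data frame has been converted with as_tsibble.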

A few other side notes:

  1. Would it be worth having the "Duplicates" dialog before the "Define Time Series" dialog? And/or would a button to check the key is unique in the "Define Time Series" dialog be sufficient? In R, if I create a key in the as_tsibble function which is not unique, I get the following error:

Error: A valid tsibble must have distinct rows identified by key and index.
i Please use duplicates() to check the duplicated rows.

  2. If I import the data sets from the tsibble package, I get an error (see below). I think this is due to certain variable types in the data: if I save a data set as RDS and import it, the error appears, but if I remove the following columns, save as RDS, and import, the data imports fine. In tourism there is a variable of class "yearquarter", "vctrs_vctr". In pedestrian there is a variable of class "POSIXct", "POSIXt". Is this a known problem? I've seen a closed issue (#4072) on the "POSIXct", "POSIXt" classes; however, when importing from the library there is still an error.

The error message for item 2 is:

Cannot retrieve metadata
Error: Could not retrieve data frame metadata from R. Data displayed in spreadsheets may not be up to date. We strongly suggest restarting R-Instat before continuing.
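On the first side note, the rule behind the tsibble error is that no two rows may share the same combination of key and index. Below is a minimal sketch of that check in plain Python, as a package-independent illustration of what a "check the key is unique" button would need to do. In R the duplicates() function named in the error message does this job; the function name duplicated_rows and the example columns (station, date, rain) are hypothetical.

```python
from collections import Counter


def duplicated_rows(rows, key, index):
    """Return every row whose (key, index) pair occurs more than once."""
    counts = Counter((row[key], row[index]) for row in rows)
    return [row for row in rows if counts[(row[key], row[index])] > 1]


# Hypothetical daily rainfall records; station A has 2020-01-02 entered twice.
rows = [
    {"station": "A", "date": "2020-01-01", "rain": 0.0},
    {"station": "A", "date": "2020-01-02", "rain": 5.2},
    {"station": "A", "date": "2020-01-02", "rain": 5.2},
    {"station": "B", "date": "2020-01-01", "rain": 1.1},
]

dups = duplicated_rows(rows, key="station", index="date")
for row in dups:
    print(row)
```

Station B sharing a date with station A is not a duplicate, because the key differs. A dialog could run this kind of check first and only enable the define step when the result is empty.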

rdstern commented 4 years ago

I respond to some of the points: 1) I like the idea of the Duplicates before the define. Should we have the same Tidy and Examine menu (from climatic) first? It includes the Duplicates dialogue. 2) Should we therefore also include the Dates dialogues first as well? 3) Then instead of Define Climatic Data we have Define Time-series data?

On importing the data from the library, could you check again with the latest merged version? The importing was recently improved by @Patowhiz, and it now seems to import the 2 data files from the tsibble package and the first data files from the tsibbledata package.

lilyclements commented 4 years ago

  1. I'm not sure why a "Tidy and Examine" set of dialogs would be useful for only one of the structured define dialogs rather than all four. I don't know the structured data in the "Define Circular" or "Define Low_Flow" dialogs but looking at my initial "messy real-world" survival data I use in the PhD, some dialogs in the "Tidy and Examine" menu are relevant. However, I don't follow why some dialogs in the "Tidy and Examine" suite (e.g. "Visualise Data") would be more relevant to these specific structured data over a regular data set. Overall: I'm unsure why only one define dialog in the Structured menu would have a "Tidy and Examine" set of dialogs, but then is it a bit counterproductive to have a Prepare menu, and then a separate "Tidy and Examine" set of dialogs under the Structured menu?

  2. Would the "Dates" menus be useful for the data structures in the "Define Circular" or "Define Low_Flow" dialogs? Looking at the MSc project for the survival menu, it was suggested to have the set of dates dialogs before defining survival data.

lilyclements commented 4 years ago

On importing the data - I have updated my branch with the latest version and am still getting an error.

rdstern commented 4 years ago

On your item 1 I was following your suggestion that perhaps the check for duplicates would be useful in the Time series menu. I was then thinking that this is an item in the Tidy and examine menu in Climatic. So there is a sort of parallel between the climatic menu (where the data are time series) and this new menu.

So, as the Check for duplicates is conveniently in the Climatic menu, users doing climatic analyses can mainly use just the Climatic menu, rather than having to go to the Prepare menu first.

I was not thinking of all the items from that menu. In particular the "Tidy Daily Data" would be omitted. That's the only special dialogue. All the others are taken from the Prepare menu. So in the Time series case there would only be the most used items from the Prepare menu and it might therefore be simpler to let people use the Prepare menu instead.

Or we just include items from Prepare that are particularly important for the time series analyses, with tsibbles. So, we still include Duplicates, because that is particularly important for these analyses. Anything else uses the main Prepare menu. Just including a single item (Duplicates) has another advantage that it doesn't need a sub-menu. It is just one more item in the special prepare section. I quite like that.

On that argument we would again leave the Dates dialogues on the main Prepare menu.

On the importing, Patrick has a pull request called "Detect List of Datasets". Could you try with that branch? It did work for me without that, but perhaps Patrick's new code has solved the problem. If not, then we can open a new issue.