IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0
38 stars 103 forks

Climatic dialogues and features - what's needed #1926

Closed rdstern closed 5 years ago

rdstern commented 8 years ago

The Lesotho work and the Caribbean work provide a good incentive, with real customers, to ensure that the climatic features satisfy their needs. This list can grow, but here are some initial ideas. I am also fairly sure we will be producing a "package", rather than largely making use of packages already produced by others, though we will still use other commands where appropriate.

  1. Padding the (daily) data. The data from Lesotho have missing years, and also missing days/months within the years that exist. It would be good to be able to pad. I am sure we will find a general routine that does this, but perhaps not with all the flexibility we need.
    a) The Lesotho data come with three columns, for year, month and day (within month). I suggest (for simplicity) our "system" will be built on a single date column.
    b) So it must be easy to produce a date column from the year, month and day. That is possibly already there.
    c) Hence, when we pad, do we just pad the date, or could we also pad the year/month/day columns at the same time?
    d) As an option, we may want to pad just for incomplete years, rather than add full years where the whole year is missing.
    e) If we do pad just for incomplete years, then it would be good (at least eventually) to define what we mean by a "year". It may sensibly run from August to July.
    f) I have no problem with people having to do 2 stages. So when we want just incomplete seasons (rather than empty seasons) we could pad fully and then drop seasons where there are no data. That would tie in well with my next feature concerning missing values in the data.
  2. Reporting on missing values. I am sure we will be able to find some existing commands for this, but we may have additional special features we would like.
    a) They can be both graphical - and our inventory plot is an example - and numerical.
    b) It should include a feature to be able to report when complete months are missing.
    c) We will often want to have a report on a single element - usually rainfall. But we also should be able to report on multiple elements when in the same data frame. Often all elements will be missing together.
    d) When more than one day is missing we want to distinguish between consecutive missing values and odd missing days.
    e) In some places we will want the report to be restricted to the rainy season - which we will define in terms of the first and last month (which could be "October" to "April").
    f) This isn't strictly "missing" so could go into the next point. But sometimes a whole month being zero (or one particular value - usually it is zero rainfall) is a cause for concern. It might be missing and simply recorded as zero. Usually there will be specific months where "all zero" is OK, perhaps 2 months where it is "questionable" and other months where it is most unlikely. (For example in Southern Africa November and April could be questionable, while December to March are most unlikely.)
  3. Filters. I really like the filter system in Excel - which is very visual - and wonder whether we can become close in R-Instat - where our filters are already pretty powerful. (And we might be able to do even more!)
    a) So in Excel, for any column (not just factor) I can see the distinct values in order, and then choose them. I did this for the Lesotho data, where the 5 or 6 lowest max temp then showed examples where max temp and min temp were clearly switched over.
    b) What I don't know how to do in Excel would be to be able to take the x lowest (or highest) within a station and then within a year (i.e. within a factor or two). That would be great!
    c) I would have no problem if this is in (say) two stages, so we produce a logical column first for true/false for the nth largest/smallest within a factor combination. (I think this function may be in dplyr?) Then the existing filter would be easy.
  4. Consecutive values the same. This is an interesting feature in RclimdexQC. And we should be checking there for more features that we would like. It is a sort of spell length for identical values, e.g. max temp = 21 for 8 consecutive days. So the important thing is that the value stays the same. We should be able to exclude particular values from the calculation, particularly zero rainfall. This could produce a report.
  5. Producing a report. When we find problems in the data, can we quickly produce a report - presumably another data frame? This is linked to the main data frame and just contains the records where there are issues. This could be (at least in the first place) just for daily data.
    a) There could be the name of the parent data frame as one repeated column.
    b) Then the issue number - from 1 upwards.
    c) Then the type of issue.
    d) Then Year/Month/Day columns. The issue could relate to a year, in which case the Month and the Day columns would be missing.
    e) Then the data. This is missing for Month and Year issues. For day issues it could be a single day or perhaps more than one day, if that is needed to show the issue.
    f) Then a comment - typed by the user to explain a bit about the problem.
  6. PICSA. I am assuming that we will be having dialogues for the types of event used by PICSA. But there may be scope (like for corruption) to repeat the main dialogues, but also perhaps to have special versions of them.
    a) This is all part of "making it easy" for NMSs to contribute to PICSA.
    b) One aspect would be to make it easy to produce consistent graphs of the events for any given station. This could largely be a special theme, but would to some extent also be data dependent.
  7. CPT. We need to prepare data for CPT. There could be a CPT menu - like the PICSA menu. This could include the import of SSTs, the PCs and CCAs and also the export of the climatic data for CPT.
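
To make items 1 and 2 above concrete, here is a minimal sketch of padding daily data and then reporting runs of missing days. It is written in plain Python purely for illustration (R-Instat itself is built on R), the function names `pad_daily` and `missing_runs` are hypothetical, and the rainfall values are made up. It assumes the data already have a single date column and pads only between the first and last date present.

```python
from datetime import date, timedelta


def pad_daily(records):
    """Return one (date, value) pair per calendar day between the first and
    last date in `records`; days absent from `records` get the value None."""
    by_date = dict(records)
    start, end = min(by_date), max(by_date)
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    return [(day, by_date.get(day)) for day in days]


def missing_runs(padded):
    """Return (first_date, last_date, length) for each run of consecutive
    missing days, so isolated days (length 1) and longer gaps can be told apart."""
    runs = []
    run_start = None
    previous = None
    for day, value in padded:
        if value is None:
            if run_start is None:
                run_start = day
        elif run_start is not None:
            runs.append((run_start, previous, (previous - run_start).days + 1))
            run_start = None
        previous = day
    if run_start is not None:  # a run that reaches the end of the data
        runs.append((run_start, previous, (previous - run_start).days + 1))
    return runs


# Made-up daily rainfall with 3-4 January and 6 January absent.
rain = [
    (date(2015, 1, 1), 0.0),
    (date(2015, 1, 2), 3.2),
    (date(2015, 1, 5), 0.0),
    (date(2015, 1, 7), 1.1),
]
padded = pad_daily(rain)
for first, last, length in missing_runs(padded):
    kind = "consecutive" if length > 1 else "single day"
    print(first, last, length, kind)
```

Padding first and reporting second follows the two-stage approach suggested in 1f. Restricting the report to a rainy season, or defining the "year" as August to July, would be further filters on the padded rows.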
rdstern commented 7 years ago

The Lesotho team has now left. The work since the suggestions above has confirmed the importance of those topics. In week 2 we looked at their data, using R-Instat. This was (for me anyway) the first time we have tried to use R-Instat for data analysis, rather than to test the software. It worked, and the Lesotho team plan to continue to use R-Instat when they return.

So the list above really stands. The lessons for them will be similar in many other countries. They have already been using their climatic data for analyses, without realising some of the (obvious) problems in their data. And the boxplots in particular showed them how serious those problems are. For example they clearly had sections of a month where they had accidentally typed max temps, when it should have been min temps or even rainfall!

So the items above that relate to quality control remain really important, and we could usefully think of more. I had thought previously that we should concentrate on more general aspects of quality control. But my failure to find a good general package, together with the urgent need, makes me feel that supporting QC for climatic data should be a priority. Also (and unlike my previous message on multinomial) there is a set of tasks here that we can do quickly and well.

We should do this in relation to what exists already, see http://etccdi.pacificclimate.org/software.shtml for example and the presentation by Enric Aguilar, here https://www.wmo.int/pages/prog/wcp/ccl/opace/opace2/documents/Aguilar-Nanjing-2013-Presentacion2.pdf

Enric has visited us in Reading and will be very receptive to our suggestions - he also chairs the relevant WMO expert group.

There is one "interesting" addition that I would like to consider for the QC. This is to look for evidence of a big change in "pattern" in the data from one month to the next. This is because some errors that correspond to entering data in the wrong column automatically change from one month to the next. So that change could alert us to look at those parts of the data in more detail. (There are lots of alternatives here, e.g. a single-day jump when it is at the end of the month, or a change in pattern for the last (say) 5 or 10 days in relation to the equivalent days at the start of the next month.)
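
One of the alternatives mentioned above, comparing the last few days of a month with the first few days of the next, could be sketched as follows. This is a rough illustration in plain Python, not an existing QC routine: the function name `month_boundary_jumps`, the 5-day window and the 5-degree threshold are all assumptions, and the series is assumed to be daily, sorted and free of missing values.

```python
from datetime import date, timedelta
from statistics import mean


def month_boundary_jumps(series, window=5, threshold=5.0):
    """For each first-of-month in `series` (a sorted list of (date, value)
    pairs), compare the mean of the first `window` days of that month with
    the mean of the last `window` days of the previous month. Return
    (year, month, difference) where the absolute difference exceeds
    `threshold`."""
    flagged = []
    for i, (day, _) in enumerate(series):
        # Skip days that are not a month start or lack a full window either side.
        if day.day != 1 or i < window or i + window > len(series):
            continue
        before = [value for _, value in series[i - window:i]]
        after = [value for _, value in series[i:i + window]]
        difference = mean(after) - mean(before)
        if abs(difference) > threshold:
            flagged.append((day.year, day.month, round(difference, 2)))
    return flagged


def constant_days(start, n_days, value):
    """Helper for the example: n_days consecutive days with the same value."""
    return [(start + timedelta(days=i), value) for i in range(n_days)]


# Made-up max temperatures where February looks like min temperatures.
tmax = (
    constant_days(date(2015, 1, 1), 31, 28.0)
    + constant_days(date(2015, 2, 1), 28, 14.0)
    + constant_days(date(2015, 3, 1), 10, 28.0)
)
print(month_boundary_jumps(tmax))  # [(2015, 2, -14.0), (2015, 3, 14.0)]
```

A jump into a month followed by an equal jump back out, as here, is the pattern that would suggest a whole month entered in the wrong column. Real data would need a threshold chosen per element and season.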

In addition, the existing QC is designed just for rainfall and max and min temperatures. The data we will be getting from Lesotho will have the other variables. They include more temperatures (dry-bulb and wet-bulb), wind speed, etc. We should also think about what is appropriate there. In particular the additional temperature variables are useful for further checks on the max and min temperatures.

rdstern commented 7 years ago

We are now starting work properly on the climate object (again) and also the climate menu. We expect this to start through David with Lily and Steve, and then to include some of the Maseno team as soon as is appropriate.

Here is my understanding of where we are now:

1) We have moved on quite a long way since the original climate object in 2014. It is not clear how much will remain of that previous code. It also means that the new climate object will be usable (easily) through R-Instat, and also probably (in a similar menu-driven way) from ClimSoft. How easy it will be as a stand-alone product remains to be seen.
2) You would still (of course) be able to start in R-Instat and then continue in R - through RStudio.
3) The new climate object will be an instance of the corresponding Instat object.
4) Where appropriate it will build on structures that are already in place, or planned, in the existing R-Instat.
5) It will recognise "What", "When" and "Where", i.e. the element being analysed (e.g. rainfall) is "What", the date and time of the measurements are "When", and the station being analysed, usually including its geographical position (lat, long, alt) is "Where".
6) The climate object will always include a (single) date column. R handles dates well and this part is in the main Organise menu (possibly repeated in the climatic menu). Where dates come in without a single date column, the date will be constructed. I am not yet sure whether this will be done by the user before the climate object is defined - I think so - to make it simple. The alternative is for the definition to include that step. (But that would be quite complicated to construct and also possibly confusing, because it would be a bit magical!)
7) The station information will usually be in the form of a factor data frame - which we already have in the Organise menu. It could (apparently) alternatively be in the attributes associated with a data frame, or with a column in a data frame.
8) In completing the general (Instat object) facilities, we still have to add dialogues to manage the links between data frames and the keys for data frames. "Date" will often be a key in the main data frame for the climate object. There could be alternative keys, for example "Date" and "Day"+"Month"+"Year". Where there are alternatives, we need to check that they are consistent.
9) A common "shape" for the data could be with "Date" being a unique key, and with information on different elements ("What"), and stations ("Where") being in different columns (variables). Alternatives are where the stations or the elements, or both, are stacked. In that case the Station+Date, etc. would constitute the key.
10) We will therefore need checks that the key (or keys) are unique. We need to check this aspect also as part of our quality control of the climatic data - we could have the same data from different sources - so that will be one task. Other software for climatic analyses already does this, but I suspect we should think this out for ourselves. A duplicate key could be a mistake, but could also be a useful feature, e.g. results from double entry, where we want to look for differences and report sensibly. This latter case should (usually) have another component of the key, e.g. to specify the source of the data. (I wonder what the situation is in ClimSoft?)
11) I presume we will have a dialogue in Climatic > Organise to produce a climate object from an Instat object? As the climate object is being constructed it would be good to clarify what this will consist of. Then I assume most of the other climatic dialogues will assume a climate object. I assume we will be able to save climate objects, i.e. the corresponding Instat object might have more components (data frames, etc.) that we don't need. Alternatively, when we produce a climate object it might have a tidying option included?
12) Most of the data frames in a climate object will be "regularly spaced", e.g. daily or yearly. A few will not, e.g. all days with rain, or all extreme days, defined as days with rain > 50mm, or all possible planting (start) dates between April and June. The regular spacing offers the possibility of "infilling" in different ways.
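
The key checks in points 8 to 10 could look something like the sketch below. It is plain Python for illustration only, not the Instat object's own key handling: the function names `duplicate_keys` and `inconsistent_dates`, the column names and the rows are all made up.

```python
from collections import Counter
from datetime import date


def duplicate_keys(rows, key_fields):
    """Return the key values (as tuples) that occur in more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def inconsistent_dates(rows):
    """Return rows where the date column disagrees with the alternative
    year + month + day key."""
    return [
        row for row in rows
        if (row["date"].year, row["date"].month, row["date"].day)
        != (row["year"], row["month"], row["day"])
    ]


# Made-up stacked data: stations S1 and S2 in the same data frame.
rows = [
    {"station": "S1", "date": date(2015, 1, 1), "year": 2015, "month": 1, "day": 1, "rain": 0.0},
    {"station": "S1", "date": date(2015, 1, 2), "year": 2015, "month": 1, "day": 2, "rain": 3.2},
    # Same station and date again, e.g. from double entry.
    {"station": "S1", "date": date(2015, 1, 2), "year": 2015, "month": 1, "day": 2, "rain": 3.5},
    # Month and day swapped in the alternative key.
    {"station": "S2", "date": date(2015, 1, 2), "year": 2015, "month": 2, "day": 1, "rain": 0.0},
]

print(duplicate_keys(rows, ["date"]))             # date alone is not a key for stacked data
print(duplicate_keys(rows, ["station", "date"]))  # the remaining duplicate is the double entry
print(len(inconsistent_dates(rows)))              # 1
```

With stacked stations, "date" alone repeats by design, so the check only makes sense against the full key. Adding a "source" field to the key, as suggested for double entry, would make the remaining duplicate legitimate.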
13) The "events" in the current Instat correspond to one of four options within the R-Instat framework. We consider them in turn, with the idea that each could be a dialogue in the Climatic > Organise menu.
13a) The first is Calculation. Examples are 3-day running totals, and length of dry spell on each day. These are components of some of the summary definitions. I still suggest that a simple (powerful) way of providing this dialogue is to add a climatic keyboard to our existing calculator and then to open it, with that keyboard as the default. An alternative could be to have a special dialogue with (perhaps) its own keyboard, or make it simpler by listing the possibilities (as we plan to do with our transform dialogue, and already do with dates and strings calculations). I like the idea of our general calculator, because other calculations (e.g. Centigrade to Fahrenheit) would then all be easy.
13b) The second is actually a filter, but we might call it Condition. It gives all the days (or other rows) that satisfy a given condition. This could be derived from the general filter dialogue. The results could be expressed as a filter, and hence stay in the same data frame, or they could be put into a new data frame. This dialogue is less used in the old Instat, but it is roughly the same as is needed - as a general concept - for quality control. It may be that we do those options separately, partly because users will like to see a quality control menu title. But it is likely to again essentially be a filter.
13c) Summary. This is the "usual" one. In the current Instat we differentiate between different summaries, which are split over about 5 dialogues. It will be an interesting challenge to put them into a single dialogue. I look forward to that!
13d) Risk. This is new (and exciting - at least for me!). As a general concept it is looking at the seasonality, rather than the trend. Maybe Seasonality is a better name. I suspect that it may often follow the use of one of the other dialogues above. A simple example from daily data would be the chance of rain each day of the year - with a fitted curve? Or the mean rain per rain day. Typically it will result in graphs and tables where the months, or dekades, etc. through the year are on the x-axis, while the Summary dialogue will lead to time series presentations. Nice to be able to suggest circular graphs here!
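
The two Calculation examples named in 13a, 3-day running totals and the length of the dry spell on each day, can be sketched as below. This is plain Python for illustration only, the function names are hypothetical, and the 0.85 mm rain-day threshold is an assumption. It also assumes a daily series with no missing values.

```python
def running_total(values, width=3):
    """Sum of each value and the `width - 1` values before it.
    Positions without a full window get None."""
    totals = []
    for i in range(len(values)):
        if i + 1 < width:
            totals.append(None)
        else:
            totals.append(sum(values[i + 1 - width:i + 1]))
    return totals


def dry_spell_length(rain, threshold=0.85):
    """Number of consecutive dry days up to and including each day,
    where a day is dry if its rainfall is at or below `threshold`.
    A rain day resets the count to 0."""
    lengths = []
    spell = 0
    for amount in rain:
        spell = spell + 1 if amount <= threshold else 0
        lengths.append(spell)
    return lengths


# Made-up daily rainfall in mm.
rain = [0.0, 12.0, 0.0, 0.0, 5.5, 0.0]
print(running_total(rain))     # [None, None, 12.0, 12.0, 5.5, 5.5]
print(dry_spell_length(rain))  # [1, 0, 1, 2, 0, 1]
```

Both results are new columns the same length as the daily data, which is what distinguishes a Calculation from a Summary. A Summary would then reduce such a column, for example to the longest dry spell per season.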

dannyparsons commented 7 years ago

Is this a useful set of comments to refer to post-July?

rdstern commented 7 years ago

Yes. I am keen to work to July on the basic stuff we have, and then use this to discuss funding of more work, particularly in relation to applications that would also be useful for CLIMSOFT.

shadrackkibet commented 5 years ago

I think this should be closed now. Most of these ideas have been implemented. I suggest that anything that remains be reported as specific, separate issues.