IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Including the QC from Rclimdex and Rclimdex extra in R-Instat #2392

Closed rdstern closed 6 years ago

rdstern commented 7 years ago

Aguilar-Nanjing-2013-Presentation2 QC R-CLIMDEX.pdf Manual_rclimdex_extraQC.r.pdf

The quality control in R-Climdex is fairly clearly shown in the presentation, which is the first of the files above. It is by Enric Aguilar, who works in this area. It produces a whole set of files for each station, for users to examine. It also calculates the 27 indices from WMO.

These indices are a separate dialogue for us. They use the climdex-pcic package that is in CRAN. I strongly suggest we make use of this package, even if we can produce those indices ourselves. This is also the recommendation from one of the original authors, from Canada.

In contrast, the QC procedures are written in R, but they are not a CRAN package and were written a long time ago. I wrote to Enric (who visited Reading earlier, and whom I also met at a subsequent workshop) to ask his advice. His reply was also sent to Patrick Parrish, from WMO, whom we will meet in Reading in January. It is below:

"First (suggestion from me) is to keep the results within the same software environment, rather than writing output files: this is the way it should go. 100% agree. RClimdex was written 15 years ago (time flies); it started as Climdex, with Excel files, and then the "R" was added, so there was a story of using external Excel files. Besides, we were ensuring that computers with limited resources could work. The extra-qc was a plugin done very quickly and I had no time to think of a better interface.

A couple of years ago, I wrote a piece of code called quickqc.R, to extend the capabilities of extra-qc to whole networks and add a few more capabilities. I have a descriptive PowerPoint, only in Spanish, but I know this is not a problem in your capable team."

QUICKQC.pptx

"Second is to broaden the checks to examine all climatic elements.

We did some attempts to do so for the UERRA project. I coded some QC for pressure data. There is no way you can use it, as it needs very specific data, but I am attaching it, as it can give you some ideas about what to look at. I know some other colleagues also coded QC for wind. I know that the guys at the SMN Argentina (Dr. Maru Skansi, mms@smn.gov.ar) did a tremendous work on this."

quickqc_v2.0.zip

So that is the basis for my suggestions below

rdstern commented 7 years ago

I suggest we start by having a dialogue (or more than one) that does the checks that are at least done by the existing rclimdex-qcextra. But we consider carefully how to do them. They produce two sorts of output:

a) Boxplots, either by year, or by month. We could do both and use facets to make them clear.
b) They also produce time series plots for the temperatures, which we might do too, again possibly with year as a facet, while they produce them as long plots for each set of 10 years.

c) The second type of output is numeric, with the results of a given check each in a separate file. We could do this by having a separate comment for the rows that correspond to each check. Then we should be able to filter on those rows.

d) the one "method" I am not sure how to make easy in R-Instat is that I would like to nominate particular "months", from the daily data, and then be able to present the data for those months. This is because the data entry for daily data is from monthly sheets. So, I would like a check where I could nominate that this month should be examined, or perhaps a range of months - but then one month at a time.

The 8 text files produced by CLIMDEX-EXTRA_QC are as follows:

  1. duplicate.txt, which is for duplicate days in the file. This is like our test for uniqueness for key columns. I suggest we make that into a more general test than climatic.
  2. tx_flatline.txt and tn_flatline.txt are for consecutive days for Tmax and Tmin where the values are identical. It notes how many are identical and the default limit is 4 days for reporting. This could work for us for any element, particularly if we have the option of having a value that is ignored in the sequence. So, for rainfall, we would ignore zero. Do we ever want to ignore anything else?
  3. Tx_jumps.txt and tn_jumps.txt are for differences between consecutive values. They take 20 degrees as the limit, which seems very large. We would want to be able to take a smaller value.
  4. Outliers.txt. I think this is taken from the monthly boxplots, and notes the values that are extreme.
  5. tmaxmin.txt is for all days when Tmin>=Tmax. Of course we could generalise this, for example to look at all days when Tmax<Tmin+2 say.
  6. toolarge.txt reports rainfall values >200 and temperature values > 50. It also reports too large a value for the diurnal range (Tmax-Tmin) but the threshold was not stated.
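
The flatline check in item 2 could be sketched in base R with `rle()`. This is only an illustration, not the RClimdex code: the function name `flag_same` and its parameters `min_run` and `ignore` are made up here, and the data are invented.

```r
# Hypothetical helper (not from RClimdex or R-Instat): flag every value that
# belongs to a run of at least `min_run` identical consecutive values.
# Values listed in `ignore` (e.g. 0 for rainfall) never count as a flatline.
flag_same <- function(x, min_run = 4, ignore = NULL) {
  runs <- rle(x)                       # run-length encoding of the series
  long <- runs$lengths >= min_run & !is.na(runs$values)
  if (!is.null(ignore)) {
    long <- long & !(runs$values %in% ignore)
  }
  rep(long, times = runs$lengths)      # expand back to one flag per day
}

# Invented Tmax values with four identical days in a row
tmax <- c(21.5, 23.0, 23.0, 23.0, 23.0, 24.1, 22.8)
same_tmax <- flag_same(tmax)

# Invented rainfall: the run of zeros is ignored, the repeated 5.2 is flagged
rain <- c(0, 0, 0, 0, 0, 5.2, 5.2, 0, 1.1)
same_rain <- flag_same(rain, min_run = 2, ignore = 0)
```

The same function serves any element, which matches the idea of making the check general and only changing the ignored value and the run length.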

In our work we need to be able to prepare the special file for the (odd) rows and then we should be able to filter the data using checks from this file. We also need to be able easily to delete the particular checks, so we can change thresholds easily.

This will need some thought. It could be incorporated in a good MSc project, which could add more checks based on different formulae.

rdstern commented 7 years ago

I now see more clearly, first how this could be done, second that it may be quite easy, and third that it is important to do it soon. The importance is for Lesotho in the first instance and also potentially for Malawi. In both cases Stats4SD has some funds and a bit of them could legitimately be used to make these features have high priority. Also like the climdex dialogue, this will have high positive publicity, because it will cover most if not all the features in climdexqc and climdexqcextra. Plus some more I hope.

Currently data entry staff are used to do the data entry using CLIMSOFT or CLIDATA. At the other end of the work, results from the historical data are discussed with farmers, who understand easily. I claim that the data entry staff could easily use R-Instat. If the data checks could be an easy-to-use set of dialogues in R-Instat (and perhaps later in CLIMSOFT) then they can be done by the data entry staff. This would be interesting for them, it doesn't require high-level skills and could support improved data quality in the historical climatic database.

There is another potential group of users. These are the NMS staff and volunteer staff at the individual stations. In the longer term it would be excellent if they could use R-Instat easily to deliver products locally. But in the shorter term could they (if needed) be added to the team handling data entry and quality control of the climatic records - perhaps particularly for local stations.

We have always thought this for simple products - and teach it in e-SIAC. Now, if R-Instat has good data checking facilities it will be interesting for the staff at the individual stations to be able to use these also.

With the old Instat this addition would have been a formidable task. I suggest in R-Instat it is simple, because of the power of R and the structures already in place in R-Instat. So let's do it! I describe how in the next comment.

rdstern commented 7 years ago

See below for an updated comment:

The dialogue has a set of radio buttons across the top. They could initially be: Large, Same, Wet Days, Dry Month, Outlier, Jump, Difference. There is the usual data selector. And the usual receivers with fields Station, Date, Year, Month, Day, Element. They are filled automatically. If Maxmin is the button selected, then there is a second data field labelled Element 2.
In these checks the options wet days and dry month are only for rainfall. So the rainfall column could be filled then. Perhaps it is filled by default also for Large and Same. The other options apply more for temperatures, so perhaps the default could be to put They each are essentially a filter of the observations that satisfy the criterion and hence should be examined in turn.
The filtered rows are put into a new data frame (like the comments data frame). One feature is that from a given set of data the default is to add successive rows to this same new data frame. But you could choose to put the rows into a new data frame if you wish. So there is a field with a label something like Check Data Frame name: And the default name of that frame. That data frame has a number of standard columns from the data, namely the fields in the dialogue. There is then another variable which is the name of the check. The columns give each of the elements used. That could be all of the elements (at least for now).

There is an Options button that takes you to a sub-dialogue: This could have a main sub-dialogue, plus possibly some tabs. Initially there could be boxes with a label for each option. Perhaps it only need show the options for that main dialogue button. Better would be for all to be visible, but only the chosen one to be enabled.

Large in CLIMDEX is 200mm or more for rainfall and 50 degrees for other elements. It could usefully be given as 2 values with Rain between 0 and 200 being the defaults for rainfall (so -99 would also be captured). And perhaps between 0 and 50 for the other elements as the default.

The Options for the button on Same is 2 identical values omitting 0 for rainfall or sunshine. Default is 4 identical values for other elements, but the option can change this number.

The button for Wet days is just for rainfall. It is the spell length for consecutive rain days. Usually the spell is as given in the spells dialogue, and here between 0.05 and 200, so it gives consecutive rain days. Report when there are too many (so perhaps the data entered were really temperatures). Default is 10 days.

Dry month doesn't have any obvious options. It calculates the number of rain days in the month - days with values of 0.1mm or more. It notes if this is zero. It might note the 31st day of the month (or the day number could be NA).

Jump is when the difference between one day and the next (in tmax or tmin, etc) is too large - in absolute terms. 20 degrees is the default.

Difference is the difference between tmax and tmin (or 2 other columns). Default is when difference is zero, but can choose this number.

Outlier is the only statistic that is a bit more difficult to calculate. The option is Number of Interquartile Ranges. Default is 3. Calculation is on a monthly basis for an element within each station. Calculate lower quartile (LQU) and Upper Quartile (UQU). Difference (UQU-LQU) is IQR. Then an outlier is when a high observation is more than 3 IQR above UQU. A low outlier is when it is less than 3 IQR below LQU.
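
The Outlier rule described above (3 IQR beyond the quartiles, calculated per month within each station) could look roughly like this in base R. It is a sketch under assumptions: the helper `flag_outlier`, the column names and the data are all invented, and `quantile()` is used with its default quartile definition, which may differ slightly from what RClimdex uses.

```r
# Hypothetical helper: TRUE where a value lies more than `n_iqr` interquartile
# ranges above the upper quartile or below the lower quartile of its group.
flag_outlier <- function(x, n_iqr = 3) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]                       # UQU - LQU
  x > q[2] + n_iqr * iqr | x < q[1] - n_iqr * iqr
}

# Invented daily Tmax for one station and two months
dat <- data.frame(
  station = "A",
  month   = rep(c(1, 7), each = 8),
  tmax    = c(30, 31, 29, 30, 32, 31, 30, 55,
              18, 19, 17, 18, 20, 19, 18, 2)
)

# ave() applies the helper within each station-by-month group; it returns the
# logical result coerced to 0/1, hence the comparison with 1.
dat$outlier <- ave(dat$tmax, dat$station, dat$month, FUN = flag_outlier) == 1

dat[which(dat$outlier), ]
```

Grouping by month means a value of 55 is judged against other January values only, so a hot-season value is not flagged merely for being warmer than the cool season.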

Finally - for discussion - is that this dialogue will produce a Comments-type data frame at the daily level. It is a filter, providing information on those observations that are odd. It would also be useful to have a summary, because the sheets used to enter data are at the monthly level.
This summary could be done perhaps through the dialogue. But otherwise it might be the standard summary dialogue. I like the idea that the quality control process is essentially a filter plus a summary. One reason for having the one dialogue do both is that a few checks are at the monthly level, particularly dry month. This dialogue (and possible summary) would be designed for daily data. Later we will have within-day data to provide checks for, e.g. for the synoptic data that is being computerised in Ghana on a 3-hr basis. And, of course, for the satellite data and other data from automatic stations. There the sheets, when they exist, are on a daily basis. There might then be 2 useful summary levels, with one to a daily basis and the second to the months. Interesting how months are artificial, but seem natural here.

And another finally is to remember that (at some stage) we will return to discuss CDT with Tufa and others at IRI. This will add spatial checks to the mix. Interesting! I suggest, however, that we wait till we have these initial checks first.

rdstern commented 7 years ago

In the outlier calculation above I had assumed this would not be for rainfall. However the adjboxStats function will work sensibly for rainfall. This gives the stats from the adjbox graph in the robustbase package. We won't (yet) have the graphs, because they don't use ggplot - though there is discussion. But we can use this to find the outliers.

For reference there is a similar function called boxplot.stats, which could give the statistics mentioned above in the package called grDevices. I am not sure it is worth installing that package just for the one function, but if we are already using it, then that could make life a bit easier.
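
As a small illustration of the second option: `boxplot.stats()` lives in grDevices, which is part of the standard R distribution, so nothing extra needs installing. The sketch below uses invented rainfall values and sets `coef = 3` to mimic the "3 IQR" default discussed earlier. Note that `boxplot.stats()` works from the hinges rather than from `quantile()`, so its limits can differ slightly from a hand calculation, and it does not adjust for skewness the way the robustbase function does.

```r
# Invented rain-day amounts (mm) with one suspicious value
rain <- c(0.4, 1.2, 3.5, 0.8, 2.1, 5.0, 1.7, 2.9, 4.2, 150)

# coef = 3 flags points more than 3 times the hinge spread beyond the hinges
stats <- grDevices::boxplot.stats(rain, coef = 3)

stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # the values treated as outliers
```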

dannyparsons commented 7 years ago

Could it be better to have separate dialogs for rain and temperature? Seems like there's quite a few element specific options and the wet/dry buttons at the top. Or maybe a general dialog accessed from 3 menu items: rain, temp, general where it displays differently for different elements? Not sure if this is sensible but just had that idea from reading through the layout.

rdstern commented 7 years ago

I wondered about that. But there is sunshine that also has zeros, and it is more than rain and temperature. The other thing I found - through James - is the facility to have limits based on boxplots for rainfall. So there isn't so much that is different. For now we will just use it for rainfall and temperature, so I would go along with whatever is simpler to get working.

rdstern commented 7 years ago

Following discussions with Danny and David I now suggest as follows:

1) There are separate dialogues for the different types of data. So, in the Climatic > Check Data menu there is a line under Boxplot. Then the first item is a dialogue called Rain. The next is Temperature. There will be more, but that is all for now.
2) They will have a similar structure, but I am not sure (yet) they will be similar enough to warrant special custom controls.
3) There are not the radio buttons at the top. Instead there will be check boxes for each check. This will enable multiple checks to be done in one use of the dialogue.
4) They will produce items in the Comments data frame associated with each data frame. So useful to have the Comments "system" working already.
5) The two dialogues will have the usual selector. And the usual receivers with fields Station, Date, Year, Month, Day, Element. They are filled automatically. In the Rain dialogue the Element is filled with the Rain column. In the Temperature dialogue the data fields are labelled "Element 1" and "Element 2". In this dialogue the default is to put Tmax into Element 1 and Tmin into Element 2.
6) Note the reason I am calling them Element and not Rain etc, is to tempt other elements when appropriate. For example there are many other temperature variables that are sometimes measured, e.g. wet bulb, or Tmean.
7) On the lhs is a series of check-boxes, one for each type of check. If there is very little information needed for a check, then all the information is there. Otherwise there may be an Options button leading to a sub-dialogue. This may become tabbed if there are many options.
8) For rainfall the 5 check-boxes have labels Large, Same, Wet Days, Dry Month, Outlier.
9) Large, if checked, gives a single limit (that can be changed) with a field holding a single value. Default is 200, followed by a label "mm". It also checks that values are all non-negative.
10) Same checks if consecutive values are the same. If checked it shows "for [up-down] field consecutive values are the same (ignoring zeros)". The default is 2 and the minimum is also 2.
11) Wet Days is the spell length for consecutive rain days. Usually the spell is as given in the spells dialogue, and here between 0.05 and 200, so it gives consecutive rain days. Report when there are too many (so perhaps the data entered were really temperatures). Default is 10 days. Again an up-down control.
12) Dry Month calculates the number of rain days in the month - days with values of 0.1mm or more. It notes if this is zero. It might comment on the 31st day of the month. This is an example where ideally it would check all 12 months unless told otherwise. This could be on a sub-dialogue which would allow ticks on the months to be checked. Default is for all months to be checked.
13) Outlier is different for rainfall than for temperatures, because it uses the robust measure of scale. Leave this for now, or look at the robustbase package for the limits.
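
To make items 11 and 12 concrete, here is a rough base R sketch of the Wet Days and Dry Month checks. Everything in it is illustrative: the data are invented, the variable names are not from R-Instat, and it assumes "too many" means a spell of at least the default 10 days, which the description above leaves open.

```r
# Invented daily rainfall: a completely dry June, then 12 wet days in July
dates <- seq(as.Date("2015-06-01"), as.Date("2015-07-31"), by = "day")
rain <- numeric(length(dates))
rain[35:46] <- 3.2                      # 5 to 16 July

# Dry Month: count rain days (0.1 mm or more) per month, note months with none
rain_day <- !is.na(rain) & rain >= 0.1
month_id <- format(dates, "%Y-%m")
n_rain_days <- tapply(rain_day, month_id, sum)
dry_months <- names(n_rain_days)[n_rain_days == 0]

# Wet Days: flag every day in a spell of consecutive rain days (values between
# 0.05 and 200) that is at least `max_wet` days long. Assumed threshold rule.
max_wet <- 10
wet <- !is.na(rain) & rain > 0.05 & rain <= 200
runs <- rle(wet)
long_wet <- rep(runs$values & runs$lengths >= max_wet, times = runs$lengths)

dry_months
range(dates[long_wet])
```

Dry Month is the one check that is naturally a monthly summary rather than a row filter, which is why it is computed with `tapply()` here while Wet Days stays at the daily level.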

Temperatures in the next comment.

dannyparsons commented 7 years ago

There's definitely some work to do to get the code for this completely working, but this is clear enough that someone could do the design now for the Rain dialog. @maxwellfundi you might want to leave this for a new intern if someone is starting soon.

maxwellfundi commented 7 years ago

@dannyparsons @muthenya has now taken over this.

shadrackkibet commented 7 years ago

@dannyparsons @rdstern we need to discuss the functions to be used in these dialogs.

dannyparsons commented 7 years ago

This will use the general summaries system in some form as it's just doing filters and summaries, I think. And this needs to link with the comments data frame idea which we also haven't implemented. We will try to find time to discuss this soon so that someone can start on the dialog code.

Muthenya commented 7 years ago

This is the general design of the rain dialog: [image]

rdstern commented 7 years ago

That looks very nice for the rainfall. For the temperature(s) there will be Element 1, which is filled with Tmax if it finds it. There is (perhaps) a checkbox with Second Element. Default is checked. If checked, then Element 2 filled with Tmin if it finds it. (OK is not enabled if it is checked until a variate is put there.)

The Large in the case of Rainfall is replaced by Range: Element 1 (Up/Down or field to type a number) then "to" and another field. Default is 0 to 50. (It can be less than zero - perhaps the acceptable minimum is -50. Maximum is 65.) If there are 2 elements then there is another line for Element 2 - with the same fields.

The Second option is the same as for the rainfall. Called Same: But here the default is 4. The third option is Jump: Element 1 Field with default value of 20. Can be anything between 1 and 25. If there is a second element then Element 2 with the same default. The 4th option is Difference: (Element 1 - Element 2) < Another field with default 0 and anything between -5 and +5 is acceptable here. The fifth option is Outlier: Number of IQRs Field with a default of 3.
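
The Jump and Difference options above amount to two short logical expressions. A sketch with invented Tmax and Tmin values, using the defaults mentioned (20 for Jump, 0 for Difference); the variable names are illustrative only:

```r
# Invented daily temperatures; day 3 looks like a keying error in Tmax,
# and on day 6 Tmin exceeds Tmax
tmax <- c(28.4, 29.1, 9.0, 28.7, 27.9, 30.2)
tmin <- c(15.2, 16.0, 14.8, 28.7, 15.5, 31.0)

jump_limit <- 20
diff_limit <- 0

# Jump: absolute change from the previous day; the first day has no previous
# value, so it is NA rather than FALSE
jump <- abs(c(NA, diff(tmax))) > jump_limit

# Difference: (Element 1 - Element 2) below the limit
difference <- (tmax - tmin) < diff_limit

which(jump)
which(difference)
```

With a strict `<` and a limit of 0, a day where Tmax equals Tmin (day 4 here) is not reported, so a user who wants those days would need a slightly positive limit.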

dannyparsons commented 7 years ago

@rdstern When is this needed for?

rdstern commented 7 years ago

I'd really like this as part of our work for Lesotho, and possibly therefore in time for a visit there in early December.
It needs the comments ideas though, so may have to wait I suppose to January?

I would put it lower in priority than the (different) dialogue of the CLIMDEX climate change routines, largely because I hope getting that working properly should be quite quick and would be very useful - at least in motivation.

dannyparsons commented 7 years ago

Ok that sounds reasonable. I have moved it to an unknown date until we have a fixed milestone for it.

rdstern commented 6 years ago

I have moved this back to milestone 0.4.7 now with a question. Could we get this dialogue (and perhaps also one for temperatures) at least partly working soon. My idea would be that we split the problem into 2 parts. First we implement the procedures, but simply writing to the output window. Later we can implement the comments.

The dialogue already looks interesting, and I hope the code behind will be simple, at least for a few of the checks. The results would already be very useful in the output window.

dannyparsons commented 6 years ago

I think that since it will just use our calculation system then the R code is no more difficult than other climatic dialogs and displaying in the output window would be simple. If it will be useful for EUMETSAT workshop then I think we should do it, otherwise it could be done shortly after.

rdstern commented 6 years ago

It would certainly be of some use for EUMETSAT, because we will be having data sets - and good to check. It will also be particularly useful immediately afterwards, i.e. to take to Lesotho.

rdstern commented 6 years ago

This is a topic - both for rainfall and for temperatures - that I hope we can make some progress on this week, and maybe "finish" some aspects next week. Those options that are complex we disable, but it would be good to get some of them working. Almost all are essentially filters. It would be good to start on some of them by Friday, so Danny/David can advise on the R part.

dannyparsons commented 6 years ago

These dialogs should now be easy to add the code to. It is very similar to the Peaks option on the Extremes dialog in that it is just doing a filter on the data. It should use our calculation system. One difference is that here, there is no grouping by Year, only by Station.

Please read again here for Rain dialog https://github.com/africanmathsinitiative/R-Instat/issues/2392#issuecomment-329208762 and here for Temperature dialog https://github.com/africanmathsinitiative/R-Instat/issues/2392#issuecomment-331951186 to understand what the different filters we want are. Discuss with @rdstern for any further clarification.

Below are the definitions of the filter expression for each checkbox on the dialogs. The ones not mentioned below can be disabled for now - they need more thought.

Rain

Temperature

If multiple checkboxes are checked, then the filter becomes an OR. For example if Range and Jump are checked on Temperature dialog, then the filter would be: Element1 > r1 & Element1 < r2 | abs(c(NA, diff(Element1))) > n

So even if it's multiple checkboxes, it's still one filter. This should all be possible using a series of ROperators and RFunctions.
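
In plain R, outside the calculation system, combining the ticked checks into one filter could be sketched as below. The names `checks` and `selected` are hypothetical, the data are invented, and the Range test is written here so that it flags values outside the limits `r1` to `r2`.

```r
element1 <- c(12, 55, 30, 28, 31, -3)   # invented temperatures
r1 <- 0
r2 <- 50
n <- 20

# One logical vector per checkbox
checks <- list(
  Range = element1 < r1 | element1 > r2,
  Jump  = abs(c(NA, diff(element1))) > n
)

# Hypothetical: the names of the checkboxes the user ticked
selected <- c("Range", "Jump")

# OR the selected checks together into a single filter
combined <- Reduce(`|`, checks[selected])

which(combined)
```

The first element of `combined` is `NA` because the Jump test is undefined for the first day and the Range test is `FALSE` there; `which()` drops it, and a filter step would normally need to do the same.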

I'm assigning this to @shadrackkibet since he did the Extremes dialog before and @Muthenya since he did the designs for these dialogs so you can work together on both dialogs. This is a 0.4.8 task now so can be your priority.

Please check you understand this today and get back to me if not. @rdstern can clarify any points on the data and filters after today, but I won't be able to help on the dialog code until I get back.

shadrackkibet commented 6 years ago

One difference is that here, there is no grouping by Year, only by Station.

Now, what happens when we do not have a station in the data set? For example, in the Dodoma data set?

dannyparsons commented 6 years ago

Now ,what happens when we do not have a station in the data set? for example in the Dodoma data set?

That's just one station then so there's no need to group. So on the dialog, the Station receiver is optional and the grouping sub calculation isn't needed.
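
A quick base R sketch of why the grouping matters for checks that look at neighbouring rows, such as Jump. The data frame and column names are invented, and `ave()` stands in for the grouping sub calculation:

```r
# Two invented stations stacked in one data frame
dat <- data.frame(
  station = rep(c("A", "B"), each = 4),
  tmax    = c(30, 31, 29, 30, 5, 6, 4, 30)
)

# Without grouping, the first row of station B is compared with the last row
# of station A and is wrongly reported as a jump
ungrouped <- abs(c(NA, diff(dat$tmax))) > 20

# With grouping, differences are taken within each station only
grouped <- ave(dat$tmax, dat$station,
               FUN = function(x) abs(c(NA, diff(x)))) > 20

which(ungrouped)
which(grouped)
```

When there is only one station, the two versions give the same answer, which is why the Station receiver can be left empty.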

dannyparsons commented 6 years ago

This now has an initial implementation but more is needed and some info from this issue is still useful.

rdstern commented 6 years ago

One small problem to be fixed in the next version is the limits, i.e. the first check of the temperatures. Currently they give all values within the range. They should give the values outside. So change for example tmin > 0 & tmin < 50 to tmin < 0 | tmin > 50.

For now a "work-around" is to get the script and make the change there. Works nicely!
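
For anyone editing the script by hand, the difference between the two expressions on some invented Tmin values:

```r
tmin <- c(-2.5, 12.0, 49.9, 51.3, 0, 50)   # invented values

inside  <- tmin > 0 & tmin < 50    # what the dialog currently generates
outside <- tmin < 0 | tmin > 50    # what the check should report

tmin[inside]
tmin[outside]
```

Values exactly on a limit (0 and 50 here) appear in neither set, so the corrected filter treats the limits themselves as acceptable.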

rdstern commented 6 years ago

It is excellent that this dialogue has been implemented. I hope the work on completing it further can continue this week.

rdstern commented 6 years ago

Edits on the rainfall dialogue:

  1. The default for Large is 200 which is fine. But it says 0 in the up-down. Change it to match the code.
  2. Also add a check of negative values automatically Val > 200 | Val < 1E-8.
  3. The label could perhaps say (mm (or negative))
  4. Change the label Consecutive to Same - as it is on the temperature dialogue.
  5. Change the label Wet days to Consecutive and after the up-down make the label rain days, rather than just days.

Edits on the temperature dialogue:

  1. The condition on the first test has to change as mentioned in the comment above.
  2. Implement for the second column when it is included. In the labels for the elements on the right after Element 1 put (Tmax) and after element 2 put in brackets (Tmin).
  3. Move the Element 1 etc left a little (and/or make the dialogue a little wider), so that after the boxes you can include mm.
  4. Add units after the others. For Same it is days and it is mm for all the others.
  5. For Jump the default could be 10 for each element.
  6. Get rid of Element 1 and Element 2 for the jump test, i.e. the label is just Jump. But in the code (R command) make it clear that there are 2 elements there, with the same value. Then (in the rare situations that we want to change the limits for one of them), we do that through using the script file.
  7. The Difference checkbox is only enabled if the second element is filled.
  8. As described in #4255 add radio buttons to the top of this dialogue. The dialogue should be about the same size as now, with this addition, because we have saved a line through item 6 above.

Both dialogues: This may be more complicated - is it possible to add an extra column to the filtered rows of data. It has the name Test and it gives the same label as in the dialogue. It could be made into a factor column. Then it is clear what test was not satisfied in each case.
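
One possible shape for the extra Test column, sketched in base R rather than through the calculation system. The data, the `checks` list and the limits are invented for illustration; a row that fails two tests appears once per test.

```r
# Invented daily data
dat <- data.frame(
  date = as.Date("2016-01-01") + 0:5,
  tmax = c(28, 52, 30, 29, 31, 30),
  tmin = c(15, 16, 31, 14, 15, 16)
)

# One logical vector per ticked check, named as on the dialogue
checks <- list(
  Range      = dat$tmax < 0 | dat$tmax > 50,
  Difference = (dat$tmax - dat$tmin) < 0
)

# Stack the rows that fail each check, labelled with the check name
flagged <- do.call(rbind, lapply(names(checks), function(nm) {
  rows <- which(checks[[nm]])
  cbind(dat[rows, ], Test = rep(nm, length(rows)))
}))
flagged$Test <- factor(flagged$Test, levels = names(checks))

flagged
```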

shadrackkibet commented 6 years ago

Also add a check of negative values automatically Val > 200 | Val < 1E-8.

When you say automatically, do you mean adding 1E-8 by default? Or do we make it optional for the user, which would warrant another control?

shadrackkibet commented 6 years ago

Get rid of Element 1 and Element 2 for the jump test, i.e. the label is just Jump. But in the code (R command) make it clear that there are 2 elements there, with the same value. Then (in the rare situations that we want to change the limits for one of them), we do that through using the script file

@rdstern The current command for this is abs(c(NA, diff(Element1))) > n How will the R command look now?

rdstern commented 6 years ago

Ooops, in the QC check for rainfall above, I put Val > 200 | Val < 1E-8 as the idea, if the user put 200 as large. This would include all zeros as odd and you can have dry days (even in England!) It should have been -1E-8. To make it clearer perhaps the label could say Large (or Negative). If that doesn't fit neatly on the one line, then put it as: Large 200 (or Negative Value) i.e. on 2 lines and then could make it even clearer by adding the word Value.
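
The effect of the sign is easy to see on a few invented rainfall values, including zeros and a -99 missing-value code:

```r
rain <- c(0, 0, 12.4, 250, -99, 0.2, -0.3)   # invented values
large <- 200

intended <- rain > large | rain < -1e-8   # large or negative
mistaken <- rain > large | rain < 1e-8    # also reports every dry day

which(intended)
which(mistaken)
```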

Patowhiz commented 6 years ago

@shadrackkibet 1 ) This is more of a question and a suggestion . In the Define Climatic Data dialog we have Minimum Temperature and Maximum Temperature , is it possible to have them pre filled by default to element2(Tmin) and element1(Tmax) respectively ? I can see the Station and Date are being pre filled by default or has that been omitted on purpose? especially after the user has defined the climatic data.

Comment by RDS: I would like that very much. Except that the QC is for temperatures overall. Now the most important are Tmax and Tmin, but many stations also measure other temperature elements, particularly dry bulb and wet bulb that are used to calculate humidity. In agrometeorological stations there are often measurements of ground temperature, and temperatures at different soil depths. And satellite data now also measures ground temperature.

I was told that if a field is pre-filled, then (with our current code) that is all that can be used in that field. When we have more time perhaps there could also be code so we can pre-fill with the most used option, but also permit others to be substituted when we wish.

2) In the Acceptable Range (Element1) and Acceptable Range (Element2) , I presume the "mm" after their respective ucrNud controls are unit of measurements. Is "mm" a correct unit of measurement for this dialog? Or does it refer to something else other than millimeter. I presume the same applies to Jump and Difference. Temperature measurement are recorded in Kelvin or Celsius.

Comment by RDS: Humpf - well spotted! mm is perfect for rainfall. Temperatures are usually in degrees C. If we can give symbols then a little superscript circle is often used for the word degree.

3) This is more of design. @rdstern now that, according to the current dialog design, Element1 will always be Tmax and Element2 will always be Tmin, why can't we just rename the Acceptable Range checkboxes to Acceptable Range(Tmax) and Acceptable Range(Tmin)? And also change the labels Element1(Tmax), Element(Tmin) to just Tmax, Tmin on their respective receivers? I think that will make the design much simpler and more straightforward.

Comment by RDS: Happy to add the (Tmax) and (Tmin) after the element name. I would like to make the point that the temperature element could be other than Tmax and Tmin as explained above. But perhaps having made that point earlier, i.e. I would like to leave the main field as Element 1 (Tmax), then perhaps having just Tmax here would be OK? It will usually be Tmax - and that would make the labelling simpler.

4) @rdstern Still on labelling from my understanding, we are using Acceptable Range to mean that we don't want values of ElementX from .. to ... . I find that a bit confusing cause in that case we are simply defining Unacceptable/Undesired Range (any of its antonym) , which clearly expresses the intention.

Comment by RDS: The acceptable range is those values between the two limits. So I am here quite happy that -50 to 50, say, is the acceptable range. So we will be noting anything that is outside this range as unacceptable.

5) @shadrackkibet I can see the default lowest and highest values for the Tmax range are -50 and 65 respectively. But for Tmin it's -50 to 50. Is that intentional? I thought they were to be the same.

Comment by RDS: I forget what I suggested. I think they could be made more different instead, to emphasise that night-time temperatures are usually lower than day-time maximum temperatures. Let's make Tmax -35 to 65 and Tmin -50 to 40. (And I hope users in Africa will not want Tmax to be so low - that's for a Canadian winter!)

6) From my understanding of Difference, it's meant to get the difference between Tmax and Tmin. If that is its true purpose, then I think it's odd to have its highest value set as 5. I could want to have a difference of 10.

Comment by RDS: The default here should be zero. Then, if you are considering the up-down limits I would have -10 and 10 as the sort of limits on the control. Incidentally I find the standard up-down control to be painful sometimes. If we are really having our custom controls, then I would love one that: a) Makes it very easy to type a value in. b) Have an option for some uses that you can choose to type a value that is different from those given automatically by the up-down. c) To me the usual control is sometimes a convenience, but in other occasions it is a restrictive pain.

7) Currently a new data frame is created with data after the "filtering" has been done. The naming is done automatically, @rdstern I think it will be much better if we had a ucrSave control for a user to get control over the naming of the new data.

Comment by RDS: Very happy with this as an improvement. I am still waiting for our facility to be able to add Comments to a data frame, see #2300. It should eventually link to that facility perhaps. Currently we are sorting out the initial QC without the option of having a comments data frame. As you see, that has been on our wish-list for a long time.

8) TheGMetData2.RDS (Ghana_Data) has its Tmax and Tmin data to 1 decimal place (float values), yet the filter ranges (the ucrNuds controls used here) only accept whole values (integer values). Is this meant to tell the user clearly that ranges and differences cannot be given as decimal numbers? For instance, if I didn't want values in the range 21 to 22.4, probably because I want to treat 22.4 as 22 and 22.5 as 23, this won't be possible.

Comment by RDS: Yes we should change that. We could want to set the difference to 0.5 degrees perhaps. But only rarely. It is a good example where I would prefer the up-down to go in whole degrees, but you could type any value. In fact I would be just as happy by having the default and you type a different value if you want one. I find our current up-down control painful enough that I would prefer to type.

rdstern commented 6 years ago

I tried the QC Temperature analysis with a set of data from Lesotho. The option I tried was the difference. I left the setting of 0 as the default. The error message was as follows with the important bit in bold:

Error running R command(s)

Error in parse(text = x, keep.source = FALSE) :

:2:0: unexpected end of input
1: ~**(Tmax - Tmin) <**
^

The error occurred in attempting to run the following R command(s):

grouping <- instat_calculation$new(type="by", calculated_from=list("TempData"="Station"))
temp_filter <- instat_calculation$new(type="filter", function_exp=**"(Tmax - Tmin) < "**, calculated_from=list("TempData"="Tmax"), manipulations=list(grouping), save=2, result_data_frame="Temperature_Filter")
InstatDataObject$run_instat_calculation(calc=temp_filter, display=FALSE)

The expression isn't complete. It has omitted the default of 0 (zero). I then ran it again and changed the 0 into 1 and it ran fine. The expression was then "(Tmax - Tmin) < 1".

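A possible guard against this kind of failure, sketched in base R. The helper names `build_diff_expression` and `is_complete` are hypothetical and are not part of R-Instat; the sketch only shows falling back to the default when the control supplies no value, and checking that the text parses before it is passed on.

```r
# Hypothetical helper: build the Difference filter expression from the
# value typed in the control, falling back to a default when it is empty.
build_diff_expression <- function(threshold = "", default = 0) {
  value <- trimws(as.character(threshold))
  if (length(value) == 0 || is.na(value) || !nzchar(value)) {
    value <- as.character(default)
  }
  paste0("(Tmax - Tmin) < ", value)
}

# Hypothetical helper: TRUE when the text parses as complete R code.
is_complete <- function(expr_text) {
  result <- try(parse(text = expr_text, keep.source = FALSE), silent = TRUE)
  !inherits(result, "try-error")
}

print(is_complete("(Tmax - Tmin) < "))   # the expression from the error above
print(build_diff_expression(""))         # empty control falls back to 0
print(build_diff_expression(1))          # a typed value is used as given
```

Checking with `parse()` before running the calculation would let the dialogue report an incomplete expression itself rather than surfacing the raw R error.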
rdstern commented 6 years ago
  1. In the temperature QC the labels should be changed from mm (which is for rain) to deg (for degrees).
  2. When there are 2 columns in the temperatures (usually tmax and tmin), then the "same" and "diff" calculations should be for both of these columns. I think currently they are only for the first of them, i.e. usually for Tmax only.
  3. I used the data options to filter the data so the results would be for a single station. The filter worked fine, but the QC commands seem to ignore that. Could the dialogue please take account of the filters?
  4. The commands do a calculation, e.g. (Tmax-Tmin), and then use this to define the filter. It would be useful if we could separate these two operations. So we do the calculation (producing a new column in the original data frame). This is sometimes 2 columns if repeated for max and min. Then we filter. This would mean the filtered values also include the calculated column, which is useful. For example, when we filter because there is a big difference (say > 10 degrees) from the day before, we see the resulting value, but not that on the day before. It would be great if that were there as well. Now this can always be done by a series of calculations on the original data - so we can do it now. But ideally it would be possible automatically in the QC command. Then there could be a single checkbox in the dialogue "Keep calculated columns". Default could be checked. If you un-check, then it either doesn't produce the column, or it produces it and then deletes it.
rdstern commented 6 years ago

Some of the items above have been addressed. But others remain - and there are more! This is for the temperatures, i.e.

1) The label mm has been changed to deg C. Thanks.
2) When 2 columns are given the calculations should be for both of them.
3) The QC should take account of a filter.
4) New - the cursed up-downs are still there! This is actually a good dialogue for them. But is it possible for them to jump by 1 degree up or down, while typing values with one decimal place is still possible? This applies to Jump and Difference. If not, then change Jump to allow one decimal place. Leave the others as they are.
5) If there is only 1 element, i.e. Element 2 is blank, then disable both Acceptable Range (Element2) and Difference. (They need both elements to make any sense!)

The two items below may be more difficult to include.

6) Have a checkbox in the dialogue, default checked, which says "Include calculated columns". If checked, then the calculated columns are added to the daily file. Then they will also (automatically) be included in the filtered sheet.
a) For Jump this produces a column called Jump1 (and Jump2 if there are both columns). This is simply x(n) - x(n-1).
b) For Difference it produces a column called Diff, which is just (usually) (Tmax - Tmin).
7) (This is likely also to need Danny/David help, or work by Beth.) We need a factor column in the filtered sheet that shows which check was not satisfied. This is a factor column with as many levels as there are checks. It could have the labels Range1, Range2, Same1, Same2, Jump1, Jump2, Diff.
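The calculated columns in item 6 could look something like the following base R sketch. The data, the limits of 10 and 0, and the use of the absolute jump are assumptions for illustration; the sketch also assumes the rows are already in date order within each station.

```r
# Hypothetical two-station daily data, already sorted by date within station.
temps <- data.frame(
  Station = rep(c("A", "B"), each = 4),
  Tmax    = c(30, 31, 45, 32, 25, 26, 24, 25),
  Tmin    = c(18, 19, 20, 33, 12, 13, 11, 12)
)

# x(n) - x(n-1); the first day has no previous value, so it is NA.
jump <- function(x) c(NA, diff(x))

# ave() applies the function within each station, so a jump is never
# computed across the boundary between two stations.
temps$Jump1 <- ave(temps$Tmax, temps$Station, FUN = jump)
temps$Jump2 <- ave(temps$Tmin, temps$Station, FUN = jump)
temps$Diff  <- temps$Tmax - temps$Tmin

# Filter afterwards: the calculated columns travel with the flagged rows.
jump_limit <- 10  # assumed limit on the day-to-day change
diff_limit <- 0   # assumed limit on Tmax - Tmin
failed <- (!is.na(temps$Jump1) & abs(temps$Jump1) > jump_limit) |
  (!is.na(temps$Jump2) & abs(temps$Jump2) > jump_limit) |
  temps$Diff < diff_limit
filtered <- temps[failed, ]
print(filtered)
```

Because the calculation and the filter are separate steps, unchecking "Include calculated columns" would only need the three new columns to be dropped after filtering.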

rdstern commented 6 years ago

On item 7, in discussion with David he said that the dialogue uses our calculation system. This means that "behind the scenes" it produces logical columns for the subset. So what should be easy is to be able to show these filter columns. It would all be very easy if they are initially shown in the main data frame. Then they will automatically be transferred to the subset. Perhaps this could be a checkbox on the dialogue - default checked - saying Include logical columns.

bethanclarke commented 6 years ago

@dannyparsons do we have any R-code that produces a factor column telling us which check was not satisfied?

dannyparsons commented 6 years ago

The logical columns would be easier because it is the same expression as the filter e.g. "Rain > 100" done as a mutate (calculation) will give the logical column
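As an illustration of this point, here is a minimal base R sketch (not the R-Instat calculation system itself) in which the same expression text is evaluated as a new column instead of being used as a filter. The data frame and the column name `rain_check` are assumptions.

```r
# Hypothetical rainfall data.
rain_data <- data.frame(Rain = c(0, 12.5, 140, 3, 101))

# The same text that would be used as the filter expression.
check_expr <- "Rain > 100"

# Evaluating it inside the data frame gives one TRUE/FALSE per row
# (the base R analogue of a mutate).
rain_data$rain_check <- eval(parse(text = check_expr), envir = rain_data)

# Using the column as a filter gives the same rows as filtering directly.
subset_rows <- rain_data[rain_data$rain_check, ]
print(subset_rows)
```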

rdstern commented 6 years ago

I would be happy with those logical columns being produced. I assume there would be one for each check. I would be quite happy for them initially to be produced for the main data frame - and then the filtering would automatically carry them to the filtered data frame.
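A rough sketch of that workflow in base R, assuming hypothetical data and three of the checks only. The limits follow the defaults suggested earlier in this thread and the column names reuse the labels proposed for the factor column. The final step, deriving a factor that names the first failed check, is one possible way to get from the logical columns to the factor requested earlier, not an existing R-Instat feature.

```r
# Hypothetical data; limits follow the defaults suggested earlier
# in this thread (Tmax -35 to 65, Tmin -50 to 40).
temps <- data.frame(
  Tmax = c(30, 70, 28, 33, 29),
  Tmin = c(18, 20, 45, 35, 17)
)

# One named expression per check.
checks <- c(
  Range1 = "Tmax < -35 | Tmax > 65",
  Range2 = "Tmin < -50 | Tmin > 40",
  Diff   = "(Tmax - Tmin) < 0"
)

# Add one logical column per check to the main data frame.
for (name in names(checks)) {
  temps[[name]] <- eval(parse(text = checks[[name]]), envir = temps)
}

# Keep rows failing any check; the logical columns come along automatically.
any_failed <- Reduce(`|`, temps[names(checks)])
filtered <- temps[any_failed, ]

# Optional: a factor naming the first check that each kept row failed.
first_failed <- apply(as.matrix(filtered[names(checks)]), 1,
                      function(row) names(checks)[which(row)[1]])
filtered$Check <- factor(first_failed, levels = names(checks))
print(filtered)
```

A row can fail several checks at once (the third row here fails both Range2 and Diff), so a single factor loses information that the logical columns keep.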

rdstern commented 6 years ago

I wonder if someone is working on these dialogues. The items above, raised on March 19th, still remain.

For example, in the temperature QC, when both elements have been included then still the QC checks are only on the first element.

Some may need David's help and he will be around from Thursday.

shadrackkibet commented 6 years ago

I am happy to look into this in the course of the week. @bethanclarke are you working on this?

rdstern commented 6 years ago

I hope that Beth will be looking into these dialogues when David is back. But that will be particularly to add the box-plot outliers, currently not enabled - rather than the other features.

shadrackkibet commented 6 years ago

ok, i will be looking into this.

shadrackkibet commented 6 years ago

For example, in the temperature QC, when both elements have been included then still the QC checks are only on the first element.

@rdstern am I correct to say that you want these to be strict filters when two or more checks are applied, i.e. instead of or (|), this should be changed to and (&)?

For example, when I check Acceptable Range (Element1) and Acceptable Range (Element2), we currently have:

temp_filter <- instat_calculation$new(type="filter", function_exp="Tn <= 0 | Tn >= 30 | Tx <= 10 | Tx >= 50", calculated_from=list("Ghana_Data"="Tx"), manipulations=list(grouping), save=2, result_data_frame="Temperature_Filter")

This is wrong because the calculated_from list doesn't include the "Tn" column. It would change to:

temp_filter <- instat_calculation$new(type="filter", function_exp="Tn <= 0 | Tn >= 30 & Tx <= 10 | Tx >= 50", calculated_from=list("Ghana_Data"="Tx","Ghana_Data"="Tn"), manipulations=list(grouping), save=2, result_data_frame="Temperature_Filter")
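One thing worth checking when mixing the two operators: in R, `&` is evaluated before `|`, so the mixed expression does not group as "(Element 1 check) and (Element 2 check)". The base R sketch below uses hypothetical rows to compare the mixed expression with OR throughout and with an explicitly parenthesised AND.

```r
# Hypothetical rows chosen to separate the readings of the expression.
temps <- data.frame(
  Tn = c(-5, 35, 15, 35),
  Tx = c(30, 30, 55, 55)
)

# Mixed operators: `&` is evaluated before `|`, so this means
# Tn <= 0 | (Tn >= 30 & Tx <= 10) | Tx >= 50.
mixed <- with(temps, Tn <= 0 | Tn >= 30 & Tx <= 10 | Tx >= 50)

# OR throughout: a row is flagged when any single limit is breached.
all_or <- with(temps, Tn <= 0 | Tn >= 30 | Tx <= 10 | Tx >= 50)

# Parentheses are needed to require a failure on both elements.
both <- with(temps, (Tn <= 0 | Tn >= 30) & (Tx <= 10 | Tx >= 50))

print(data.frame(temps, mixed, all_or, both))
```

The second row (Tn = 35, Tx = 30) is flagged by OR throughout but missed by the mixed expression, which is the kind of case a QC check would normally want to catch.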

rdstern commented 6 years ago

Thank you for showing me this. I think it is the OR all the way through - as it was. And great that it now includes both Tmax and Tmin.

And that gives one of the filters. What does the option save = 2 do? Is there another setting that would enable the filter to be saved into the original data frame?

And do the other checks re-use the same filter name, or have a different one? Ideally they would be different?

shadrackkibet commented 6 years ago

save=2 means that both the calculation and the result are saved. Whether there is a setting that will enable the filter to be saved into the original data frame, I am not sure. @dannyparsons can respond to this.

shadrackkibet commented 6 years ago

@rdstern for Same and Jump, is the calculation supposed to check both Element 1 and Element 2 if they are both there? I initially implemented this for one element only, i.e. Element1.

rdstern commented 6 years ago

That's correct.
And Difference is only sensible if both are there. So it is disabled if only one element is included.

shadrackkibet commented 6 years ago

Will it be good to auto fill Element1 and Element2 receivers at this stage?

rdstern commented 6 years ago

I would really like that. But there is an issue (in the future) that we may want sometimes to use the dialogue for other climatic elements, e.g. dry-bulb and wet bulb. And sometimes we want to just have Tmin as the only element. Danny seems to think this is difficult, i.e. if it is filled automatically, then currently this is the only element allowed.

I'll check, but this is (unfortunately) not urgent at this stage.

On the other hand I am really pleased that you are working on this dialogue. Are you able to check why it doesn't take account of the filters, or is that a Danny question.

David says that what is easy is to have a checkbox, which is, by default, checked. When checked it adds these elements. When unchecked they can be any element. He says you could perhaps do that. I am not sure whether that is a user control or not?

One consequence (in this particular dialogue) is that I might want to do just the checks on Tmin, alone. Currently I do that by just putting it as Element 1. Now it would be natural to have them completed automatically, then turn off the checkbox and then delete Tmax. Then Tmin is there alone, but as Element2!

If this gets complicated, then leave for now. There are many other things to try to do - and ideally quite a list before Lesotho - where the QC will be important for your week.

shadrackkibet commented 6 years ago

On the other hand I am really pleased that you are working on this dialogue. Are you able to check why it doesn't take account of the filters, or is that a Danny question.

This is general to most climatic dialogs using the calculation system. Danny said he will have a look at it. That is also at issue #4520.