IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Verification of satellite data in R-Instat #5630

Open rdstern opened 4 years ago

rdstern commented 4 years ago

This is an important area to work on and also to add facilities to R-Instat. I collect here the initial e-mail correspondence from Shadrack and Danny. I then consider what we might do in R-Instat, starting with activities in the forthcoming code sprint.

"Dear Roger,

This is just to report some of my findings on the methods used to compare satellite and in-situ data. Following up on my discussion with Rija (CDT developer), he shared some papers. One looks interesting to me and can be found here. However, I am still going to check others. There are various methods used in this paper, which include:

  1. Bias - which we have in R-Instat but the way it is calculated here is a bit different, so we might want to offer this option as well.
  2. Mean Absolute Error (MAE)
  3. Root Mean Square Error (RMSE)
  4. Pearson's correlation (r)
  5. Probability of Detection (POD)
  6. False Alarm Ratio (FAR)
  7. Critical Success Index (CSI)
  8. Volumetric Hit Index (VHI)
  9. Volumetric False Alarm Ratio (VFAR)
  10. Volumetric Miss Index (VMI)
  11. Volumetric Critical Success Index (VCSI)
  12. More in the "hydroGOF" package

We already have some of these somewhere in R-Instat, and there is also a workaround in R-Instat to obtain some of them. The "hydroGOF" R package has a number of these measures implemented, and Rija uses this package for some of the methods he implemented for CDT. I wonder if we can take advantage of this package and implement some of them in R-Instat. In any case, the algorithms for these methods look simple, so implementing them isn't a huge task. My thoughts. Regards, Shadrack"
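For reference, hydroGOF already covers several of the continuous measures in Shadrack's list. A minimal sketch, with made-up obs and sim vectors (not data from the paper):

library(hydroGOF)
obs <- c(5.1, 0.0, 12.3, 3.4, 0.0, 7.8)   # station (in-situ) rainfall
sim <- c(4.8, 0.2, 10.9, 3.9, 0.0, 8.1)   # satellite estimates
me(sim, obs)     # Mean Error, one definition of bias (item 1)
mae(sim, obs)    # Mean Absolute Error (item 2)
rmse(sim, obs)   # Root Mean Square Error (item 3)
cor(sim, obs)    # Pearson's correlation (item 4)
gof(sim, obs)    # a whole table of goodness-of-fit measures at once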

Danny replies as follows:

"Hi Shadrack,

That list looks good. It's similar to what I've seen others using. I've separately sent some papers from Ghana doing comparisons, but I think they mainly use methods from your list.

This site is very comprehensive on methods. It is called methods for forecast verification, but it mentions that these are general methods for comparing two columns of data. What I like about the site is that it describes what each method will tell you, what it is appropriate for, and its advantages and disadvantages.

Danny"

The site from Danny is excellent. I now use these two messages as background to propose some activities to make verification in R-Instat easy - and, of course, useful.

rdstern commented 4 years ago

I therefore suggest we consider the following activities:

  1. Following from the ideas above, there are one (or two or three) obvious packages to add to R-Instat, namely verification (which we should add now) and probably hydroGOF. And, if we add hydroGOF, we might also consider adding its partner, hydroTSM.

a) verification is comprehensive - as far as it goes - in comparing forecast and actual data. That is useful to have anyway, and (for us) the immediate use is to compare satellite and station data. By "comprehensive" I mean that it copes with data of different types. So it can compare continuous with continuous, e.g. temperature values; categorical with categorical, e.g. heavy rain, rain or no rain - a 3 by 3 table; or binary with binary - a 2 by 2 table. (A sketch of the continuous case is given after this list.) I have tried the key functions and we may later want to improve on them, but it is sensible to start with them in their current form. The main command is verify, and this calls other functions from the package, depending on the type of the data. (It has graphs - not ggplot - and I have not yet tried them. The last update was 2015.)

b) hydroGOF titles itself as "Goodness-of-Fit Functions for Comparison of Simulated and Observed Hydrological Time Series". This is what @shadrackkibet states is used (or adapted) in the CDT package. It is good to have some specialist hydrology packages in R-Instat anyway; it sends the right sort of message to users.

c) hydroTSM, the partner package, is "Time Series Management, Analysis and Interpolation for Hydrological Modelling". It includes some potentially interesting data sets at least.
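To illustrate the continuous case from a), here is a hedged sketch with made-up temperature values (the binary case is worked through with actual R-Instat code in activity 2 below):

obs  <- c(21.3, 19.8, 24.1, 22.0, 18.5)   # station temperatures
pred <- c(20.9, 20.4, 23.2, 22.8, 18.1)   # satellite estimates
A <- verification::verify(obs, pred, frcst.type = "cont", obs.type = "cont")
summary(A)   # mean error, MAE, MSE etc. for the continuous case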

  2. If we just add a new package in this way, is there anything we can do so that functions from the new package can be used reasonably easily? It may just be instructions, or perhaps a new simple dialogue. Here is what I did to use the examples from the verify function in the verification package.

a) File > New Data Frame, then read the data from the example into 2 variables called obs and pred.

[screenshot: the obs and pred columns in the new data frame]

b) Then I used Model > Hypothesis Tests > Oneway (obs ~ pred) to be able to access the code. This was as follows:

data1 <- data_book$get_data_frame(data_name="data1")
attach(what=data1)
Last_Test <- oneway.test(obs ~ pred)

Then I swapped the last line above for the lines from the package documentation (roughly), namely:

A <- verification::verify(obs, pred, frcst.type = "binary", obs.type = "binary")
summary(A)

I added the package name in the line above. It ran fine and (for reference) I give some of the results below.

[screenshot: summary output from verify for the binary data]

I then slightly edited the remaining lines in the code that had been generated - to save A rather than Last_Test. The code was then:

data_book$add_model(model_name="A", model=A, data_name="data1")
data_book$get_models(data_name="data1", model_name="A")
detach(name=data1, unload=TRUE)
rm(list=c("A", "data1"))

This saved the object A into the data book, and I could use Prepare > R Objects > View to look at the contents.

This was all satisfactory. Could we make the use of a new command, from a new package, even easier?

a) There could be a dialogue to assist in installing a new package - see #5601.

b) Perhaps a dialogue adding comment lines to a version of the code above, so it is ready for a new line to be inserted?

  3. hydroTSM has some interesting-sounding data sets. Not many can be read into R-Instat, some because they are zoo objects. Could these be used to investigate how a few more structures can be read into a data book in R-Instat? A sketch of one possible route is below.
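As a minimal sketch, assuming we simply flatten a zoo series into an ordinary data frame that R-Instat can already read (OcaEnOnaQts is one of hydroTSM's zoo data sets, used here only as an example):

library(hydroTSM)
library(zoo)
data(OcaEnOnaQts)                 # daily streamflows, stored as a zoo object
flow_df <- data.frame(date = index(OcaEnOnaQts),
                      flow = coredata(OcaEnOnaQts))
head(flow_df)                     # now a plain data frame for the data book
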
rdstern commented 4 years ago

This is however an example of a topic where I think we should make it easy to go beyond what is routine with these packages. But before that, perhaps we should make it easy to at least do some of what is in these packages. I wonder how? Please can we discuss, to conceptualize, as part of the code sprint?

But the paper here has the important title as follows: "Measuring forecast skill: is it real skill or is it the varying climatology?"

In simple terms, we should not just get these skill scores for the variables as a whole. The data will be for multiple stations and multiple seasons (e.g. months), and there are multiple results of interest (e.g. cold days).

It may be that the satellite gives good estimates at some stations and not at others, or in some months and not in others. Or it may model low rainfalls well, but not large rainfalls.

We should be able to investigate these aspects easily. Now look at the printout above from the verification package. It routinely gives lots of skill scores for a single sample (group of data). We would often, instead, like to know the value of a single skill score for each of many groups. Then, in particular, we would like to know whether we need to break up the data into many groups - that's statistical modelling! It will use lm for continuous data and loglinear models for binary data, as sketched below. So we need to make sure our model menu is working ok!
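A hedged sketch of the loglinear idea for binary data (the column names obs, pred and station are hypothetical): form the counts in the obs by pred by station table, then test whether the obs:pred association - the skill - varies across stations.

counts <- as.data.frame(table(obs = data1$obs, pred = data1$pred,
                              station = data1$station))
full    <- glm(Freq ~ obs * pred * station, family = poisson, data = counts)
reduced <- glm(Freq ~ obs * pred + obs * station + pred * station,
               family = poisson, data = counts)
anova(reduced, full, test = "Chisq")   # 3-way term: does skill differ by station?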

I suggest one way we may be able to make quick progress - perhaps in the code sprint in February 2020.

a) With the verification package, we add it to the Model > Hypothesis Tests dialogue. It would be a new keyboard, and at least we add the verify command. We could simplify the coding needed by having a group of keys with an overall label of verify. Then the keys could perhaps be binary, cat (for categorical) and cont (for continuous). Note there are other combinations in the package, but these are for verifying a forecast, not simply for comparing two variables.

b) Could we now include the By control that is, optimistically, already in this dialogue? I would be happy if initially it were just for a single factor - the control is a multiple receiver. I hope this might be as simple as adding tapply to the command line, as in the sketch below.
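A minimal sketch of the tapply idea for b), computing one skill score (here POD) per level of a single factor; the month variable is hypothetical:

pod_by_month <- tapply(seq_along(obs), month, function(i)
  verification::verify(obs[i], pred[i], frcst.type = "binary",
                       obs.type = "binary", show = FALSE)$POD)
pod_by_month   # one POD value per group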

I hope we can discuss this - it would be a great start in adding these facilities and, in general, for our halfway dialogues.

shadrackkibet commented 4 years ago

(a) is easy and straightforward to complete. Is someone working on it? Otherwise I can quickly add these keys.

rdstern commented 4 years ago

I propose we start by implementing the hydroGOF (goodness of fit) package in the calculator and then in the Statistics > Column: Reshape > Summaries dialogue.

a) In the Summaries dialogue we will automatically be able to include factors, e.g. Station and Month, in the calculations.

b) The statistics - listed elsewhere - should go into the 2-variable tab.

c) I suggest any 2 be added first - they can go anywhere on the tab.

d) Then we need to specify the design, and anyone should be able to complete the dialogue. A sketch of the sort of code this could generate is below.
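As an indication only, a hedged sketch of the sort of code the Summaries dialogue might generate, using dplyr (the column names station, month, obs and pred are hypothetical):

library(dplyr)
data1 %>%
  group_by(station, month) %>%
  summarise(bias = hydroGOF::pbias(pred, obs),
            rmse = hydroGOF::rmse(pred, obs),
            .groups = "drop")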

rdstern commented 4 years ago

Some minor edits are suggested. In addition, the default output is not what is needed.

A general solution is to add 3 check-boxes. One is Display Model, checked by default; this corresponds to the line of R code there currently - the Get line. The second is Display Summary; if checked, the Get line is repeated with summary( ) round it, as sketched below. I thought @volloholic suggested 3 check-boxes, but what was the third?
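In terms of the generated code, the two known check-boxes would correspond to something like this (a sketch only, reusing the Get line from the earlier example; the third check-box remains the open question above):

data_book$get_models(data_name="data1", model_name="A")            # Display Model: the current Get line
summary(data_book$get_models(data_name="data1", model_name="A"))   # Display Summary: the Get line inside summary()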

rdstern commented 4 years ago

I am reopening this issue. I am really happy with the progress we have made with the hydroGOF package, and that part is complete. There is a second package, called verification, which is more general and is discussed above. This should soon become its own issue, because (with Danny's current work) we will be working on it in the next few months.

shadrackkibet commented 4 years ago

@rdstern I think everything has been sorted out here. Can this be closed?

rdstern commented 4 years ago

I am happy for this to be closed. It concerns the summary statistics used for verification and hence those that are in the verification and hydroGOF packages.

However, a small task I found recently, for @shadrackkibet or another of the AIMS interns, is that the openair package seems to have a useful similar function, called modStats, and also some possibly useful references as follows:

Legates, D.R. and McCabe, G.J. (1999). Evaluating the use of goodness-of-fit measures in hydrologic and hydroclimatic model validation. Water Resources Research, 35(1), 233-241.
Legates, D.R. and McCabe, G.J. (2012). A refined index of model performance: a rejoinder. International Journal of Climatology.
Willmott, C.J., Robeson, S.M. and Matsuura, K. (2011). A refined index of model performance. International Journal of Climatology.

We can't use the function as it stands, but we may find something useful to rewrite ourselves. Hopefully we will find that the statistics are included in those we already have. A sketch of calling the function is below.
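For reference, a hedged sketch of calling openair's function (note the spelling modStats; its type argument already gives per-group summaries, much like the By control discussed above; the data frame and column names here are hypothetical):

library(openair)
modStats(my_data, mod = "pred", obs = "obs",   # columns holding modelled and observed values
         type = "station")                     # one row of statistics per station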

There is a manual for openair here. The function is described in Chapter 27.

shadrackkibet commented 3 years ago

@HawardKetoyoMsatsi you have been looking into the openair package recently. Could you investigate the modStats() function to check whether there is any other measure we could add to our current comparison summary statistics?