IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0
38 stars 102 forks source link

Homogenization of climatic data #5461

Closed shadrackkibet closed 4 years ago

shadrackkibet commented 5 years ago

Is your feature request related to a problem? Please describe. I have now started thinking about the implementation of homogen function from climatol package in R-Instat.

Describe the solution you'd like I suggest this as a special dialog in the climatic menu. I am yet to come up with the design of the dialog. This is still at an early stage as the Rcode is not yet ready.

Describe alternatives you've considered There exist several methods to homogenize climatic data. Climatol R package implements the famous SNHT which is a widely accepted. Good to check other R packages as well.

Additional context I have tried using this function in R initially, applied this on KMD dataset. A suggestion is to ask the author of climatol package to add an option to pick a dataframe, this will make it easy for us to implement this in R-Instat. Currently, this function has lot's of outputs which as printed in pdf format. Not sure if we want all this. I think we just need the homogenized column. This is currently saved as a separate file .dah, it will be good if this is saved as a column in the original dataframe. This function also prints some summaries, we might want this printed in the output window. There are other separate files saved in the directory.

Below is my code, at the moment this is just for viewing, I will edit it soon and supply sample files so that it is reproducible.

#loading required libraries
library(climatol)
library(tidyverse)
library(reshape2)
library(magrittr)

setwd("F:\\AIMS Cameroon\\AIMS Essay\\data")
rainfall_monthly <- read_csv("rainfall_monthly_data_kmd.csv")
est.c <-  read_csv("synoptic_stations.csv")

#reshape data into the required shape.
dat <- dcast(data = rainfall_monthly, formula = year + month_val ~ station_id, value.var = "sum_Rain")
dat %<>% select(V1="8635000",V2="8639000",V3="8641000",V4="8737000",V5="8834098",V6="8840000",V7="8934096",V8="8935115",V9="8935181",V10="8937022",V11="8937065",V12="9034025",V13="9034088",V14="9034103",V15="9035279",V16="9035318",  
                V17="9036135",V18="9036261",V19="9036288",V20="9037202",V21="9037262",V22="9039000",V23="9135001",V24="9136087",V25="9136130",  
                V26="9136164",V27="9136168",V28="9136208",V29="9136269",V30="9137048",V31="9137089",V32="9237000",V33="9240001",V34="9338001",  
                V35="9339036",V36="9340007",V37="9340009",V38="9439021")

dat <- as.matrix(dat)

#writing required files
write(dat,'pcp_1973-2018.dat')
write.table(est.c, 'pcp_1973-2018.est', row.names = FALSE, col.names = FALSE)
homogen('pcp', 1973, 2018, std = 2, nm = 12) #homogenization

#loading homogenized data
View(import("F:\\AIMS Cameroon\\AIMS Essay\\data\\pcp_1973-2018.rda",which = 'dah'),title = 'homogenized_data')
dannyparsons commented 5 years ago

This is a good starting point. Once we have a full R script we can work on implementation into R-Instat and possible changes we might propose for the package.

shadrackkibet commented 5 years ago

@dannyparsons I now have suggestions to propose to the author of climatol package.

  1. An option to pick a data frame in a tidy format as shown below to be added. This will then make it easy for us to implement his function in R-Instat.
  2. The current homogen() function works with one element at a time. It will be nice to add an option to pick an element\elements from the data frame.
  3. lastly, the homogenized \ Infilled columns should be added back to the data frame. Currently, it is saved as a separate file. image
rdstern commented 5 years ago

@shadrackkibet I want to make sure I have understood what you want in your 3 requests.

First is that you want it to be able to read the initial data from a "tidy" data frame. Should we specify what this tidy data frame will contain, in relation to the way he reads the data now? I notice the example you give is monthly data and the dates are in 2 columns. And the name is given in 2 ways, namely station_id and Name. And you don't mention whether the "tidy" nature means that there are no missing dates (i.e. we have first used infill). I think he assumes that in his formulation, because you don't give a date in specification - as far as I remember.

So, on your item 1 I am keen to be clear what we are asking? I hope I am understanding things correctly.

In your item 2 I think you are saying that climatol currently works one element at a time and you would like it to cope with multiple elements at one go? Am I right? If so, then I am not sure I agree, at least initially. Checking for homogeneity and infilling is a big deal and should be done carefully and will often involve many stations. To involve many stations and many elements together would seem to be to be a risky thing to do.

In the longer term I could imagine that checking for homogenisation and doing infilling on multiple variables (e.g. tmax and tmin) together. But that would be perhaps different methods that do a joint analysis - and that's different. If it is just for convenience, so that KMD can homogenisa and infill all elements together just from one button, then I would be very cautious at this early stage.

Now your third request, which is to write back to the same data frame. This would be great. But I wonder if just writing the data back to a data frame is more practical, at least initially? Then we can merge later. I am just conscious that we are already having problems ourselves on much simpler issues of saving model results (fitted values and residuals) back to the same data frame as the data. And the issues are largely because of missing values. So, I wonder if we should start by asking to write back to a data frame, even if not the same one?

I would also welcome ideas from @dannyparsons . Then I am happy for Danny, or I to write to Jose. His e-mail is available with the climatol package, so you may also want to start by writing directly? If so, then the first message could mention who you are and what you are planning to do and also that this message follows the Open CDMS meeting in Reading in May this year with Danny Parsons and Roger Stern.

shadrackkibet commented 5 years ago

@rdstern thanks for your insights, below is my response.

Should we specify what this tidy data frame will contain, in relation to the way he reads the data now?

Yes, I think we need to be specific. Currently, his package assumes infilling which I assume we do it from daily level and then we can obtain monthly/yearly/seasonal summaries. I am not sure if we can directly infill monthly/yearly data at the moment. homogen function has an option nm to be set when you are dealing with yearly/monthly/seasonal data. Since this function is mainly meant to homogenize/infill monthly, yearly and also seasonal data, I suggest we provide a format that allows these options. At least we have a year, month and other within variables(quarterly e.t.c). For the station name and ID, I suggest we have at least one since we are suggesting one file. His package includes one(Station ID) in one file and in the other it includes both.

I think you are saying that climatol currently works one element at a time and you would like it to cope with multiple elements at one go? Am I right?

Yes, that is right. Initially, I believe this is because of the data format(s) and multiple files that homogen function is taking, this is limiting the possibility to homogenize many elements. if we are suggesting now a 'tidy' format, this might make it easy to deal with multiple elements and multiple stations at a time. I might be wrong about this. But for a start and not to complicate things, If we are able to deal with one element for multiple stations at a time that will be great.

I wonder if we should start by asking to write back to a data frame, even if not the same one?

I am happy with either option, I am not sure what will be easy for him. I think both of these options will make it easy for us to implement his function in R-Instat. It will be good to give him both options and see what he will prefer.

Lastly, I would prefer Danny or you to write to the author. I could initially or follow up with a message introducing myself and mention the work we are doing at KMD and also linked to my thesis.

shadrackkibet commented 4 years ago

In connection with this. I am also looking at what CDT does for homogenization. This week I have met with one of the authors of CDT from Madagascar(Rija Faniriantsoa) in a workshop here at KMD on the use of CDT v6. I am about to discuss with him in detail how CDT does homogenization. He mentioned to me that he implemented homogenization in CDT using changepoint package which documentation can be found here I will be investigating this package as well. @rdstern did you check this package at some point?

rdstern commented 4 years ago

@shadrackkibet that is great. I didn't check changepoint, but was hoping to be more in contact with Rija at some point. I think he is being modest and my impression is that he is the main (only?) programmer of cdt. So great to discuss with him in as much detail as you can.

shadrackkibet commented 4 years ago

Aside from that I also think at some point we should be able to provide facilities to reshape data from CDT format directly to "tidy format". I have found several files here at KMD which are in CDT format. We should be able to handle this better rather than using reshape facilities like stacking.

And this reminds me of another issue that has come to my attention. The new tidyr package has two new functions pivot_longer() and pivot_wider() and Hadley claims that these are more understandable than melt and dcast in reshape2 and spread and gather in tidyr. Perhaps we should rethink about renaming our dialogs? I am not sure how this might improve what we have.

dannyparsons commented 4 years ago

We can easily have a special dialog/option in tidy dialog to convert the CDT format to our standard tidy format. It's a nice idea and a good addition for MET services. Could you provide a "dummy" file in the CDT format? Do not send an actual KMD file but a file in this format with some dummy data possibly?

shadrackkibet commented 4 years ago

Here is a dummy file dummy_CDT_format_data.xlsx

dannyparsons commented 4 years ago

Great, let's separate this out into its own issue to discuss further.

rdstern commented 4 years ago

From a quick look, the changepoint package looks fine, and should be really easy to include in R-Instat. I also like what cdt are doing generally, so we could argue that it has the cdt "seal of approval". Good to discuss further with Rija.

It is not as "automatic" as climatol and in some ways I like that. It does an important step and leaves us to do the rest. I assume it works a station (and element) at a time?

shadrackkibet commented 4 years ago

I suggest the following on this topic (Homogenization). There are several methods which could be used to check for inhomogeneities namely;

Climatol package uses SNHT which is widely used (Reference) as part of the implemented algorithm that also incorporates the use of neighbouring stations but inputs the missing values 'automatically'. As we continue to discuss the implementation of homogen function, I suggest the following as part of our 'halfway' dialogs. Since these are just like hypothesis tests, we add a keyboard (Homogenization) with 6 buttons as follows;

We can use trends package which is available on CRAN and it is maintained. Another one is climtrends package which has most of these functions implemented but it was removed from CRAN. Again, these algorithms do not look complicated we could implement our own.

rdstern commented 4 years ago

I like the idea of the trends package being included in the hypothesis testing dialogue. That has many of the tests. In your list it doesn't have Craddock's test. Is there a good reason for doing more (at this stage) than implementing most (or all) of the tests in the trend package.

Some assume a ts variable, so that is also needed.

I have added the 0.5.6 milestone so we discuss and hopefully make some progress during the code-sprint.

rdstern commented 4 years ago

I have looked in slightly mode detail at the trends package. It is all tests (delivers htest objects). A few of the tests only accept variables of type ts, most also accept a vector. Disappointingly most do not handle missing values. A few just omit the missing values - which seems an obvious route.

Then I looked for similar tests in the DEscTools package. I rather like adding this package to R-Instat for a variety of tasks. It includes Bartels tests, and the runs test and many others. I like and prefer the way those tests are done compared to the trends package.

You mention that the most important is the snht test and there is an snht package. That looks good. It produces a data frame. It may also have a 2 station version, possibly to detect changes if there is another station that is compared at the same time. If so, that could be great.

Then there is the changepoint package. Here is a good-looking R session by the author.

This was before her recent package - changepoint.np and EnvCpt.

I looked briefly at the help inform ation from climtrends (removed from CRAN). I am not sure there is much there that we need.

I suggest this needs more reflection - there is lots to do. Exciting. More to follow!

rdstern commented 4 years ago

@shadrack this all seems to be a useful discussion. I have just proposed a set of hypothesis tests in the Model > Fit Model > One Variable dialogue, #5810. That seems also a perfect place to include some homogeneity functions. In particular I have included shn for the snht test in the trend package. (There is the snht package, which I think permits missing values, but that doesn't return an object of type htest, which is something I like there, at least for consistency.)

Could you check that the other functions I have suggested in this section in #5810 are the important ones, and that I have used the best version.

I am also just about to suggest a new keyboard in the Model > Fit Model > Hypothesis Tests Keyboard for the trend package, so I don't want the list in #5810 to be too long. In addition, I note that above you propose a special dialogue for homogenisation and that's fine too. That would go in the quality control menu, perhaps before boxplot.

shadrackkibet commented 4 years ago

I and Helena discussed this task and agreed to draft a new homogenization dialog which we suggest under Climatic > Check data >Homogenization

This will be a simple dialog. We suggest two broad options at the top (radio buttons). This dialog will make use of changepoint, changepoint.np, climatol, trends, snht, DEscTools R packages in implementation. We use snht as recommended in the WMO guide on homogenization practices.

NOTE We start simply first, we assume data for a single station at a time! and the neighbouring station/satellite station is homogeneous. So we add the station receiver just to show the intent. changepoint package methods do allow missing values, so we ignore missing values.

The output from this dialog will be printed in the output window.

The dialog will be programmed in phases.

Preliminary (This can involve an intern(s) who would like to be introduced to programming R-Instat.)

Univariate case

Multivariate case

The dialog is as shown in the sketch below.

IMG_20200610_151128_558

@rdstern what are your thoughts on this? If this looks sensible then we could start work on this dialog.

rdstern commented 4 years ago

That looks good. In the dialogue you only have a single second receiver. I wonder about 3 buttons at the top with a label on the left saying Station(s). Then the three buttons could perhaps be Single, Neighbouring, Multiple. Alternatively, with 2 buttons, they could be Single and Multiple. Then the second receiver is only visible when Multiple is chosen and is a multiple receiver.

The rest of the dialogue looks good. At the bottom there should be a Save checkbox with label Save Model and a receiver if that is checked. I have not checked what is produced in each case. Maybe the label should be Save Result: I have not checked on the form of the result in each case?

You imply that Helena would also like to become involved in the development of the dialogue, which is great. Am I right? If so, then I suggest a way she could start is as follows: a) Helena looks at the issue you have just created in Github - we will soon add it to taiga. Then @dannyparsons can add her to the list of nominated people. b) Helena downloads and installs Visual Studio and adds the R-Instat project c) She now looks at Wyclife's branch which has added the trend package keyboard. d) She doesn't need to critique the dialogue - that can come later. But she tries out some of the proposed tests. e) (This could be now - before a - d above) In the current version 6.1 Helena adds the changepoint package. She will have to have admin rights when she uses R-Instat. That is usually easy in Windows 10 to do. f) She can then try the changepoint mean, var and mean/var functions by typing the appropriate commands into the Model > Fit > Hypothesis Tests keyboard. g) If that works ok, then those functions might be added to the Trends keyboard? h) As part of this one initial activity is to agree on sample datasets to use to test the homogenisation? What do you suggest? Could that be say 3 stations from KMD, where you might be able to assume that the KIA (or Dagoretti) stations are homogeneous and there could be others near Nairobi to test on. Maybe a time to check with KMD?

That looks like good progress. I am happy to chat with you both with skype whenever you feel it would be useful. I can also help Helena directly with some of the above tasks, so you are free to help her on the bits I can't do.

shadrackkibet commented 4 years ago

I am not sure Helena will like to be involved in programming this dialog at this stage however, she is interested in the R parts. I will be programming the dialog at the moment and happy to involve any other intern.

rdstern commented 4 years ago

The snht package also includes pairwisesnht where there are multiple stations. I suggest that (at least initially, and possibly all we need) could be to add this facility to the current dialogue for both the neighbouring and multiple sites in the following ways: a) If there is a single neighbouring site it will often be the satellite estimate, and hence be another variable. The dialogue needs them stacked, which would be done behind the scenes and then it needs a distance matrix added - which can contain anything here, plus the number of sites, which is always 1. So that will need a bit of R, but should be straightforward. b) If there are multiple stations, then they are likely to be stacked already. There would then need to be another data frame with the locations and the distances would then be worked out and included in the command.

This could be reasonably quick to do. We should check with the WMO guide, but this would anyway be a very good start. If necessary this work could involve @HelenaMwangila and/or @lilyclements particularly for the R calculations.

FONT_1245_DRAFT 2_en-no tracks.docx

Looking quickly at the attached WMO guide implies the use of the snht package could be sufficient for now.