IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Climatic summaries and censored data #4112

Open rdstern opened 7 years ago

rdstern commented 7 years ago

We had a problem with the end of the season in Guyana. We used the water balance and asked for the date of the end of the season between mid-June and the end of August. Most of the results were missing.

The result is currently confusing: there is no way to distinguish between missing data (missing days) in the calculations and the event not happening.

This has always been an issue and the old Instat gave the day number as 0 (zero) when the event did not occur. It gave missing when there were missing data - or at least it should have done!

In R-Instat we should do this properly!

It is simply an example of right-censored data, which is very common with survival data (studies where the time of death is recorded, but some people may still be alive at the end of the study). So we should have the option to store the results as censored data are commonly stored.

This means adding a second column. In the Surv function in the survival package this column is described partly as:

event: The status indicator, normally 0=alive, 1=dead. Other choices are TRUE/FALSE (TRUE = death) or 1/2 (2=death).

In our case having a result (date) is the dead category - that's when you have a value. So, we could generate a second logical column, where the NA (in our end dates) are given as 0 and the actual dates are 1.

I would slightly have preferred the other way round, but suggest we start with this. This column would have 3 values, namely 1 (end date found in the year), 0 (no end date found, hence NA in the main result column), and NA (as in the main result column) if there are missing values in the data.
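To make the three cases concrete, here is a minimal sketch in Python (an illustration only, not R-Instat code). It follows the survival-style coding described above, where a found date gives status 1 and "no date found" gives status 0. The function name `first_event_with_status`, the `is_event` rule, and the choice to return a missing result as soon as a missing day is met before any event are all assumptions made for this example.

```python
from typing import Callable, Optional, Sequence, Tuple


def first_event_with_status(
    values: Sequence[Optional[float]],
    first_day: int,
    is_event: Callable[[float], bool],
) -> Tuple[Optional[int], Optional[int]]:
    """Return (day_of_year, status) for one year's search window.

    status 1    -> event found; day_of_year is its day number
    status 0    -> window complete but the event never happened (censored)
    status None -> a missing day was met first, so the result is unknown
    """
    for offset, value in enumerate(values):
        if value is None:
            # Assumption: a missing day before any event makes the result missing.
            return None, None
        if is_event(value):
            return first_day + offset, 1
    # Every day was observed and none qualified: right-censored.
    return None, 0


# Hypothetical event rule: the water balance has dropped to zero.
def water_balance_empty(balance: float) -> bool:
    return balance <= 0


# Window starting on day 166 (15 June in a non-leap year).
found = first_event_with_status([12.0, 5.0, 0.0, 0.0], 166, water_balance_empty)
censored = first_event_with_status([12.0, 5.0, 3.0], 166, water_balance_empty)
missing = first_event_with_status([12.0, None, 0.0], 166, water_balance_empty)
```

Here `found` is `(168, 1)`, `censored` is `(None, 0)` and `missing` is `(None, None)`, so the main column alone cannot separate the last two cases, while the status column can.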

In addition, when we are able to cope with multiple missing value codes, we will have a different tagged code for these special NAs. The two cases will then be detectable in the single column, so generating this second column becomes optional. So we make it an option from now on.

This is needed for start of rains and end of the rains dialogues.

It could be a checkbox with the label No Date Found Variable, Add Status Variable, or just Status (I think I prefer the second). The default is checked. If checked, it produces an additional logical column with the same name as the main column plus the suffix _s (s for status).

rdstern commented 6 years ago

There is a (sort-of) related issue, which is to add an option to the same commands (start, end and day of extreme) that also provides a further date column. So the main column is (as now) the day-of-year, but there is a checkbox "Add Date column", and this adds a further column, which is the date (including the year, i.e. a proper date!) of the start and/or end.
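As a rough illustration of what the "Add Date column" option would compute, the sketch below (Python, not R-Instat code) turns a year and a day-of-year into a proper date. The helper name `doy_to_date` is hypothetical, and it assumes the day-of-year is the ordinary calendar day number, with 1 January as day 1. If the dialog uses a different day numbering, the conversion would need adjusting.

```python
from datetime import date, timedelta
from typing import Optional


def doy_to_date(year: int, day_of_year: Optional[int]) -> Optional[date]:
    """Convert a calendar day-of-year (1 = 1 January) into a full date.

    A missing day-of-year (None) stays missing in the date column.
    """
    if day_of_year is None:
        return None
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)


normal_year = doy_to_date(2017, 166)  # 15 June 2017
leap_year = doy_to_date(2016, 166)    # 14 June 2016, one day earlier in a leap year
no_event = doy_to_date(2017, None)    # stays missing
```

The leap-year case shows why a proper date column is useful alongside the day-of-year: the same day number falls on different calendar dates in different years.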

rdstern commented 6 years ago

I have kept this for the discussion and also because the topic is currently being worked on.

rdstern commented 7 months ago

@lilyclements I am returning here now, partly because the topic was mentioned with CIMH when the end of the season was missing. And (happily for me) it returned again when discussing e-picsa!

In 2017 Danny said we could usefully wait until a promised enhancement in dplyr was implemented. So we should be ok now.