bluefoxr / COINr

COINr
https://bluefoxr.github.io/COINr/
Other
25 stars 8 forks source link

bug in "impute.R" impute_panel part #63

Closed liuyanguu closed 1 month ago

liuyanguu commented 1 month ago

Thank you so much, Will, for the great package. We (the UNICEF data team) are developing a new index and want to transfer it into the COINr framework.

I noticed a minor bug in the stats::approx part, as stats::approx cannot run if there is only one non-NA value. https://github.com/bluefoxr/COINr/blob/0d45107b5358a7dbfa7c52a1921ba37035f9bba0/R/impute.R#L1068

An easy fix is like below. Note that I impute constant value using the only non-NA value if this is the case:

        if(all(is.na(y))){
          message("NOTE: cannot impute for unit ", uCode, " and iCode ", iCode, " because all NA values.")
          next
        }
        if (sum(!is.na(y)) == 1) {
          message("NOTE: only 1 non-NA value for unit ", uCode,
                  " and iCode ", iCode, " impute constant values using the non-NA value.")
          # impute constant
          y_imp <- rep(y[!na_positions], length(y))
        } else {
          # impute with linear, and extremes are imputed with the closest value
          y_imp <- stats::approx(x, y, xout = x, rule = 2, method = imp_type)$y
        }

Please let me know if you would like me to create a pull request --- or maybe you could just easily fix it.

Here is a quick reproducible example:

# copy
library("COINr")
dfp <- ASEM_iData_p

# create NA for GB in 2021
dfp$LPI[dfp$uCode == "GBR" & dfp$Time == 2022] <- NA
dfp$LPI[dfp$uCode == "GBR" & dfp$Time == 2021] <- NA
dfp$LPI[dfp$uCode == "GBR" & dfp$Time == 2020] <- NA
dfp$LPI[dfp$uCode == "GBR" & dfp$Time == 2018] <- NA

ASEMp <- new_coin(dfp, ASEM_iMeta, split_to = "all", quietly = TRUE)

# make purse with fresh panel data
ASEMp <- new_coin(dfp, ASEM_iMeta, split_to = "all", quietly = TRUE)

# extract and plot time series for GBR for indicator LPI
df_plot <- get_data(ASEMp, dset = "Raw", iCodes = "LPI", uCodes = "GBR")
plot(df_plot$Time, df_plot$LPI, type = "b", xlab = "Year", ylab = "LPI for GBR")

# impute- -----

# "constant", "linear", latest are the three imp_type options
ASEMp <- Impute(ASEMp, dset = "Raw", f_i = "impute_panel", f_i_para = list(imp_type = "linear"))
df_plot <- get_data(ASEMp, dset = "Imputed", iCodes = "LPI", uCodes = "GBR")
plot(df_plot$Time, df_plot$LPI, type = "b", xlab = "Year", ylab = "LPI for GBR")

After fixing, the output will be like: image

In the end, I have a quick question: is there a function that can convert the purse back to the panel data? Basically reverse the following process:

ASEMp <- new_coin(dfp, ASEM_iMeta, split_to = "all", quietly = TRUE)

Because now I have added "Imputed" dataset to every coin, wonder if there is a quick way to reconstruct the panel data using the imputed datasets? I wonder after creating the Imputed dataset,

bluefoxr commented 1 month ago

Hi @liuyanguu, thanks for bringing this up.

I'm not able to reproduce the problem you describe though, or perhaps misunderstanding you. My understanding is that you are not able to impute if there is only one non-NA value.

For Impute() we have the three options as I think you know: "latest", "constant", and "linear". The latter two use stats::approx(). The first two are pretty similar (with some small differences described in the documentation), and just copy the latest or earliest values to missing points. The "linear" option however performs a linear interpolation. Logically, for the first two, we only need one non-NA point, whereas for "linear" we would need at least two, since you can't do linear regression with one point.

So following your example:

dfp <- ASEM_iData_p

# remove all points for GBR/LPI except 2019
dfp$LPI[dfp$uCode == "GBR" & dfp$Time %in% c(2018, 2020:2022)] <- NA

# make purse with fresh panel data
ASEMp <- new_coin(dfp, ASEM_iMeta, split_to = "all", quietly = TRUE)

# extract and plot time series for GBR for indicator LPI
df_plot <- get_data(ASEMp, dset = "Raw", iCodes = "LPI", uCodes = "GBR")
plot(df_plot$Time, df_plot$LPI, type = "b", xlab = "Year", ylab = "LPI for GBR")

# confirms only one point in GBR/LPI time series...

# impute- -----

# imputation with "latest"
ASEMp <- Impute(ASEMp, dset = "Raw", f_i = "impute_panel", f_i_para = list(imp_type = "latest"))
df_plot <- get_data(ASEMp, dset = "Imputed", iCodes = "LPI", uCodes = "GBR")
plot(df_plot$Time, df_plot$LPI, type = "b", xlab = "Year", ylab = "LPI for GBR")

I get:

image

# imputation with "constant"
ASEMp <- Impute(ASEMp, dset = "Raw", f_i = "impute_panel", f_i_para = list(imp_type = "constant"))
df_plot <- get_data(ASEMp, dset = "Imputed", iCodes = "LPI", uCodes = "GBR")
plot(df_plot$Time, df_plot$LPI, type = "b", xlab = "Year", ylab = "LPI for GBR")

I get:

image

# imputation with "linear"
ASEMp <- Impute(ASEMp, dset = "Raw", f_i = "impute_panel", f_i_para = list(imp_type = "linear"))
df_plot <- get_data(ASEMp, dset = "Imputed", iCodes = "LPI", uCodes = "GBR")
plot(df_plot$Time, df_plot$LPI, type = "b", xlab = "Year", ylab = "LPI for GBR")

Here I get an error, but this is what I would expect since we cannot do linear interpolation with 1 point. Not sure if you were expecting to be able to impute with the linear method on one point? If so I would just say to switch to "constant", or else if I misunderstood, let me know.

For the last question, the function you are looking for is get_dset() I think. On the example above:

dfi <- get_dset(ASEMp, dset = "Imputed")

If you want the group variables etc, use the also_get argument.

bluefoxr commented 1 month ago

Sorry - actually I think I might have understood your meaning - you don't want the imputation to fail as a result of one or more time series that may only have one observations - whereas there may be others where the imputation is possible. I see your point. Let me get back to you.

liuyanguu commented 1 month ago

Yes, that's what I meant! :) And in such one-observation case, I would suggest just making it constant.

Sorry - actually I think I might have understood your meaning - you don't want the imputation to fail as a result of one or more time series that may only have one observations - whereas there may be others where the imputation is possible. I see your point. Let me get back to you.

bluefoxr commented 1 month ago

OK so I decided to include a new imp_type which reverts to the "constant" method for one-observation series. So you should now run:

ASEMp <- Impute(ASEMp, dset = "Raw", f_i = "impute_panel", f_i_para = list(imp_type = "linear-constant"))
liuyanguu commented 1 month ago

Thank you so much Will! But I get a very strange situation: the new code does work with the example above and everything looks good. But it just doesn't work with my data. It gives the error directly even without those messages:

Error in stats::approx(x, y, xout = x, rule = 2, method = imp_type) : 
  need at least two non-NA values to interpolate

If imp_typ = constant it works normally, but doesn't work with linear or linear-constant I tried many ways and cannot figure out why ... I have attached the purse rds, could you kindly give a try? LCI_raw.zip

LCI_raw <- readRDS("LCI_raw.rds")
LCI_imp <- Impute(LCI_raw, dset = "Raw", f_i = "impute_panel", f_i_para = list(imp_type = "linear-constant"))

Many thanks!!

bluefoxr commented 1 month ago

Hi, yes my bad, I made a little mistake in the code. I have now fixed this - please try the latest version and it should work.

liuyanguu commented 1 month ago

I confirm this is fixed. Thank you so much, @bluefoxr, for accommodating our special request! Much appreciated.