IALSA / HRS

Shaping data from the Health and Retirement Study.
GNU General Public License v2.0
5 stars 2 forks source link

Grabbing first non-missing value and forcing it across all waves within same hhidpn #9

Closed casslbrown closed 7 years ago

casslbrown commented 7 years ago

In 2-laundry-room. R I'd like to force the education variable degree to be the same across waves for each participant similar to this:

screen shot 2017-03-12 at 5 06 12 pm

but taking the first value only if it is not missing and if it is missing taking the first available non-missing value and forcing it across waves for each hhidpn.

andkov commented 7 years ago

Very well formulated question, @casslbrown. This should be easy. In fact, let me take this opportunity to demonstrate how to create a reproducible example. This type of question is expected at www.stackoverflow.com, so getting the form down should empower you to have the advice of more skilled programmers.

First, create a the simplest data object possible. It should have just enough complexity to test all the functionionality that is desired:

ds <- data.frame(
  "id" = c(1,1,1,2,2,2,3,3,3,4,4,4),
  "time" = c(1,2,3,1,2,3,1,2,3,1,2,3),
  "value" = c(4,NA,4,NA,5,5, NA,6,7,NA,NA,NA),
  "string" = c("A",NA,"A",NA,"B","B", NA, "C","D",NA,NA,NA)
)

Another, more convenient way ,to offer people whose consult you seek the data object is to use the dput function to produce the following code:

> dput(ds)
structure(list(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4), time = c(1, 
2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3), value = c(4, NA, 4, NA, 5, 
5, NA, 6, 7, NA, NA, NA), string = structure(c(1L, NA, 1L, NA, 
2L, 2L, NA, 3L, 4L, NA, NA, NA), .Label = c("A", "B", "C", "D"
), class = "factor"), value_fixed = c(4, 4, 4, 5, 5, 5, 6, 6, 
6, NA, NA, NA), string_fixed = structure(c(1L, 1L, 1L, 2L, 2L, 
2L, 3L, 3L, 3L, NA, NA, NA), .Label = c("A", "B", "C", "D"), class = "factor")), class = c("tbl_df", 
"tbl", "data.frame"), .Names = c("id", "time", "value", "string", 
"value_fixed", "string_fixed"), row.names = c(NA, -12L))

In either way, this produces the following data frame:

> ds
   id time value string
1   1    1     4      A
2   1    2    NA   <NA>
3   1    3     4      A
4   2    1    NA   <NA>
5   2    2     5      B
6   2    3     5      B
7   3    1    NA   <NA>
8   3    2     6      C
9   3    3     7      D
10  4    1    NA   <NA>
11  4    2    NA   <NA>
12  4    3    NA   <NA>

Note that I have included both numeric and character examples. Also note, that the four individuals, for whom I have created observation offer full range of potential cases. We want to make sure that our function handles all probable cases.

Although there are MANY ways to do anything in R, here's the solution I came up with:

ds <- ds %>% 
  dplyr::group_by(id) %>% 
  dplyr::mutate(
    value_fixed =  sort(unique(value))[1],
    string_fixed = sort(unique(string))[1]
  ) %>% 
  dplyr::ungroup()
> ds
# A tibble: 12 × 6
      id  time value string value_fixed string_fixed
   <dbl> <dbl> <dbl> <fctr>       <dbl>       <fctr>
1      1     1     4      A           4            A
2      1     2    NA     NA           4            A
3      1     3     4      A           4            A
4      2     1    NA     NA           5            B
5      2     2     5      B           5            B
6      2     3     5      B           5            B
7      3     1    NA     NA           6            C
8      3     2     6      C           6            C
9      3     3     7      D           6            C
10     4     1    NA     NA          NA           NA
11     4     2    NA     NA          NA           NA
12     4     3    NA     NA          NA           NA

Note that sort() removes NAs:

> a <- c(NA,2,NA,4, NA,3)
> sort(a)
[1] 2 3 4

which comes handy in this case.

It appears that this produces exactly the result you've been looking for: it grabs the first (to be more specific, the EARLIEST nonmissing value) and fixes it across all waves.

Let me know if you have any questions and difficulties adapting it to the real data. Good luck!

andkov commented 7 years ago

@casslbrown , actully, i'm wrong. It selects "the smallest of of the values", not "the earliest". I need to think a bit.

andkov commented 7 years ago

@casslbrown , this is the solution you need. First sort on id and time with dplyr::arrange and then exclude the NA from the vector with setdiff:

ds <- data.frame(
  "id" = c(1,1,1,2,2,2,3,3,3,4,4,4),
  "time" = c(1,2,3,1,2,3,1,2,3,1,2,3),
  "value" = c(4,NA,4,NA,5,5, NA,7,6,NA,NA,NA),
  "string" = c("A",NA,"A",NA,"B","B", NA, "D","C",NA,NA,NA)
)

ds <- ds %>% 
  dplyr::group_by(id) %>% 
  dplyr::arrange(id,time) %>% 
  dplyr::mutate(
    value_fixed =  setdiff(unique(value),NA)[1],
    string_fixed = setdiff(unique(string),NA)[1]
  ) %>% 
  dplyr::ungroup()
> ds
# A tibble: 12 × 6
      id  time value string value_fixed string_fixed
   <dbl> <dbl> <dbl> <fctr>       <dbl>        <chr>
1      1     1     4      A           4            A
2      1     2    NA     NA           4            A
3      1     3     4      A           4            A
4      2     1    NA     NA           5            B
5      2     2     5      B           5            B
6      2     3     5      B           5            B
7      3     1    NA     NA           7            D
8      3     2     7      D           7            D
9      3     3     6      C           7            D
10     4     1    NA     NA          NA         <NA>
11     4     2    NA     NA          NA         <NA>
12     4     3    NA     NA          NA         <NA>

Note also, that I changed the data slightly for id==3 in order to demonstrate that the selection is done despite the sorting order.

casslbrown commented 7 years ago

Thank you! This worked well and it is good for me to learn the format of a reproducible example.