Closed casslbrown closed 7 years ago
Very well formulated question, @casslbrown. This should be easy. In fact, let me take this opportunity to demonstrate how to create a reproducible example. This type of question is expected at www.stackoverflow.com, so getting the form down should empower you to have the advice of more skilled programmers.
First, create a the simplest data object possible. It should have just enough complexity to test all the functionionality that is desired:
ds <- data.frame(
"id" = c(1,1,1,2,2,2,3,3,3,4,4,4),
"time" = c(1,2,3,1,2,3,1,2,3,1,2,3),
"value" = c(4,NA,4,NA,5,5, NA,6,7,NA,NA,NA),
"string" = c("A",NA,"A",NA,"B","B", NA, "C","D",NA,NA,NA)
)
Another, more convenient way ,to offer people whose consult you seek the data object is to use the dput
function to produce the following code:
> dput(ds)
structure(list(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4), time = c(1,
2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3), value = c(4, NA, 4, NA, 5,
5, NA, 6, 7, NA, NA, NA), string = structure(c(1L, NA, 1L, NA,
2L, 2L, NA, 3L, 4L, NA, NA, NA), .Label = c("A", "B", "C", "D"
), class = "factor"), value_fixed = c(4, 4, 4, 5, 5, 5, 6, 6,
6, NA, NA, NA), string_fixed = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, NA, NA, NA), .Label = c("A", "B", "C", "D"), class = "factor")), class = c("tbl_df",
"tbl", "data.frame"), .Names = c("id", "time", "value", "string",
"value_fixed", "string_fixed"), row.names = c(NA, -12L))
In either way, this produces the following data frame:
> ds
id time value string
1 1 1 4 A
2 1 2 NA <NA>
3 1 3 4 A
4 2 1 NA <NA>
5 2 2 5 B
6 2 3 5 B
7 3 1 NA <NA>
8 3 2 6 C
9 3 3 7 D
10 4 1 NA <NA>
11 4 2 NA <NA>
12 4 3 NA <NA>
Note that I have included both numeric and character examples. Also note, that the four individuals, for whom I have created observation offer full range of potential cases. We want to make sure that our function handles all probable cases.
Although there are MANY ways to do anything in R, here's the solution I came up with:
ds <- ds %>%
dplyr::group_by(id) %>%
dplyr::mutate(
value_fixed = sort(unique(value))[1],
string_fixed = sort(unique(string))[1]
) %>%
dplyr::ungroup()
> ds
# A tibble: 12 × 6
id time value string value_fixed string_fixed
<dbl> <dbl> <dbl> <fctr> <dbl> <fctr>
1 1 1 4 A 4 A
2 1 2 NA NA 4 A
3 1 3 4 A 4 A
4 2 1 NA NA 5 B
5 2 2 5 B 5 B
6 2 3 5 B 5 B
7 3 1 NA NA 6 C
8 3 2 6 C 6 C
9 3 3 7 D 6 C
10 4 1 NA NA NA NA
11 4 2 NA NA NA NA
12 4 3 NA NA NA NA
Note that sort()
removes NA
s:
> a <- c(NA,2,NA,4, NA,3)
> sort(a)
[1] 2 3 4
which comes handy in this case.
It appears that this produces exactly the result you've been looking for: it grabs the first (to be more specific, the EARLIEST nonmissing value) and fixes it across all waves.
Let me know if you have any questions and difficulties adapting it to the real data. Good luck!
@casslbrown , actully, i'm wrong. It selects "the smallest of of the values", not "the earliest". I need to think a bit.
@casslbrown , this is the solution you need. First sort on id
and time
with dplyr::arrange
and then exclude the NA
from the vector with setdiff
:
ds <- data.frame(
"id" = c(1,1,1,2,2,2,3,3,3,4,4,4),
"time" = c(1,2,3,1,2,3,1,2,3,1,2,3),
"value" = c(4,NA,4,NA,5,5, NA,7,6,NA,NA,NA),
"string" = c("A",NA,"A",NA,"B","B", NA, "D","C",NA,NA,NA)
)
ds <- ds %>%
dplyr::group_by(id) %>%
dplyr::arrange(id,time) %>%
dplyr::mutate(
value_fixed = setdiff(unique(value),NA)[1],
string_fixed = setdiff(unique(string),NA)[1]
) %>%
dplyr::ungroup()
> ds
# A tibble: 12 × 6
id time value string value_fixed string_fixed
<dbl> <dbl> <dbl> <fctr> <dbl> <chr>
1 1 1 4 A 4 A
2 1 2 NA NA 4 A
3 1 3 4 A 4 A
4 2 1 NA NA 5 B
5 2 2 5 B 5 B
6 2 3 5 B 5 B
7 3 1 NA NA 7 D
8 3 2 7 D 7 D
9 3 3 6 C 7 D
10 4 1 NA NA NA <NA>
11 4 2 NA NA NA <NA>
12 4 3 NA NA NA <NA>
Note also, that I changed the data slightly for id==3
in order to demonstrate that the selection is done despite the sorting order.
Thank you! This worked well and it is good for me to learn the format of a reproducible example.
In 2-laundry-room. R I'd like to force the education variable degree to be the same across waves for each participant similar to this:
but taking the first value only if it is not missing and if it is missing taking the first available non-missing value and forcing it across waves for each hhidpn.