StirlingCodingClub / studyGroup

Gather together a group to skill-share, co-work, and create community
http://StirlingCodingClub.github.io/studyGroup/
Other
2 stars 1 forks source link

Replacing values in one DF with values from another DF by matching #37

Open mattnuttall00 opened 4 years ago

mattnuttall00 commented 4 years ago

Hi all,

What I am trying to do is quite simple, and I thought I knew how to do it, but it's not working...!

I have two dataframes. Below are reproducible dataframes which demonstrate the example:

df1 <- data.frame(year = rep(c(2007,2008,2009,2010,2011), each=2), comm_name = c("a","b",NA,"d","e",NA,"g","h",NA,"j"), code_com = c(1:10))

df2 <- data.frame(comCode = c(1:10), comm_name = letters[1:10])

All I am trying to do is to replace the NAs in df1$comm_name with the value from df2$comm_name, based on the com_code. i.e. code_com in df1 and comCode in df2 are the same variable, so I want to use those values to match the the correct comm_name from df2 to df1.

I have tried: df1$comm_name[is.na(df1$comm_name)] <df2$comm_name[match(df1$code_com,df2$comCode)]

and

df1$comm_name <- ifelse(is.na(df1$comm_name), df2$comm_name[match(df2$comCode,df1$code_com)], df1$comm_name)

and

df1$comm_name <- replace(df1$comm_name, is.na(df1$comm_name), df2$comm_name[match(df1$code_com,df2$comCode)])

and

df1 <- df1 %>% mutate(comm_name = ifelse(is.na(df1$comm_name), df2$comm_name[match(df1$code_com,df2$comCode)], comm_name))

and

df1 <- df1 %>% mutate(comm_name = ifelse(is.na(df1$comm_name), replace(comm_name, is.na(comm_name), df2$comm_name[match(df1$code_com,df2$comCode)]), comm_name))

And none of them are doing what I want. Not sure if I am over-complicating it, or if I am just making some silly mistake in my code....but all suggestions welcome!

Thanks

jejoenje commented 4 years ago

Hi Matt,

I think this does want you want it to do?

df1 <- data.frame(year = rep(c(2007,2008,2009,2010,2011), each=2), 
                  comm_name = c("a","b",NA,"d","e",NA,"g","h",NA,"j"), 
                  code_com = c(1:10))

df2 <- data.frame(comCode = c(1:10), comm_name = letters[1:10])

df1$comm_name = df2$comm_name[match(df1$code_com, df2$comCode)]
df1

Basically, match() returns indexes where it found a match between vector A and B, so you need to use the output of that function as an "indexer" for where you want to place the values. It's a bit odd to get your head around, but have a look at what the output of match(df1$code_com, df2$comCode) is.

Hope that helps...

J

mattnuttall00 commented 4 years ago

Thanks @jejoenje. Am I correct in thinking though that df1$comm_name = df2$comm_name[match(df1$code_com, df2$comCode)] replaces all of the df1$comm_name values with the values from df2$comm_name? How would I then specify that I only wanted it do do that when df1$comm_name values are NA?

jejoenje commented 4 years ago

Ah, yes, it does indeed replace all of them. In the example case, though, that doesn't matter, as your "index" (i.e. df2) has all the necessary values. I assume that for your "real" case, that isn't the case (perhaps only the missing cases)?

The only way I could think of doing this in that case, would be to effectively subset df1 into two, one where df1$comm_Name == NA, and one where df1$comm_Name != NA. Then do the above match using the former subset only, and re-combine the two subsets. Watch your factor levels (if applicable) in that case, though.

The result, though, for the given example would be identical... it would be far easier to just ensure that your "index" (df2) just has all the cases you're trying to match? ... or is my cold-addled brain missing something?

All else fails: for loop and some ifelse statements. But that would, of course, be cheating. :)

anna-deasey commented 4 years ago

I think using a combination of left_join() and coalesce() would do what you want - you get two versions of the col 'comm_name' with the join, then coalesce (merge them together) to get a new col with NA's replaced, and then delete the extra cols you don't need. It didn't work with factors though. Does that help?

df1$comm_name <- as.character(df1$comm_name) df2$comm_name <- as.character(df2$comm_name)

df3 <- left_join(df1, df2, by = c('code_com' = 'comCode')) %>% mutate(comm_name = coalesce(comm_name.x, comm_name.y)) %>% select(-comm_name.x, -comm_name.y)

mattnuttall00 commented 4 years ago

Hey @jejoenje and @anna-deasey - both of those methods work. Thanks very much!