StirlingCodingClub / studyGroup

Gather together a group to skill-share, co-work, and create community
http://StirlingCodingClub.github.io/studyGroup/
Other
2 stars 1 forks source link

Merging Dataframes Using Multiple Columns #20

Open John-Paterson opened 5 years ago

John-Paterson commented 5 years ago

Hi - I'm trying to pull data from one dataframe and append it to another based on matching values in multiple other columns. Currently I'm trying to achieve this through the merge function. My first dataframe looks like this:

> head(percentilesdf)
                     Species Year   5  50  95 ModelYear
1           Baccha elongata. 1981 122 191 305         1
2 Cheilosia albitarsis agg.. 1981 123 151 173         1
3      Cheilosia illustrata. 1981 155 199 235         1
4          Cheilosia pagana. 1981  81 157 253         1
5 Chrysogaster solstitialis. 1981 170 190 298         1
6     Chrysotoxum bicinctum. 1981 164 196 244         1

And my second looks like this:

 >head(climate.data1)
# A tibble: 6 x 6
# Groups:   Year [6]
  date         Day Month  Year  JDay cum.degrees
  <fct>      <int> <int> <int> <int>       <dbl>
1 01/01/1960     1     1  1960     1        9.48
2 01/01/1961     1     1  1961     1        4.71
3 01/01/1962     1     1  1962     1        0   
4 01/01/1963     1     1  1963     1        0   
5 01/01/1964     1     1  1964     1        6.00
6 01/01/1965     1     1  1965     1        4.44

I'm trying to pull the cum.degrees value from the second (climate.data1) df and and join it with the first (percentilesdf) df based on matching values in the Year column (which is shared between both dfs) and matching the '5' column in percentilesdf to the 'JDay' column in climate.data1.

Currently I am trying to do this using a merge function that specifies which columns I want to match from each df using by.x and by.y and which specifies that I want the "cum.degrees" column merged when defining y. Code is as follows:

merge(x = percentilesdf, y = climate.data1[ , "cum.degrees"], by.x = c('Year', '5'), by.y=c('Year', 'JDay'), all.x=TRUE)

However I keep getting error messages saying Error in fix.by(by.x, x) : 'by' must match numbers of columns

or that 'by' must include unique columns. I have looked extensively on the internet for a solution but can't find anything that works.

Am I along the right lines with this code or is there a different way to achieve this?

Any help appreciated as I've spent more time than I care to admit trying to sort this out!

mattnuttall00 commented 5 years ago

Hey John, I'm certainly no expert, but to tackle the first desired merge have you tried a simple merge to create a new df like this:

new_df <- merge(percentilesdf, climate.data1,by="year")

resulting in a merged dataframe which you can then drop any columns you don't need. That should pull in all the columns from climate.data1 (including cum.degrees) into the new df.

For the second bit, I'm not sure what you mean by wanting to match '5' and 'JDay'. With the above merge, both of those columns should be in the new df anyway (I think!?)

Have you looked into left_join() from dplyer? There's a bunch of handy _join() functions in dplyer, and the help page is actually quite helpful!

Matt

John-Paterson commented 5 years ago

Thanks Matt,

My issue is primarily that I'm not looking to just join the dataframes together (they're different lengths anyway), I'm wanting to pull a column value from the second one based on whether values from two other columns match those in the original dataframe. Basically analogous to a Vlookup in Excel, if that means anything to anyone!

When I said I wanted to match '5' and 'JDay' I meant that I only wanted to merge the 'cum.degrees' column from the second dataframe if both the 'Years' columns matched across the dataframes AND the value in the '5' and 'JDay' columns match. It would look something like this:

>DF1

                   Species   Year   5  50  95 ModelYear
1           Baccha elongata. 1981 117 191 305         1
2           Baccha elongata. 1995 101 151 173         1
3          Cheilosia pagana. 1995 117 199 235         1
4          Cheilosia pagana. 1995 101 199 235         1

>DF2

  date         Day Month  Year   JDay cum.degrees
1 18/04/1981    18     4  1981     95        59.48
2 12/05/1981    12     5  1981     117       104.71
3 12/05/1995    12     5  1995     117       86.1  
4 18/04/1995    18     4  1995     95        43.23   
5 19/04/1995    19     4  1995     96        46.00
6 26/04/1995    26     4  1995     101      54.44

>MergedDFs

                   Species   Year   5  50  95 ModelYear cum.degrees
1           Baccha elongata. 1981 117 191 305         1 104.71
2           Baccha elongata. 1995 101 151 173         1 54.44
3          Cheilosia pagana. 1995 117 199 235         1 86.1
4          Cheilosia pagana. 1995 101 199 235         1 54.44

Don't know if that makes it clearer? I have had a look through the dplyr page but will check it out again!

Cheers,

John

John-Paterson commented 5 years ago

So I've managed to circumvent this problem by creating a column in each dataframe containing a unique reference combining both Year and '5' (in the first dataframe) and 'JDay' (in the second dataframe). I've then merged them based on that.

Still don't really know why the original code didn't work though so if anyone's got any answers it'd be good to know for future reference.

anna-deasey commented 5 years ago

I think left_join would do it - not sure if this code would exactly work - but something along these lines probably would. I think in recent updates dplyr have added even more join options - could be a one stop shop for all your joining needs :)

left_join(df1, df2, by = Year, c("JDay" = "5"))

John-Paterson commented 5 years ago

Thanks Anna, just had to change the code a wee bit to:

left_join(df1, df2, by = c("Year", "JDay" = "5"))

And it worked a treat. With 'Year' outside the brackets it didn't look for both columns to match, it just matched them based on Year. I should have listened to Matt - Tidyverse is always easier!

Thanks all.

mattnuttall00 commented 5 years ago

@jejoenje showed me another quite snazzy method for doing a similar thing to what you wanted. I know you've solved your issue but this may be of interest/use to people in the future. The code below won't be bang on as I'm not sure how you'd do it when you need to match more than one column, but you'll get the idea

percentilesdf$cum.degrees <- climate.data1$cum.degrees[match(percentilesdf$year, climate.data1$year)]