DS4PS / cpp-526-spr-2021

Course shell for Foundations of Data Science I
https://ds4ps.org/cpp-526-spr-2021/
MIT License

Lab 6 - Q2 #15

Open AhmedRashwanASU opened 3 years ago

AhmedRashwanASU commented 3 years ago

Any recommended videos to clarify the concepts would be highly appreciated, @jamisoncrawford.

jamisoncrawford commented 3 years ago

@AhmedRashwanASU recommended videos for join/merge functions? (This week's material?).

AhmedRashwanASU commented 3 years ago

@AhmedRashwanASU recommended videos for join/merge functions? (This week's material?).

Yes Sir %>%

AhmedRashwanASU commented 3 years ago

Lab 6 - Q2

Using the code below, I tried to produce the same result two ways; however, the results seem different. Could you please help explain the differences between them? Also, I used compound ID keys. Was this a correct step to be more precise?

Salaries %>% right_join(Teams)

image

fulldata <- merge(Salaries, Teams, by = c("teamID", "yearID"), all.y = TRUE)

full data

image

jamisoncrawford commented 3 years ago

Recommended videos are a bit older but they should help you, especially with the lab assignment - there are 3 you can find in this playlist:

https://youtube.com/playlist?list=PLdoNxtuC-qSMzLlq93u_e6tTUI8fwFdux

For your merge() call, you didn't include the common variable lgID in `by`. R saw the same variable name in both tables, but because you didn't specify that they were, indeed, the same variable, it created an "x" and a "y" version (lgID.x and lgID.y), and likely duplicated many rows!

The beauty of dplyr joins is that they automatically detect the common variable names and print a message stating which variables they joined by. Does this help?
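To make the "x" and "y" behavior concrete, here is a minimal sketch in base R. The `salaries` and `teams` data frames are made-up stand-ins for the lab tables (the values and the `playerID`, `salary`, and `W` columns are assumptions for illustration only). In this toy data both calls return the same number of rows, and the visible difference is the lgID.x / lgID.y pair; whether rows are also duplicated depends on the data.

```r
# Hypothetical stand-ins for the Salaries and Teams tables (made-up values)
salaries <- data.frame(
  yearID   = c(2000, 2000, 2001),
  teamID   = c("ATL", "ATL", "BOS"),
  lgID     = c("NL", "NL", "AL"),
  playerID = c("p1", "p2", "p3"),
  salary   = c(100, 200, 300)
)
teams <- data.frame(
  yearID = c(2000, 2001, 2001),
  teamID = c("ATL", "BOS", "ATL"),
  lgID   = c("NL", "AL", "NL"),
  W      = c(95, 82, 88)
)

# lgID is left out of `by`, so merge() keeps both copies:
# lgID.x comes from salaries (x) and lgID.y comes from teams (y)
partial <- merge(salaries, teams, by = c("teamID", "yearID"), all.y = TRUE)

# With `by` omitted, merge() uses every column name the two tables share
# (yearID, teamID, lgID), so a single lgID column is returned
full <- merge(salaries, teams, all.y = TRUE)

names(partial)
names(full)
c(nrow(partial), nrow(full))
```

Comparing `names(partial)` with `names(full)` is a quick way to spot a shared variable that was left out of the key.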

AhmedRashwanASU commented 3 years ago

Yup. 1- So one of the primary keys between the two sets was lgID? 2- I was supposed to use more common variables, such as lgID, in my second code in order to get the desired results? 3- Is the primary key = common variable?

jamisoncrawford commented 3 years ago

Sorry for the delay on this, though when I'd initially read it I felt confident you understood! 🔥

Just to reiterate and reconfirm:

1- so one of the primary keys between the two sets was lgID?

Correct! More or less, though we could get into the differences between primary, secondary, and foreign keys. For our purposes, this more than suffices.

2- I was supposed to use more common variables, such as lgID, in my second code in order to get the desired results?

That's exactly right. Since lgID (league ID, I assume) has the same variable name in both data tables, but you've not specified in merge() that they are, in fact, the same variables and should be matched, R will treat them as different variables that coincidentally have the same name. In so doing, R renames them to differentiate them: lgID.x and lgID.y.

Now here is where things get out of hand: since these are seen as different variables, lgID.x (from Salaries, the first table in your merge() call) and lgID.y (from Teams, the second) sit in records that otherwise share the same information, like the common variables teamID and yearID. Normally, these would merge to create a single, combined record, since the variables that exist in both sets are specified as being shared. However, since there is a "different" (according to your merge() specifications) value for lgID.x and lgID.y, it's now going to create two records instead of a single unified record, or at least that's what I suspect is happening (or would happen).

Imagine that the families of two life partners, Stefan and Fatimah, are going to have a family reunion. Stefan is related to Fatimah's family by being a brother- and son-in-law, as well as an uncle, etc. Fatimah is related to Stefan's family by being a sister- and daughter-in-law, as well as an aunt, etc. In order to merge them for our family reunion, we must specify that Fatimah is a member of both families (and thus shared), and that Stefan is also a member of both families, and thus shared. This is exactly what happens with Fatimah because, well, everyone loves Fatimah. Stefan, however, is quite forgettable.

Ordinarily, we would see a single family getting together for this reunion. However, Stefan wasn't identified by both families as being a common member. Stefan's family knows he's a part of them, but Fatimah's family basically forgot about him. When Fatimah's family realizes that, indeed, some person named Stefan is also part of their family, they need to account for the family reunion now having two Stefans, so they designate a Stefan.x and a Stefan.y.

This creates an issue, because, where there would normally be one unified family, there are now two families. One is the exact same family except with Stefan.x, the other the exact same family with Stefan.y. As a result, there is now a duplicate of every family member. There are now two Aunt Aishas, two Uncle Anders, two Cousin Hannes, etc.

That was a really long and roundabout way to say: Failing to identify shared variables in a merge will result in duplication!
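As a rough illustration of when that duplication actually shows up, here is a small sketch with made-up data (the `salaries` and `teams` frames below are hypothetical, not the lab tables). Rows multiply when the variables named in `by` no longer identify a single row on the other side; in this example that happens when yearID is left out of the key. If the omitted variable isn't needed to tell rows apart, the merge would only gain .x/.y columns instead.

```r
# Hypothetical data: one player on one team across two seasons
salaries <- data.frame(
  teamID   = c("ATL", "ATL"),
  yearID   = c(2000, 2001),
  playerID = c("p1", "p1"),
  salary   = c(100, 150)
)
teams <- data.frame(
  teamID = c("ATL", "ATL"),
  yearID = c(2000, 2001),
  W      = c(95, 88)
)

# Incomplete key: every ATL salary row matches every ATL team row,
# and the unmatched yearID columns come back as yearID.x and yearID.y
by_team <- merge(salaries, teams, by = "teamID")
nrow(by_team)       # 4 rows: 2 salary rows x 2 team rows

# Complete key: each salary row matches exactly one team row
by_team_year <- merge(salaries, teams, by = c("teamID", "yearID"))
nrow(by_team_year)  # 2 rows
```

Comparing `nrow()` before and after a merge is a simple check for this kind of unintended row multiplication.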

3- is the primary key = common variable ?

Yep! I mean "common" as if to say "shared" or "a commonality".

AhmedRashwanASU commented 3 years ago

Sadly, Stefan has no luck being part of the data devotees team.

This is a clear example, thank you!!