TuomoNieminen / Helsinki-Open-Data-Science

A DataCamp course for the University of Helsinki
https://www.datacamp.com/courses/helsinki-open-data-science

IODS-error-in-joining-datasets-and-SOLUTION-by-Reijo-Sund #8

Open avlehtim opened 2 years ago

avlehtim commented 2 years ago

This concerns the Logistic regression chapter:


The joining of the datasets has not worked correctly, as noticed by Reijo Sund. See his detailed solution in his GitHub repository. This should be corrected in the DataCamp code and instructions, as well as in the RStudio exercise.

Some messages from the IODS2020 forum:

alc.txt data - Exercise 3 Anne P - Monday, 9 November 2020, 14:18 Number of replies: 4 Hi,

we were told today by Reijo "Please note that for Exercise #3 in Datacamp the joining of datasets is not perfect. Please see the following code to see that there are actually 370 unique individuals instead of 382 in the datasets"

If I take the data for the analysis from:

http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt then there are 382 observations of 35 variables. Is it okay to use that data? I did the data wrangling part, but I am not sure if I did it correctly, so I would like to use the data that is actually correct for the analysis :)


Re: alc.txt data - Exercise 3 Reijo Sund - Monday, 9 November 2020, 16:38 If you want to use the data with 370 observations, do the wrangling part as shown in https://github.com/rsund/IODS-project/raw/master/data/create_alc.R. Note that this script actually creates an Excel file, so you may want to save the data as a .txt or .csv instead, or read the Excel file in the analysis part with the readxl::read_excel() function.

If you want to read the wrangled data directly, use the data available in https://github.com/rsund/IODS-project/raw/master/data/alc.csv.

You can also directly load the data in R: alc <- readr::read_csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv")

Please note that for the variables failures, paid, absences, G1, G2, and G3 there are also variables with an extra .p or .m in their names, containing the original values from both datasets; you may consider whether there is a better way to combine them than calculating means (or taking the first values).
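A minimal base-R sketch of the combining step described above, using toy data (the .p/.m column names follow the message; the values and the choice of combining rules are illustrative assumptions, not Reijo's actual code):

```r
# Toy stand-in for the joined data; only two of the duplicated variable
# pairs are shown (absences.p/.m is numeric, paid.p/.m is character).
alc <- data.frame(absences.p = c(2, 4), absences.m = c(6, 4),
                  paid.p = c("yes", "no"), paid.m = c("no", "no"),
                  stringsAsFactors = FALSE)

# Numeric pair: one option is the (rounded) mean of the two courses
alc$absences <- round(rowMeans(alc[, c("absences.p", "absences.m")]))

# Non-numeric pair: one simple option is to take the first value
alc$paid <- alc$paid.p

alc$absences  # 4 4
```

Whether the mean, the first value, or something else is "better" is exactly the judgment call Reijo leaves to the student.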


Re: alc.txt data - Exercise 3 Anne P - Monday, 9 November 2020, 17:52 Thank you for the answer!


Re: alc.txt data - Exercise 3 Andrei K - Wednesday, 11 November 2020, 09:52 Hi! I joined the two datasets by creating a unique ID based on the variables given in the task ("school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"). The same method was used for both the por and mat datasets. Then I excluded duplicates within each dataset and merged the two sets. If a student appeared twice, both observations were removed.

This yields only 358 observations, not 370! If I do NOT exclude duplicates, the number is 382, which matches the task but is not correct according to Monday's meeting and your e-mail.
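The duplicate-removal step described above can be sketched in base R like this (toy data; the key columns are a small subset of the 13 listed in the task, and the row counts are illustrative only, not the real 382/370/358 figures):

```r
# Toy "mat" dataset in which the first two rows share the same identity
# on the key columns, so they cannot be matched unambiguously.
mat <- data.frame(school = c("GP", "GP", "MS"),
                  sex    = c("F",  "F",  "M"),
                  age    = c(17,   17,   16))

# Build a unique ID by pasting the key columns together
key <- do.call(paste, mat[, c("school", "sex", "age")])

# Flag every row whose key occurs more than once, then drop them all
# (both copies of a duplicated student are removed, as described above).
dup <- key %in% key[duplicated(key)]
mat_unique <- mat[!dup, ]
nrow(mat_unique)  # 1
```

Note that with this approach the resulting count depends entirely on which columns go into the key, which is the crux of the disagreement in this thread.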

If we assume that your R code is correct, why does "join_cols" differ from the joining variables given in the task?

The mess in the tasks and the DataCamp code is consuming time =(


Re: alc.txt data - Exercise 3 Reijo Sund - Monday, 16 November 2020, 09:25

Read the metadata related to the datasets. There are some free variables and then common fixed variables. To join the two datasets correctly, you need to take into account all common fixed variables, because there may be duplicate values in subsets of the common fixed variables. That is why unique identifiers, such as the personal identity code in Finland, its pseudonymized version, or a research number, would help a lot in joining datasets.
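In other words, the join keys should be every shared column except the free (course-specific) variables. A base-R sketch under that assumption, with toy one-row datasets (the real data have many more columns, and the .m/.p suffixes here are chosen for illustration):

```r
# Course-specific ("free") variables that differ between the two datasets
free_cols <- c("failures", "paid", "absences", "G1", "G2", "G3")

# Toy math and Portuguese datasets describing the same student
mat <- data.frame(school = "GP", sex = "F", age = 17, G3 = 10)
por <- data.frame(school = "GP", sex = "F", age = 17, G3 = 12)

# Join on ALL common fixed variables; the free variables keep both copies
join_cols <- setdiff(intersect(names(mat), names(por)), free_cols)
alc <- merge(mat, por, by = join_cols, suffixes = c(".m", ".p"))
names(alc)  # "school" "sex" "age" "G3.m" "G3.p"
```

Deriving join_cols from the data like this, rather than hard-coding the 13 variables from the task description, is what avoids the spurious matches discussed above.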

And in data wrangling it is very common that you need to deal with messy datasets. Unfortunately, the DataCamp exercises were constructed before the problem was detected during last year's course. But the task description certainly should be corrected to list the variables that should be used in joining the datasets.

For the actual logistic regression part, any version of the data will be allowed (of course you get slightly different results, but still reasonably close to each other). Actually, it would be an interesting task to compare how much the results change between the different versions of the wrangled data.