Closed: ben-domingue closed this issue 2 weeks ago
This paper includes 2 studies. Study 1 includes 2 datasets, Study1_CFA and Study1_EFA, which hold the data collected for confirmatory factor analysis and exploratory factor analysis, respectively. The 2 datasets use the same measure (set of items) and are therefore merged.
Study 2 includes a dataset with identical items. However, it reuses the same ID for different participants. For example, ID 1 is used 5 times. I am changing the IDs to id_gender_age to uniquely identify each participant. @ben-domingue Please let me know if this assumption is correct. ;)
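One way to sanity-check the composite ID is to count how often each combined value occurs. The snippet below is only a sketch on an invented toy table (the values are not from the GERAS files, and `toy`, `dupes`, and `n_rows` are illustrative names); it assumes the columns are called `VPN`, `gender`, and `age`, as in the script.

```r
library(dplyr)

# Toy stand-in for the Study 2 file: the raw ID (VPN) repeats across people.
# All values here are invented for illustration.
toy <- data.frame(
  VPN    = c(1, 1, 1, 2, 2),
  gender = c(1, 2, 1, 1, 1),
  age    = c(23, 23, 31, 40, 40)
)

# Build the composite ID by pasting VPN, gender, and age together
toy <- toy |>
  mutate(id = paste(VPN, gender, age, sep = "_"))

# Any composite ID that still occurs more than once needs another look
dupes <- toy |>
  count(id, name = "n_rows") |>
  filter(n_rows > 1)

print(dupes)
```

If `dupes` is non-empty on the real data, two participants share the same VPN, gender, and age, and the composite ID alone would not separate them.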
Both studies include 3 subscales: personality, cognition, and activities. I have separated them into 3 different dfs for both datasets.
Data: GERAS_Gruber_2019.csv
Code:
# Paper: https://econtent.hogrefe.com/doi/10.1027/1015-5759/a000528
# Data: https://osf.io/42jhr/
library(dplyr)
library(tidyr)
library(haven)
# ------ Process Study 1 -------
study1_cfa_df <- read_sav("./GERAS_Study1_CFA.sav")
study1_efa_df <- read_sav("./GERAS_Study1_EFA.sav")
study1_df <- rbind(study1_cfa_df, study1_efa_df) # Merge 2 datasets
study1_df <- study1_df |>
  select(-gender) |>
  rename(id = ID)
study1_df <- study1_df |> # Replace encoded missing values with NA
  mutate(across(everything(), ~ replace(.x, .x %in% c(-66, -77, -99), NA)))
# ------ Process Study 2 -------
study2_df <- read_sav("./GERAS_Study2_CFA.sav")
colnames(study2_df) <- gsub("\\s*\\(.*\\)", "", colnames(study2_df))
study2_df <- lapply(study2_df, function(x) { attr(x, "label") <- NULL; x })
study2_df <- as.data.frame(study2_df)
study2_df <- study2_df |> # Build a composite ID because VPN repeats across participants
  mutate(VPN = paste(VPN, gender, age, sep = "_"))
study2_df <- study2_df |>
  select(-gender) |>
  rename(id = VPN)
study2_df <- study2_df |> # Replace encoded missing values with NA
  mutate(across(everything(), ~ replace(.x, .x %in% c(-66, -77, -99), NA)))
# ------ Process Merged Data ------
study1_df$id <- as.character(study1_df$id)
merged_df <- bind_rows(
  study1_df |> mutate(group = "Study 1"),
  study2_df |> mutate(group = "Study 2")
) # Merge datasets from the 2 studies
# Reshape to long format; the result must be assigned, otherwise the saved files stay wide
merged_df <- pivot_longer(merged_df, cols = -c(id, age, group), names_to = "item", values_to = "resp")
save(merged_df, file="GERAS_Gruber_2019.Rdata")
write.csv(merged_df, "GERAS_Gruber_2019.csv", row.names=FALSE)
a few questions/notes:
- all three have 1913 IDs which i'm guessing is more or less the sum of studies 1 and 2 (1466+471 is a little more than 1913 but that's ok). if we're on the same page i think i'm ok with your solution.
I think the total number of participants is correct. However, some IDs are repeated across the 2 studies (and several are repeated within Study 2 alone). To avoid collisions, I encoded the participants' IDs in Study 2 as id_gender_age and added a group column to distinguish IDs that repeat between Study 1 and Study 2.
- the three subscales are all part of the same GERAS measure i think. is that right? if so, i would put them together. when to split and when to lump is more science than art but i think here we want to lump. to give an example, if we had an academic test with math and reading, i'd want to split. if it was a math test with algebra and geometry, i'd want to lump. if all 3 subscales are assessing the same gender attitudes construct, i'd lump (but perhaps have info on the subscales [maybe in item names?]).
Yes, now I understand that we don't split all datasets. This is a good counter-example. :)
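For the "info on the subscales in item names" idea, here is a minimal sketch of how a subscale label could be recovered from the item name after lumping. It assumes a hypothetical naming scheme in which each item name starts with a subscale prefix (`pers_`, `cog_`, `act_`); the actual GERAS column names may differ, and the responses are invented.

```r
# Toy long-format table; item names and responses are invented.
# Assumption: each item name starts with a subscale prefix followed by "_".
long <- data.frame(
  id   = rep(c("a", "b"), each = 3),
  item = rep(c("pers_1", "cog_1", "act_1"), times = 2),
  resp = c(3, 2, 6, 4, 5, 3)
)

# Recover the subscale by dropping everything from the first underscore on
long$subscale <- sub("_.*$", "", long$item)

print(table(long$subscale))
```

This keeps everything in one lumped table while still allowing a per-subscale split later if someone needs it.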
The code and datasets are updated above. :)
PR for this issue: https://github.com/ben-domingue/irw/pull/203
i think this CSV got output as 'wide' rather than 'long'. most of the columns must be the items, yeah?
@KingArthur0205 i think this requires a tweak
This is an oversight on my end. I should have double-checked more carefully. Sorry for the mistake.
I have updated the cell above, the CSV file, and the PR to use long format.
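A quick check along these lines could catch a wide export before it is written out. This is a sketch on a made-up table, not the real data; the column names `item1` and `item2` are placeholders.

```r
library(tidyr)

# Toy wide table; column names and values are made up for illustration
wide <- data.frame(
  id    = c("1", "2"),
  age   = c(20, 30),
  group = "Study 1",
  item1 = c(2, 3),
  item2 = c(4, 5)
)

# pivot_longer() returns a new object, so the result has to be assigned
long <- pivot_longer(wide, cols = -c(id, age, group),
                     names_to = "item", values_to = "resp")

# Long format: one row per person-item pair, with id / item / resp columns
stopifnot(all(c("id", "item", "resp") %in% names(long)))
stopifnot(nrow(long) == nrow(wide) * 2)  # 2 item columns in the toy table
```

Running the same two `stopifnot()` checks on the real table right before `write.csv()` would fail loudly if the reshape step were skipped.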
oh no worries! honestly, finding the occasional error makes me feel like i'm adding value! ;)
the coding is 2-6. double checking that was true in the original data. it's fine if so just wanted to be sure we didn't lose the 1 values. @KingArthur0205
Yes, I just double-checked the original datasets and the paper. The study used a 7-point scale, and the original, unprocessed datasets didn't contain any 1s or 7s. The authors didn't seem to apply any procedure to remove 1s and 7s either.
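The check described here can be sketched as follows: tabulate the raw responses after setting aside the missing-value codes, then look at the observed range. The table below is a toy with invented values; on the real data the same steps would be run on the item columns of the unprocessed .sav files.

```r
# Toy stand-in for the raw item responses; values are invented
raw <- data.frame(
  item1 = c(2, 3, 6, -99),
  item2 = c(4, 5, 2, 6)
)
missing_codes <- c(-66, -77, -99)  # codes treated as missing in the script

# Pool all item responses and drop the missing-value codes
vals <- unlist(raw, use.names = FALSE)
vals <- vals[!vals %in% missing_codes]

observed_range <- range(vals)             # lowest and highest observed response
has_endpoints  <- any(vals %in% c(1, 7))  # TRUE if a scale endpoint occurs

print(table(vals))
```

If `has_endpoints` is FALSE on the raw files and the range matches the processed CSV, the 2-6 coding was already present in the source data rather than introduced by the cleaning step.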
Data and R scripts are published in the Open Science Framework (see https://osf.io/42jhr/).
https://econtent.hogrefe.com/doi/10.1027/1015-5759/a000528