Aging and Autism Study Data Repository & Codebook

ben-domingue commented 2 months ago

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8825232/

KingArthur0205 commented 2 months ago

I believe this is the correct link (https://osf.io/dejgn/) to the data for the paper above. The link in the above cell, which is also included in the paper, leads to a more comprehensive data repository. I will first finish the datasets for this paper and then move on to process the larger datasets.

Edit: After examining the data for this paper, I believe it is already included as part of the larger datasets. Hence, I believe it would make sense to directly process the larger datasets.

Code to examine common rows:

library(dplyr)
library(tidyr)
library(haven)

df <- read.csv("./aqhealth.csv")
df2 <- read.csv("./aqALL.csv")

selected_columns <- grep("^aq([1-9]|[1-4][0-9]|50)$", names(df), value = TRUE)
selected_columns2 <- grep("^aq([1-9]|[1-4][0-9]|50)$", names(df2), value = TRUE)

# Subset the data frame
df_filtered2 <- df2[, selected_columns2]
# Subset the data frame with the selected columns
df_filtered <- df[, selected_columns]

common_rows <- merge(df_filtered, df_filtered2)
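
A minimal sketch of a stricter overlap check, using made-up stand-ins for the two files (the toy values and the helper `row_key` are assumptions, not part of the original script). With no `by` argument, `merge()` joins on all shared column names and repeats a row once per matching duplicate, so counting how many rows of the smaller table appear in the larger one can be easier to read:

```r
# Toy stand-ins for aqhealth.csv and aqALL.csv; real columns and values differ.
small <- data.frame(aq1 = c(1, 0, 1), aq2 = c(0, 0, 1))
large <- data.frame(aq1 = c(1, 0, 1, 0), aq2 = c(0, 0, 1, 1))

# Item columns present in both tables.
shared <- intersect(names(small), names(large))

# Hypothetical helper: one string key per row built from the shared columns.
# paste() writes NA as "NA", so missing values compare as equal here.
row_key <- function(df, cols) {
  do.call(paste, c(df[, cols, drop = FALSE], sep = "|"))
}

# TRUE for each row of `small` that also appears somewhere in `large`.
found <- row_key(small, shared) %in% row_key(large, shared)
n_found <- sum(found)
share_found <- mean(found)
```

If `share_found` is 1, every response pattern in the smaller file also occurs in the larger one. That supports, but does not prove, that the same participants are included.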

KingArthur0205 commented 2 months ago

@ben-domingue Could you clarify this for me please?

If a participant’s responses are entirely NAs, should we delete that row since it doesn’t provide any useful information? The large data repository on Autism contains a large number of NAs.
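
To gauge how much data such a rule would drop, one could count the all-NA rows first. A rough sketch on invented data (the column names `item1` and `item2` are placeholders):

```r
# Hypothetical responses: participant 2 skipped every item.
responses <- data.frame(
  id = c(1, 2, 3),
  item1 = c(3, NA, 1),
  item2 = c(NA, NA, 4)
)

# Treat every column except the identifier as an item.
item_cols <- setdiff(names(responses), "id")

# A row is "all NA" when its count of missing items equals the number of items.
all_na <- rowSums(is.na(responses[, item_cols, drop = FALSE])) == length(item_cols)

n_all_na <- sum(all_na)
ids_all_na <- responses$id[all_na]
```

Participant 1 has one missing item but is kept, since only fully empty rows are flagged.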

KingArthur0205 commented 2 months ago

Data will continue to be released and the code will come afterwards. :)

Dataset 1: Autism-Spectrum Quotient Scale & Attachment
AASDR_Lodi-Smith_2021_AQ.csv
This dataset consists of 50 questionnaire items with responses from 1,139 participants. It includes two sets of responses, aqx and aqxcont, where x runs from 1 to 50.

aqx and aqxcont are responses to the same set of questions. aqx is derived from aqxcont by mapping the 4-point scale in aqxcont to a 2-point scale in aqx. Specifically, responses 1 and 2 in aqxcont are mapped to 0 in aqx, and responses 3 and 4 are mapped to 1. The processed responses use the aqxcont columns and are therefore on a 4-point scale.
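
The mapping described above can be sketched as follows. The vector is made up for illustration, and this only mirrors the rule as stated here, not necessarily how the original authors coded it:

```r
# Made-up 4-point responses for one item, including a missing value.
aq_cont <- c(1, 2, 3, 4, NA)

# Responses 1 and 2 map to 0; responses 3 and 4 map to 1; NA stays NA.
aq_binary <- ifelse(aq_cont >= 3, 1, 0)
```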

Attachment Dataset: There is also an additional dataset for the aging group that has 37 items. The participants are the same as those in the dataset above, and the two are merged into one file because both are autism-related measures. 205 out of 378 participants didn't attempt this attachment questionnaire and were thus removed.

Dataset 2: Alexithymia
AASDR_Lodi-Smith_2021_Alexithymia.csv
This dataset consists of 20 items and 143 participants. The alexmean column is deleted. All responses are on a 4-point scale.

Dataset 3: Big Five Inventory-2
AASDR_Lodi-Smith_2021_BFI2.csv
This dataset consists of 325 participants and 60 items. It also includes columns ending with _r, which hold the reversed versions of participants' actual responses and were thus deleted. All responses are on a 5-point scale.

Dataset 4: BRFSS
This dataset is a bit complicated because it contains inconsistent scales. Will come back to it later.

Dataset 5: Clinical Personality REDCap Survey
AASDR_Lodi-Smith_2021_CPS.csv
This dataset consists of 188 participants and 25 items on a 4-point scale. Some columns of aggregate statistics were deleted.

Dataset 6: Dark Triad
AASDR_Lodi-Smith_2021_DT.csv
This dataset consists of 544 participants and 27 items on a 5-point scale. However, 257 out of 544 participants had all NA responses and were thus deleted from the dataset.

Dataset 7: Desired Trait Change (TIPI)
AASDR_Lodi-Smith_2021_TIPI.csv
This dataset includes 412 participants and 10 items measured on a 5-point scale. The original scale ranged from -2 to 2, but to ensure consistency, it has been shifted to a 1 to 5 scale. There are 29 participants with all NA responses.
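
Shifting -2..2 onto 1..5 is a constant offset of 3. A small sketch on invented values (the variable names are placeholders):

```r
# Made-up responses on the original -2..2 scale, with one missing value.
tipi_raw <- c(-2, -1, 0, 1, 2, NA)

# Offset that moves the lowest category (-2) to 1.
offset <- 1 - (-2)

# Adding the offset keeps the spacing between categories; NA stays NA.
tipi_shifted <- tipi_raw + offset
```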

Dataset 8: Goals
AASDR_Lodi-Smith_2021_Goals.csv
This dataset includes 802 participants with 23 items on a 4-point scale. 263 out of 802 participants had all NA responses and were excluded.

Dataset 9: Grit
AASDR_Lodi-Smith_2021_Grit.csv
This dataset includes 378 participants with 8 items on a 5-point scale. 190 out of 378 participants had all NA responses and were thus excluded.

Dataset 10: Loneliness
AASDR_Lodi-Smith_2021_Loneliness.csv
This dataset only includes responses for 3 items of a larger questionnaire, focusing on the evaluation of loneliness in the aging group. It includes 378 participants and 3 items on a 3-point scale. 58 participants had all NA responses and were excluded.

Dataset 11: PROMIS
AASDR_Lodi-Smith_2021_PROMIS.csv
This dataset includes 29 items and 378 participants on a 5-point scale. 25 participants had all NA responses and were excluded.

Dataset 12: Ryff
AASDR_Lodi-Smith_2021_RYFF.csv
This dataset includes 15 items and 337 participants on a 6-point scale.

Dataset 13: Purpose
AASDR_Lodi-Smith_2021_Purpose.csv
This dataset includes 6 items and 802 participants on a 5-point scale. 257 participants had all NA responses and were excluded.

Dataset 14: Resilience
AASDR_Lodi-Smith_2021_Resillience.csv
This dataset includes 6 items and 378 participants on a 6-point scale. 232 participants had all NA responses.

Dataset 15: Satisfaction with Life
AASDR_Lodi-Smith_2021_SWLS.csv
This dataset includes 5 items (swls1 to swls5, as selected in the processing code) and 335 participants on a 7-point scale.

Dataset 16: Self-Concept Clarity
AASDR_Lodi-Smith_2021_SCC.csv
The questionnaire includes 22 items, split between this dataset and dataset 17 below. This dataset includes the 12 questions from the 1st section of the questionnaire, with 802 participants on a 5-point scale. 118 participants had all NA responses.

Dataset 17: Self-Esteem
AASDR_Lodi-Smith_2021_RSE.csv
This dataset includes 10 items and 802 participants on a 4-point scale. 136 participants had all NA responses.

Dataset 18: Social Camouflage
AASDR_Lodi-Smith_2021_CATQ.csv
This dataset includes 25 items and 136 participants on a 7-point scale.

Dataset 19: Social Investment
AASDR_Lodi-Smith_2021_SI.csv
This dataset includes 28 items and 378 participants on a 6-point scale. 195 participants had all NA responses and were excluded from the dataset.

ben-domingue commented 2 months ago

@KingArthur0205 yeah we can generally delete NA responses (esp when they occur in large numbers)

and good eye on the larger dataset!

KingArthur0205 commented 2 months ago

@ben-domingue This repository includes various measures for three distinct populations: Aging, Students, and MTurk. I suggest maybe creating separate datasets for each measure, as they target different aspects of the participants.

ben-domingue commented 2 months ago

oof. this is big. i might need more time to dig into it than i have right now but i'm also happy with you attempting a cut. my general feeling is that:

we can collapse different samples into the same dataset when the measure is the same
we split the same sample into different datasets when the measure is different

so i'd cut out the demographics here but, as some other examples:

ASQ scale: put them all together.
ten item: have students & mturkers together. and so on. honestly, i'd be happy just doing what this table suggests.
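
A rough sketch of the "same measure, different samples" rule, assuming each sample has already been reshaped to id/item/resp. The data and the sample prefixes are invented; prefixing ids is an assumption made here to avoid collisions, not something the thread specifies:

```r
# Made-up long-format responses to the same item from two samples.
students <- data.frame(id = c(1, 2), item = "tipi1", resp = c(4, 2))
mturk <- data.frame(id = c(1, 2), item = "tipi1", resp = c(5, 3))

# Assumption: ids may repeat across samples, so tag each with its sample name.
students$id <- paste0("student_", students$id)
mturk$id <- paste0("mturk_", mturk$id)

# Same measure, so the samples are stacked into a single dataset.
combined <- rbind(students, mturk)
```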

ben-domingue commented 2 months ago

sweet dataset though!

KingArthur0205 commented 2 months ago

@ben-domingue Finished processing this repository. Code and data below. I have also updated detailed notes of each dataset in the cell above. :)

I haven't processed dataset 4 yet because it includes inconsistent scales, FAQs, etc.

Zipped Version: processed data.zip

Code:

# Paper:
# Data: https://osf.io/mwszy/
library(dplyr)
library(tidyr)
library(haven)

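# Remove participants whose responses are all NAs.
# Every column other than "id" is treated as an item column.
remove_na <- function(df) {
  item_cols <- setdiff(names(df), "id")
  # drop = FALSE keeps a data frame even when only one item column remains
  all_na <- rowSums(is.na(df[, item_cols, drop = FALSE])) == length(item_cols)
  df[!all_na, , drop = FALSE]
}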

# ------ Process Autism-Spectrum Quotient Scale Dataset ------ 
aq_df <- read.csv("./aqALL.csv")
aq_columns <- grep("^aq([1-9]|[1-4][0-9]|50)cont$", names(aq_df), value = TRUE)
aq_columns <- append(aq_columns, "ID")
aq_df <- aq_df[, aq_columns] # Select only detailed responses.

aq_df <- aq_df %>% rename(id=ID)
aq_df <- remove_na(aq_df)
aq_df <- pivot_longer(aq_df, cols=-id, names_to="item", values_to = "resp")

attach_df <- read.csv("./attachmentAGING.csv")
attach_df <- attach_df |>
  select(-ends_with("reversed"), -X, -ends_with("scale")) |>
  rename(id=ID)
attach_df <- remove_na(attach_df)
attach_df <- pivot_longer(attach_df, cols=-id, names_to="item", values_to = "resp")

aq_df <- rbind(aq_df, attach_df)

save(aq_df, file="AASDR_Lodi-Smith_2021_AQ.Rdata")
write.csv(aq_df, "AASDR_Lodi-Smith_2021_AQ.csv", row.names=FALSE)

# ------ Process Alexithymia Dataset ------ 
alex_df <- read.csv("./alexithymiaAGING.csv")
alex_df <- alex_df |>
  select(-X, -alexmean) |>
  rename(id=ID)
alex_df <- pivot_longer(alex_df, cols=-id, names_to="item", values_to="resp")

save(alex_df, file="AASDR_Lodi-Smith_2021_Alexithymia.Rdata")
write.csv(alex_df, "AASDR_Lodi-Smith_2021_Alexithymia.csv", row.names=FALSE)

# ------ Process Big Five Inventory-2 Dataset ------
bfi2_df <- read.csv("./traitsAGING.csv")
bfi2_df <- bfi2_df |>
  select(-X, -starts_with("autism"), -ends_with("r"), -C, -A, -E, -O, -ES) |>
  rename(id=ID)
bfi2_df <- remove_na(bfi2_df)
bfi2_df <- pivot_longer(bfi2_df, cols=-id, names_to="item", values_to="resp")

save(bfi2_df, file="AASDR_Lodi-Smith_2021_BFI2.Rdata")
write.csv(bfi2_df, "AASDR_Lodi-Smith_2021_BFI2.csv", row.names=FALSE)

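# ------ Process BRFSS Dataset ------
# Not processed further yet: the items use inconsistent scales, so nothing is saved for this dataset.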
brfss_df <- read.csv("./brfssAGING.csv")
brfss_df <- brfss_df |>
  select(-X)

# ------ Process Clinical Personality Survey Dataset ------ 
cps_df <- read.csv("./clinicalpersonalityAGING.csv")
cps_df <- cps_df |>
  select(ID, starts_with("pid")) |>
  rename(id=ID)
cps_df <- remove_na(cps_df)
cps_df <- pivot_longer(cps_df, cols=-id, names_to="item", values_to="resp")

save(cps_df, file="AASDR_Lodi-Smith_2021_CPS.Rdata")
write.csv(cps_df, "AASDR_Lodi-Smith_2021_CPS.csv", row.names=FALSE)

# ------ Process Dark Triad Dataset ------
dt_df <- read.csv("./darktraidALL.csv")
dt_df <- dt_df |>
  select(ID, starts_with("s")) |>
  rename(id=ID)
dt_df <- remove_na(dt_df)
dt_df <- pivot_longer(dt_df, cols=-id, names_to="item", values_to="resp")

save(dt_df, file="AASDR_Lodi-Smith_2021_DT.Rdata")
write.csv(dt_df, "AASDR_Lodi-Smith_2021_DT.csv", row.names=FALSE)

# ------ Process Desired Trait Change(TIPI) ------
tipi_df <- read.csv("./tipichangeALL.csv")
tipi_df <- tipi_df |>
  select(-X, -ends_with("change")) |>
  rename(id=ID)
# Shift the original -2..2 scale to 1..5
item_cols <- setdiff(names(tipi_df), "id")
tipi_df[item_cols] <- tipi_df[item_cols] + 3
tipi_df <- remove_na(tipi_df)
tipi_df <- pivot_longer(tipi_df, cols=-id, names_to="item", values_to="resp")

save(tipi_df, file="AASDR_Lodi-Smith_2021_TIPI.Rdata")
write.csv(tipi_df, "AASDR_Lodi-Smith_2021_TIPI.csv", row.names=FALSE)

# ------ Process Goals Dataset ------
goals_df <- read.csv("goalsAll.csv")
goals_df <- goals_df |>
  select(-X) |>
  rename(id=ID)
goals_df <- remove_na(goals_df)
goals_df <- pivot_longer(goals_df, cols=-id, names_to="item", values_to="resp")

save(goals_df, file="AASDR_Lodi-Smith_2021_Goals.Rdata")
write.csv(goals_df, "AASDR_Lodi-Smith_2021_Goals.csv", row.names=FALSE)

# ------ Process Grit Dataset ------
grit_df <- read.csv("gritAGING.csv")
grit_df <- grit_df[ , c("ID", paste0("grit", 1:8))]
grit_df <- grit_df |>
  rename(id=ID)
grit_df <- remove_na(grit_df)
grit_df <- pivot_longer(grit_df, cols=-id, names_to="item", values_to="resp")

save(grit_df, file="AASDR_Lodi-Smith_2021_Grit.Rdata")
write.csv(grit_df, "AASDR_Lodi-Smith_2021_Grit.csv", row.names=FALSE)

# ------ Process Loneliness Dataset ------
loneliness_df <- read.csv("./lonelinessAGING.csv")
loneliness_df <- loneliness_df[, c("ID", paste0("loneliness", 1:3))]
loneliness_df <- loneliness_df |>
  rename(id=ID)
loneliness_df <- remove_na(loneliness_df)
loneliness_df <- pivot_longer(loneliness_df, cols=-id, names_to="item", values_to="resp")

save(loneliness_df, file="AASDR_Lodi-Smith_2021_Loneliness.Rdata")
write.csv(loneliness_df, "AASDR_Lodi-Smith_2021_Loneliness.csv", row.names=FALSE)

# ------ Process PROMIS Dataset ------
promis_df <- read.csv("./promisAGING.csv")
promis_df <- promis_df |>
  select(ID, starts_with("promis29")) |>
  rename(id=ID)
promis_df <- remove_na(promis_df)
promis_df <- pivot_longer(promis_df, cols=-id, names_to="item", values_to="resp")

save(promis_df, file="AASDR_Lodi-Smith_2021_PROMIS.Rdata")
write.csv(promis_df, "AASDR_Lodi-Smith_2021_PROMIS.csv", row.names=FALSE)

# ------ Process Ryff Dataset ------ 
ryff_df <- read.csv("./pwbAGING.csv")
ryff_cols <- c(2, grep("[0-9]$", names(ryff_df)))
ryff_df <- ryff_df[ , ryff_cols]
ryff_df <- ryff_df |>
  rename(id=ID)
ryff_df <- remove_na(ryff_df)
ryff_df <- pivot_longer(ryff_df, cols=-id, names_to="item", values_to="resp")

save(ryff_df, file="AASDR_Lodi-Smith_2021_RYFF.Rdata")
write.csv(ryff_df, "AASDR_Lodi-Smith_2021_RYFF.csv", row.names=FALSE)

# ------ Process Purpose Dataset ------
purpose_df <- read.csv("./purposeALL.csv")
purpose_df <- purpose_df[, c("ID", paste0("spm", 1:6))]
purpose_df <- purpose_df |>
  rename(id=ID)
purpose_df <- remove_na(purpose_df)
purpose_df <- pivot_longer(purpose_df, cols=-id, names_to="item", values_to="resp")

save(purpose_df, file="AASDR_Lodi-Smith_2021_Purpose.Rdata")
write.csv(purpose_df, "AASDR_Lodi-Smith_2021_Purpose.csv", row.names=FALSE)

# ------ Process Resillience Dataset ------
resillience_df <- read.csv("resilienceAGING.csv")
resillience_df <- resillience_df[, c("ID", paste0("brs_", 1:6))]
resillience_df <- resillience_df |>
  rename(id=ID)
resillience_df <- remove_na(resillience_df)
resillience_df <- pivot_longer(resillience_df, cols=-id, names_to="item", values_to="resp")

save(resillience_df, file="AASDR_Lodi-Smith_2021_Resillience.Rdata")
write.csv(resillience_df, "AASDR_Lodi-Smith_2021_Resillience.csv", row.names=FALSE)

# ------ Process Satisfaction with Life Dataset ------
swls_df <- read.csv("swlsAGING.csv")
swls_df <- swls_df[, c("ID", paste0("swls", 1:5))]
swls_df <- swls_df |>
  rename(id=ID)
swls_df <- remove_na(swls_df)
swls_df <- pivot_longer(swls_df, cols=-id, names_to="item", values_to="resp")

save(swls_df, file="AASDR_Lodi-Smith_2021_SWLS.Rdata")
write.csv(swls_df, "AASDR_Lodi-Smith_2021_SWLS.csv", row.names=FALSE)

# ------ Process Self-concept Clarity Dataset ------
scc_df <- read.csv("sccALL.csv")
scc_df <- scc_df[, c("ID", paste0("scc", 1:12))]
scc_df <- scc_df |>
  rename(id=ID)
scc_df <- remove_na(scc_df)
scc_df <- pivot_longer(scc_df, cols=-id, names_to="item", values_to="resp")

save(scc_df, file="AASDR_Lodi-Smith_2021_SCC.Rdata")
write.csv(scc_df, "AASDR_Lodi-Smith_2021_SCC.csv", row.names=FALSE)

# ------ Process Self-Esteem Dataset ------
rse_df <- read.csv("selfesteemALL.csv")
rse_df <- rse_df |>
  select(-X, -rsesmean) |>
  rename(id=ID)
rse_df <- remove_na(rse_df)
rse_df <- pivot_longer(rse_df, cols=-id, names_to="item", values_to="resp")

save(rse_df, file="AASDR_Lodi-Smith_2021_RSE.Rdata")
write.csv(rse_df, "AASDR_Lodi-Smith_2021_RSE.csv", row.names=FALSE)

# ------ Process Social Camouflage Dataset ------
catq_df <- read.csv("./camouflageAGING.csv")
catq_df <- catq_df[, c("ID", paste0("catq", 1:25))]
catq_df <- catq_df |>
  rename(id=ID)
catq_df <- remove_na(catq_df)
catq_df <- pivot_longer(catq_df, cols=-id, names_to="item", values_to="resp")

save(catq_df, file="AASDR_Lodi-Smith_2021_CATQ.Rdata")
write.csv(catq_df, "AASDR_Lodi-Smith_2021_CATQ.csv", row.names=FALSE)

# ------ Process Social Investment Dataset ------
worksi_df <- read.csv("socialinvestmentAGING.csv")
worksi_df <- worksi_df |>
  select(-X, -ends_with("mean")) |>
  rename(id=ID)
worksi_df <- remove_na(worksi_df)
worksi_df <- pivot_longer(worksi_df, cols=-id, names_to="item", values_to="resp")

save(worksi_df, file="AASDR_Lodi-Smith_2021_SI.Rdata")
write.csv(worksi_df, "AASDR_Lodi-Smith_2021_SI.csv", row.names=FALSE)
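
A possible sanity check for any of the long-format outputs, comparing participant and item counts and the response range against the notes above. The toy data frame stands in for a processed file such as the Grit one; its values are invented:

```r
# Toy long-format data in the same id/item/resp layout as the processed files.
long_df <- data.frame(
  id = c(1, 1, 2, 2),
  item = c("grit1", "grit2", "grit1", "grit2"),
  resp = c(5, 3, NA, 1)
)

# Number of distinct participants and items in the file.
n_participants <- length(unique(long_df$id))
n_items <- length(unique(long_df$item))

# Observed response range, ignoring missing values.
resp_range <- range(long_df$resp, na.rm = TRUE)
```

Comparing `resp_range` with the documented scale (for example 1 to 5) would flag a column that was not rescaled or that carries stray codes.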

ben-domingue commented 2 months ago

Fantastic. Don't worry about those remaining 4; anything of marginal quality we should drop. send a pull request @KingArthur0205

KingArthur0205 commented 2 months ago

PR for this issue: https://github.com/ben-domingue/irw/pull/321

ben-domingue commented 2 months ago

thank you for posting the zipped version of the data. so much easier.

KingArthur0205 commented 2 months ago

> thank you for posting the zipped version of the data. so much easier.

I will do so in the future (both zipped and individual datasets). ;)

ben-domingue commented 2 months ago

Email sent to first author re license.
