ben-domingue / irw

Code related to data for the Item Response Warehouse
https://datapages.github.io/irw/

The Perceptions of Bilingualism Scales #212

Closed ben-domingue closed 1 week ago

ben-domingue commented 1 week ago

https://osf.io/preprints/psyarxiv/s32zb https://osf.io/yw7t2/?view_only=

KingArthur0205 commented 1 week ago

This paper includes 2 datasets for 2 distinct studies.

Note: Although the items follow a similar naming format, the asked questions are totally different in the 2 datasets. Details can be found in Table 2 and Table 7 in the preprint.

Dataset 1 has a questionnaire that measures how adults in the US perceive the value of bilingualism. It includes a total of 422 participants from 2 samples (210 + 212), which matches the provided dataset.

The following columns are deleted as they are either demographic data or aggregate statistics: race, bornUS, yrs_us, imm_age, region, KIDItotalpts (Knowledge of Infant Development Index Total Points), femchild, p1spks_ch (the proportion of English and another language that Parent 1 speaks to the child), p2spks_ch, L1EngNoL2, L1_derived (First Language), L2_derived (Second Language), and p1spksch_imputed (imputes Only English (1) for all parents who said that their child was not exposed to a language other than English).

Dataset 2 has a questionnaire that measures how adults in the US perceive the value of bilingualism for their children. The paper reports a total of 321 participants, whereas the provided dataset contains only 319.

The following columns are deleted as they are either demographic data or aggregate statistics: yearsed (years of education), spansurv (whether they took the survey in Spanish), bornUS, region, zipcode, homeusage, langback, age_cat, meanPOB10, perclote_2016 (2016 percentage that speaks a language other than English), logplote_2016 (2016 percentage that speaks English), mturk, meanPOBp_6item, tellstories, talkchild, sing, and read.

KingArthur0205 commented 1 week ago

Data: PBS_Surrain_2019_PoB.csv

Code:

# Paper: https://osf.io/preprints/psyarxiv/s32zb
# Data: https://osf.io/yw7t2/?view_only=
library(dplyr)
library(tidyr)
library(haven)

# ------ Process Dataset 1 ------
study1_df <- read_dta("./2019_07_07_PoB_OSF.dta")
# Strip any parenthesized suffix from the column names
colnames(study1_df) <- gsub("\\s*\\(.*\\)", "", colnames(study1_df))
# Drop the variable labels that haven attaches to each column
study1_df <- lapply(study1_df, function(x) { attr(x, "label") <- NULL; x })
study1_df <- as.data.frame(study1_df)

study1_df <- study1_df |>
  select(-qualtrics, -mturk, -yearsed, -p_ed, -spansurv, -female, -bornUS, 
         -yrs_us, -imm_age, -region, -KIDItotalpts, -femchild, 
         -p1spks_ch, -p2spks_ch, -parent, -L1EngNoL2, -p1spksch_imputed, 
         -L1_derived, -L2_derived, -race) |>
  rename(age=p_age)
# Reshape to long format: one row per (id, item) response
study1_df <- pivot_longer(study1_df, cols=-c(id, age), names_to="item", values_to="resp")

# ------ Process Dataset 2 ------
study2_df <- read_dta("./PoB_TPS_2022_OSF_keyvars.dta")
# Strip any parenthesized suffix from the column names
colnames(study2_df) <- gsub("\\s*\\(.*\\)", "", colnames(study2_df))
# Drop the variable labels that haven attaches to each column
study2_df <- lapply(study2_df, function(x) { attr(x, "label") <- NULL; x })
study2_df <- as.data.frame(study2_df)

study2_df <- study2_df |>
  select(-yearsed, -spansurv, -female, -bornUS, -region, -zipcode,
         -homeusage, -langback, -perclote_2016, -logplote_2016, -mturk, 
         -meanPOBp_6item, -age_cat, -tellstories, -talkchild, -sing, -read,
         -MUNE_reads_dich, -RQ2_sample, -meanPOB10) |>
  rename(id=ResponseID, age=p_age)
# Reshape to long format: one row per (id, item) response
study2_df <- pivot_longer(study2_df, cols=-c(id, age), names_to="item", values_to="resp")

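# ------ Merge Datasets ------
# Cast Study 1 ids to character so the id columns are compatible in bind_rows()
study1_df$id <- as.character(study1_df$id)
# Stack both studies and record each row's origin in a group column
df <- bind_rows(
  study1_df |> mutate(group = "Study 1"),
  study2_df |> mutate(group = "Study 2")
)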

save(df, file="PBS_Surrain_2019_PoB.Rdata")
write.csv(df, "PBS_Surrain_2019_PoB.csv", row.names=FALSE)

ben-domingue commented 1 week ago

I'm a little confused here. They seem to have an 8 item scale (POB+) and a 13 item scale (POB). When I look at items I see 17 for POB+ and 22 for POB. Something is off. I'm noting an 'age' and a 'race' item which are definitely wrong but I'm not entirely sure how we went from 8->17 and 13->22.

@KingArthur0205 flagging some issues

KingArthur0205 commented 1 week ago

@ben-domingue This dataset contains a set of "POB" questions and an additional set of 8 "POB+" questions for each of the datasets, which are the POBPlus_x columns in the dataset.

For study 2, there are 10 POB questions. Adding the 8 POB+ items makes a total of 18 items. For study 1, there are 13 POB questions. Adding the 8 POB+ items makes a total of 21 items.

Both the POB questions and the POB+ questions apply a 6-point scale.
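
The item counts discussed above could be checked directly on the long-format file. The following is a minimal sketch on made-up rows, not the real data: the data frame `df_check` and the item names `POB_1` and `POBPlus_1` are illustrative assumptions, and only the column layout (id, item, resp, group) follows the merged file.

```r
# Hypothetical long-format rows with the same columns as the merged file
df_check <- data.frame(
  id    = c("1", "1", "1", "2", "2"),
  item  = c("POB_1", "POBPlus_1", "age", "POB_1", "POBPlus_1"),
  resp  = c(3, 5, 34, 2, 6),
  group = c("Study 1", "Study 1", "Study 1", "Study 2", "Study 2"),
  stringsAsFactors = FALSE
)

# Number of distinct items seen in each group
items_per_group <- tapply(df_check$item, df_check$group,
                          function(x) length(unique(x)))
print(items_per_group)

# Anything in the item column that does not look like a POB/POB+ item
# (e.g. a demographic variable that slipped through the reshape)
all_items  <- unique(df_check$item)
suspicious <- setdiff(all_items, grep("^POB", all_items, value = TRUE))
print(suspicious)
```

A count that is higher than expected, or a non-empty `suspicious` vector, would point to covariates being pivoted into the item column.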

ben-domingue commented 1 week ago

Two questions:

  1. age and race still need to go though, correct?
  2. are the POB+ items the same across both? if so i might be inclined to jam this into a single dataset (while perhaps including a little info about the sample).

if the answer to 2 is yes let me look at the raw data a little more. thank you!!

KingArthur0205 commented 1 week ago

  1. age and race still need to go though, correct?

Yes, sorry I need to wake up. This has already been corrected in the cell above.

  2. are the POB+ items the same across both? if so i might be inclined to jam this into a single dataset (while perhaps including a little info about the sample).

Yes, the POB+ questions are the same across the 2 datasets. However, the POB questions are different. Perhaps we can merge the POB+ questions into an additional dataset. Participants in study 1 didn't take POB5+ and POB7+, but the remaining POB+ questions are identical.

ben-domingue commented 1 week ago

i'd actually vote to just put them all in the same dataset. let me make a general observation as it might help. suppose we have 3 blocks of items, x1 x2 x3 and two groups of people g1 g2. if we then have a case wherein

,x1,x2,x3  
g1,X,X,0  
g2,0,X,X  

such that the two groups overlap only on x2 (X = block administered, 0 = not administered), there are lots of ways of 'equating' things such that all the items are on the same scale. this depends on the x1 and x3 items being somewhat interchangeable which i'd argue that they are here (more or less). thus i'd be inclined to just chunk everything together.
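
A toy version of the design in the table above, as a rough sketch only: the helper `make_group`, the ids, and the random responses are all made up for illustration. It stacks two groups into one long-format data frame and finds the block that every group answered, which is what any linking would rest on.

```r
# Toy illustration (hypothetical data): two groups, three item blocks.
# g1 answers blocks x1 and x2; g2 answers blocks x2 and x3.
set.seed(1)
make_group <- function(group, ids, items) {
  long <- expand.grid(id = ids, item = items, stringsAsFactors = FALSE)
  long$resp  <- sample(1:6, nrow(long), replace = TRUE)  # 6-point scale
  long$group <- group
  long
}
g1 <- make_group("g1", paste0("a", 1:3), c("x1", "x2"))
g2 <- make_group("g2", paste0("b", 1:3), c("x2", "x3"))
df_toy <- rbind(g1, g2)

# Cross-tabulate group by item: zero cells are blocks not administered
design <- table(df_toy$group, df_toy$item)
print(design)

# Items answered by every group are the common (linking) items
common_items <- colnames(design)[colSums(design > 0) == nrow(design)]
print(common_items)
```

Here only x2 is shared, mirroring the role the POB+ items would play if both studies are kept in a single dataset.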

KingArthur0205 commented 1 week ago

Understood. In that case we might need to rename the POB questions. Maybe insert their labels (the full question text) into the names of those POB items :)
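
One possible way to keep the study-specific POB items from colliding after the merge, sketched on made-up rows. The suffix scheme (`_s1`, `_s2`) and the item names are assumptions for illustration, not what was actually done in the repository; items starting with `POBPlus` are treated as shared and left unchanged.

```r
# Hypothetical long-format rows; item names are illustrative only
df_names <- data.frame(
  id    = c("1", "1", "2", "2"),
  item  = c("POB_1", "POBPlus_1", "POB_1", "POBPlus_1"),
  resp  = c(4, 5, 2, 6),
  group = c("Study 1", "Study 1", "Study 2", "Study 2"),
  stringsAsFactors = FALSE
)

# POB+ items are shared across studies, so their names stay as they are
is_shared <- grepl("^POBPlus", df_names$item)

# Tag the remaining items with their study, e.g. "POB_1" -> "POB_1_s1"
study_tag <- paste0("s", sub("^Study ", "", df_names$group))
df_names$item <- ifelse(is_shared,
                        df_names$item,
                        paste0(df_names$item, "_", study_tag))
print(df_names)
```

Using the full question text instead of a study tag would work the same way, with a lookup table from old item name and study to label.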

KingArthur0205 commented 1 week ago

Code and data updated in the post above. I have added an extra column, group, to indicate which study each row comes from.

ben-domingue commented 1 week ago

this structure is great!

ben-domingue commented 1 week ago

done and i added code. nice!

KingArthur0205 commented 1 week ago

done and i added code. nice!

Thx! I was just about to create a PR. :)

KingArthur0205 commented 1 week ago

Should probably do it faster next time:)

ben-domingue commented 1 week ago

no worries ;) i don't know how messy the PR is. i can also add code from my end if it is work on your side.

KingArthur0205 commented 1 week ago

no worries ;) i don't know how messy the PR is. i can also add code from my end if it is work on your side.

No no no, it doesn't bother me at all. I actually enjoy creating PRs cz it gives me this weird sense of achievement from getting something done. :)

ben-domingue commented 1 week ago

ha! i shall leave it to you then