American-Institutes-for-Research / EdSurvey

https://american-institutes-for-research.github.io/EdSurvey/
GNU General Public License v2.0

countries = "*" Function #58

Closed nirmalghimire closed 1 year ago

nirmalghimire commented 1 year ago

Is your feature request related to a problem? Please describe.

Currently, when I use the countries = "*" argument of readPISA to analyze PISA 2018 data, EdSurvey returns a separate data set for each country. However, I need a single compiled data set instead.

Describe the solution you'd like

I would appreciate it if new functionality could be added to the EdSurvey package that allows users to easily compile data from multiple countries. This would enable us to combine the data sets and perform analysis on the combined data. Specifically, I would like to be able to use the fascinating functions available in the package after combining the data sets.

Describe alternatives you've considered

I have attempted to use the do.call(rbind, ()) function to combine the data sets obtained for different countries. However, this approach did not work on the edsurvey.dataframe objects. Therefore, I am seeking an alternative solution or feature within the EdSurvey package itself to achieve the desired data compilation.

Additional context

Here's my code snippet:

eds_pisa <- EdSurvey::readPISA(path = "path/PISA/2018", database = "INT", countries = "*", cognitive = "score", verbose = FALSE)

nirmalghimire commented 1 year ago
# initialize an empty list to store data frames
df_list <- list()

for(i in 1:length(eds_pisa)) {
  # convert each edsurvey data frame list to a data frame
  df <- EdSurvey::edsurvey.data.frame(eds_pisa[[i]], pvvars = c("read"))
  # store each data frame in the list
  df_list[[i]] <- df
}

# bind all data frames in the list into a single data frame
compiled_data <- do.call(rbind, df_list)

[edited by PV to put code in code block.]

tomfink commented 1 year ago

Greetings @nirmalghimire!

Thanks for your inquiry. The main issue is that your code tries to rebuild edsurvey.data.frame objects that readPISA has already built and stored in the edsurvey.data.frame.list it returns. Once you are more familiar with the edsurvey.data.frame.list object, the issues you are having should be easier to resolve. Additional details can be found in the EdSurvey User Guide.

When readPISA is called with countries = "*", it returns an edsurvey.data.frame.list object that already contains every country as an edsurvey.data.frame object:

eds_pisa <- EdSurvey::readPISA(path = "path/PISA/2018", database = "INT", countries = "*", cognitive = "score", verbose = FALSE)

An edsurvey.data.frame.list object is a list of two components:

1) datalist, a list that holds all of the edsurvey.data.frame objects (80 in the case of PISA 2018).
2) covs, a data.frame containing the covariates of the list items in the datalist for this edsurvey.data.frame.list object.

All EdSurvey analysis functions work with edsurvey.data.frame.lists and will return a list of the result objects.

For example, the summary2 function can be passed the edsurvey.data.frame.list object directly (the easiest method):

summaryRes <- summary2(data = eds_pisa, variable = "read")
names(summaryRes) <- eds_pisa$covs$country #name the result items by country

#print the result to console
summaryRes

#remove results that had an error/no data (Vietnam in this instance)
summaryRes$VIETNAM <- NULL

#extract just the summary data.frame from result list
summaryListDF <- lapply(summaryRes, function(x){
                            x$summary
                        })
summaryStacked <- do.call(rbind, summaryListDF)
summaryStacked$Country <- names(summaryRes)

View(summaryStacked)
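
The stacking step relies on each country contributing exactly one summary row. As a minimal base-R sketch of the same idea, the toy example below attaches the country label to each summary before binding, so it would also work if an element had several rows. The `toyRes` list, its values, and the assumption that a failed country shows up as NULL are all invented for illustration; this is not EdSurvey output.

```r
# Toy stand-in for a named list of per-country results; all values are invented
toyRes <- list(
  ALBANIA = list(summary = data.frame(N = 10, Mean = 1.5)),
  BRAZIL  = list(summary = data.frame(N = 20, Mean = 2.5)),
  CANADA  = NULL  # assumed shape of a country whose analysis produced no result
)

# Drop elements with no result, without hard-coding their names
toyRes <- Filter(Negate(is.null), toyRes)

# Attach the country label to each summary before stacking
toyList <- Map(function(res, cntry) {
  df <- res$summary
  df$Country <- cntry
  df
}, toyRes, names(toyRes))

toyStacked <- do.call(rbind, toyList)
rownames(toyStacked) <- NULL  # discard the list names that rbind used as row names
toyStacked
```

Labelling before binding avoids depending on the order of names in the result list matching the row order of the stacked data frame.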

If you want more fine-grained control, you can loop through the edsurvey.data.frame.list item by item, as demonstrated below. This is more complex, but it gives you the most control.

resList <- vector("list", length = length(eds_pisa$datalist))
summaryStacked <- NULL

for(i in seq_along(eds_pisa$datalist)){

  esdf <- eds_pisa$datalist[[i]] #grab one edsurvey.data.frame at a time
  cntry <- eds_pisa$covs$country[[i]] #grab the country name from the covariates

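  #assign the value of tryCatch itself: an assignment inside the handler would be local to it
  summaryRes <- tryCatch(summary2(data = esdf, variable = "read"),
           error = function(e){
             message(cntry, " skipped. Error: ", conditionMessage(e))
             NULL #returned by tryCatch, so summaryRes is NULL for a failed country
           })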

  if(is.null(summaryRes)){
    next
  }

  summaryDF <- summaryRes$summary
  summaryDF$Country <- cntry
  summaryStacked <- rbind(summaryStacked, summaryDF)
}

View(summaryStacked)
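
One detail of this loop pattern is worth spelling out: an assignment made inside an `error` handler is local to the handler function, so the usual way to get NULL for a failed item is to assign the value returned by tryCatch itself. The base-R sketch below shows the pattern with a hypothetical `analyzeOne` function that stands in for a real analysis call and fails for one input; none of these names come from EdSurvey.

```r
# Hypothetical analysis function that fails for one input
analyzeOne <- function(x) {
  if (x == "B") stop("no data available")
  data.frame(Mean = nchar(x))
}

toyStacked <- NULL
skipped <- character(0)

for (id in c("A", "B", "C")) {
  # tryCatch returns the handler's value on error, so res is NULL for a failure
  res <- tryCatch(
    analyzeOne(id),
    error = function(e) {
      message(id, " skipped. Error: ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(res)) {
    skipped <- c(skipped, id)  # keep track of what was dropped
    next
  }
  res$Id <- id
  toyStacked <- rbind(toyStacked, res)
}

toyStacked
```

Recording the skipped items in a vector makes it easy to report afterwards which inputs were left out of the stacked result.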

tomfink commented 1 year ago

Also, related to your inquiry: we have experienced very bad lag, slowness, and crashing when working with the full PISA 2018 dataset in RStudio. We are still investigating, but a workaround is to use an IDE other than RStudio (e.g., RGui or VSCode). We also had success with the 'Electron' preview of RStudio.

pdbailey0 commented 1 year ago

This looks resolved to me. @nirmalghimire, let us know if you have any other questions.