ddotta / parquetize

R package that converts databases of different formats to Parquet format
https://ddotta.github.io/parquetize/

table_to_parquet: SPSS-file is not correctly converted to .parquet when it has user defined missings #40

Closed Schakel17 closed 1 year ago

Schakel17 commented 1 year ago

The `user_na` argument of haven::read_sav and haven::read_spss defaults to FALSE. I would like an option to override this argument, or to have the default set to TRUE, because user-defined missing values in the .sav file are currently converted to NA.

nbc commented 1 year ago

Hi @Schakel17,

I'm not familiar with the SAV and SPSS formats. Can you send a small sample of such a file?

Schakel17 commented 1 year ago

Hi @ddotta

I took a file from the SPSS sample folder that has user-defined missing values (e.g., variable 61, named "reason"). To take user-defined missing values into account, I do the following with the R package haven: df <- read_sav(filename, user_na = TRUE). According to the haven package, only SPSS supports user-defined missing values. I have no experience with SAS and Stata. If user_na = FALSE (haven's default), haven converts all user-defined missing values to NA. By the way, PSPP can be used to view the attached sample file, but it is inferior to SPSS.

Sample file: customer_dbase.zip
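
The `user_na` behaviour described above can be reproduced without the attached file. The following is a minimal sketch, assuming the haven, tibble and vctrs packages are installed; the toy column and its labels are made up for illustration and are not taken from customer_dbase.sav.

```r
library(haven)

# Hypothetical toy data: the value 9 is declared as a user-defined missing value.
df <- tibble::tibble(
  reason = labelled_spss(
    c(1, 2, 9),
    labels = c(Price = 1, Service = 2, Refused = 9),
    na_values = 9
  )
)

# Round-trip through a temporary .sav file.
path <- tempfile(fileext = ".sav")
write_sav(df, path)

default_read <- read_sav(path)                  # user_na = FALSE (default)
tagged_read  <- read_sav(path, user_na = TRUE)

# Underlying numeric values: with the default, the 9 should come back as NA;
# with user_na = TRUE the 9 should be kept.
vctrs::vec_data(default_read$reason)
vctrs::vec_data(tagged_read$reason)

# Even when the 9 is kept, haven still reports it as missing.
is.na(tagged_read$reason)
```

With `user_na = TRUE` the original code survives the import, so it can still be written out or recoded later, while `is.na()` continues to treat it as missing.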

ddotta commented 1 year ago

Hi @Schakel17, and thanks for using parquetize,

With your sample SPSS file:

library(haven)
tableF <- haven::read_sav(filename, col_select = "reason", user_na = FALSE)
tableT <- haven::read_sav(filename, col_select = "reason", user_na = TRUE)
> class(tableF$reason)
[1] "haven_labelled" "vctrs_vctr" "double"        
> class(tableT$reason)
[1] "haven_labelled_spss" "haven_labelled" "vctrs_vctr" "double"

I see the same NA values in R in both cases, but the classes differ, so I guess user_na = TRUE gives you what you need? We could add an SPSS-specific parameter to table_to_parquet(); the request is understandable. What do you think, @nbc?

Schakel17 commented 1 year ago

Hi @ddotta,

I assume you do not want separate functions for SPSS, SAS, and Stata. In that case I would prefer an SPSS-specific argument that only takes effect when the input file has the extension .sav.
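
An argument that only takes effect for .sav input could look roughly like the sketch below. This is not parquetize's actual code: `read_stat_file()` is a hypothetical helper, shown only to illustrate forwarding `user_na` to the SPSS reader and to no other reader. It assumes the haven package is installed.

```r
library(haven)

# Hypothetical helper, NOT parquetize's implementation.
# user_na is passed on only when the file extension is .sav.
read_stat_file <- function(path, user_na = FALSE) {
  ext <- tolower(tools::file_ext(path))
  switch(
    ext,
    sav      = read_sav(path, user_na = user_na),  # SPSS: user_na applies
    sas7bdat = read_sas(path),                     # SAS: user_na is ignored
    dta      = read_dta(path),                     # Stata: user_na is ignored
    stop("Unsupported file extension: ", ext)
  )
}
```

Keeping a single entry point and branching on the extension avoids one function per format, which is the design preference expressed above.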

Schakel17 commented 1 year ago

Hi @ddotta,

I tested the updated table_to_parquet() function with user_na = TRUE on an SPSS file and received the following error: "Error in write_parquet(data, sink = path_to_parquet, compression = compression, : unused argument (user_na = TRUE)". What is the problem?

ddotta commented 1 year ago

Hi @Schakel17,

I can't reproduce it. This code works on my computer.

table_to_parquet(path_to_file = "U:/customer_dbase.sav",
                 path_to_parquet = "U:/customer_dbase.parquet",
                 user_na = TRUE)

Are you sure you have the latest version of parquetize? And can you provide a reproducible example?
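
One way to check whether the installed version already knows the new argument is to inspect the function's formal arguments. The sketch below uses a small generic helper, `has_argument()`, which is a hypothetical name introduced here; the parquetize check only runs if the package is installed.

```r
# Generic check: does a function declare a given named argument?
has_argument <- function(fn, arg) {
  arg %in% names(formals(fn))
}

if (requireNamespace("parquetize", quietly = TRUE)) {
  # Installed version of the package.
  print(packageVersion("parquetize"))
  # TRUE if this installation's table_to_parquet() declares user_na.
  print(has_argument(parquetize::table_to_parquet, "user_na"))
}
```

If the second value is FALSE, the installed version most likely predates the change, and reinstalling the package would be the first thing to try.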

Schakel17 commented 1 year ago

Hi @ddotta,

I think I did not have the latest version. After deleting the old installation I no longer receive any error, and the output is exactly the same as the input.

ddotta commented 1 year ago

Great news!