Open ofajardo opened 2 years ago
Here is another file with a similar issue; this file has apparently been created using SPSS (not the DLLs as in the previous example). The variable whose name has been truncated is XC0DAB1_1 (truncated to XC0DAB1); it is the variable in position 85 (counting from 1). Again, PSPP reads the variable correctly.
Yes - I opened it in SPSS and saved it again.
I am experiencing an error that seems related to this.
I am sorry, but I cannot share the (customer's) data file, and haven't been able (had the time) to generate a synthesized example file that triggers the bug. However, I have been able to narrow down the issue a little bit:
The file was written with pyreadstat (it has 487 columns and 530348 rows); let's call it file_broken.sav. Some of the columns/variables have names with Norwegian letters, like "æ", "ø", and "å". When writing that file, the input column to the pyreadstat.write_sav() function is named "forn_1" with lowercase letters (checked in-memory with a debugger). Once file_broken.sav has been written to disk, the column has (automatically) been renamed to "FORN_1" in uppercase (this is strange). I have checked this by reading the file with both readstat and pyreadstat. When I open file_broken.sav with the SPSS program, the variable is shown as "forn_1" in lowercase, and if I save the exact same file from SPSS as file_ok.sav, the variable on disk is no longer in uppercase. So, trying to see whether the error is caused by pyreadstat or readstat, I tried the following, using freshly compiled (C) readstat and extract_metadata binaries:
./extract_metadata file_ok.sav file_ok-metadata.json
This produces file_ok-metadata.json. Hence, reading the file seems to work, and so does writing the metadata separately.
./readstat file_ok.sav file_ok.csv
Converted 489 variables and 88013 rows in 4.49 seconds
Error processing file_ok.sav: Unable to convert string to the requested encoding (invalid byte sequence)
./readstat file_broken.sav file_broken.csv
sed 's/"FORN_1"/"forn_1"/g' file_broken.csv > file_ok.csv
Writing with readstat (the C version) by combining data and metadata into a new file:
./readstat file_ok.csv file_ok-metadata.json output.sav
./extract_metadata output.sav output-metadata.json
output-metadata.json shows that the variable is now named FORN_1.
When I wrote this I was surprised by the error during my first attempt at converting data from .sav to .csv. I guess I will inspect the data file around row 88013.
I am sorry that I cannot provide a reproducible error report, but thought that this might shed some light on where to look for the cause of this bug.
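The "forn_1" versus "FORN_1" check described above can be automated for all columns at once. The following is only a rough sketch: the helper names find_case_changes and sav_column_names are made up for illustration, and the second one assumes pyreadstat is installed and that reading only the metadata is enough to see the renamed variable.

```python
def find_case_changes(written, read_back):
    """Return (written_name, read_name) pairs that differ only in letter case."""
    read_set = set(read_back)
    # Map the lowercased form of each name read from disk to its actual spelling.
    by_lower = {}
    for name in read_back:
        by_lower.setdefault(name.lower(), name)

    changes = []
    for name in written:
        if name in read_set:
            continue  # unchanged on disk
        other = by_lower.get(name.lower())
        if other is not None:
            changes.append((name, other))
    return changes


def sav_column_names(path):
    """Hypothetical helper: list the variable names stored in a .sav file."""
    import pyreadstat  # third-party; imported lazily so the comparison runs without it

    # metadataonly=True skips the data rows and reads just the metadata.
    _, meta = pyreadstat.read_sav(path, metadataonly=True)
    return list(meta.column_names)
```

With the data frame still in memory, something like find_case_changes(list(df.columns), sav_column_names("file_broken.sav")) would list every column whose case changed on the way to disk, not just the one that was noticed first.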
When reading the attached file, there should be a variable named "BRANDAA_SUN_1", but I get "BRANDAA" instead. PSPP can read the variable name correctly. I think the file was created using the IBM SPSS DLL files instead of the full application. If the file is opened in SPSS and saved, it is then read correctly. I have tested with a simple C program that the issue is indeed coming from ReadStat:
test.SAV.zip
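To look for other truncated names in the same file, the names that were read can be compared against a list of expected names (for example, typed in from what PSPP shows). This is a minimal sketch, not part of the original test program; find_truncated is a hypothetical helper, and it treats a name as truncated only when a strictly shorter prefix of it shows up in the file without being an expected name itself.

```python
def find_truncated(expected, actual):
    """Return (expected_name, truncated_name) pairs.

    A name counts as truncated when it is absent from `actual` but a
    strictly shorter prefix of it is present and is not itself expected.
    """
    expected_set = set(expected)
    actual_set = set(actual)
    pairs = []
    for name in expected:
        if name in actual_set:
            continue  # read back correctly
        # Leftover names in the file that are proper prefixes of the expected name.
        prefixes = [
            a for a in actual
            if a and a not in expected_set
            and len(a) < len(name) and name.startswith(a)
        ]
        if prefixes:
            # Prefer the longest prefix as the most likely truncation.
            pairs.append((name, max(prefixes, key=len)))
    return pairs
```

For the two files in this thread, the expected lists would contain "BRANDAA_SUN_1" and "XC0DAB1_1", and the actual lists would be whatever the reader under test returns.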
original report: https://github.com/Roche/pyreadstat/issues/165