WizardMac / ReadStat

Command-line tool (+ C library) for converting SAS, Stata, and SPSS files 💾
MIT License

cannot read correctly variable name #268

Open ofajardo opened 2 years ago

ofajardo commented 2 years ago

When reading the attached file, there should be a variable named "BRANDAA_SUN_1", but I get "BRANDAA" instead. PSPP reads the variable name correctly. I think the file was created using the IBM SPSS DLL files instead of the full application. If the file is opened in SPSS and saved again, it is then read correctly. I have confirmed with a simple C program that the issue is indeed coming from ReadStat:

#include <stdio.h>
#include "readstat.h"

int handle_metadata(readstat_metadata_t *metadata, void *ctx) {
    int *my_count = (int *)ctx;

    *my_count = readstat_get_row_count(metadata);

    return READSTAT_HANDLER_OK;
}

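/* Called once per variable; prints the name as ReadStat reports it. */
int handle_variable(int index, readstat_variable_t *variable, const char *val_labels, void *ctx)
{
    const char *var_name = readstat_variable_get_name(variable);
    printf("Variable: %s\n", var_name);

    return READSTAT_HANDLER_OK;
}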

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    int my_count = 0;
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_metadata_handler(parser, &handle_metadata);
    readstat_set_variable_handler(parser, &handle_variable);

    error = readstat_parse_sav(parser, argv[1], &my_count);

    readstat_parser_free(parser);

    if (error != READSTAT_OK) {
        printf("Error processing %s: %d\n", argv[1], error);
        return 1;
    }
    printf("Found %d records\n", my_count);
    return 0;
}

test.SAV.zip

original report: https://github.com/Roche/pyreadstat/issues/165
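
To check a larger file for this kind of truncation, the names printed by the program above can be compared against the names expected from PSPP or SPSS. The following is only a minimal sketch: `looks_truncated` is a hypothetical helper, not part of ReadStat, and it assumes that "truncated" means the name as read is a strict prefix of the expected name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a ReadStat function).
 * Returns 1 if `read_name` is a strict prefix of `expected_name`,
 * i.e. the name as read looks like a truncated form of the expected one.
 * Returns 0 if the names are equal or differ in any other way.
 * Note: an empty `read_name` counts as a prefix of any non-empty name. */
int looks_truncated(const char *read_name, const char *expected_name) {
    assert(read_name != NULL && expected_name != NULL);

    size_t read_len = strlen(read_name);
    size_t expected_len = strlen(expected_name);

    return read_len < expected_len &&
           strncmp(read_name, expected_name, read_len) == 0;
}
```

With the names from this report, `looks_truncated("BRANDAA", "BRANDAA_SUN_1")` returns 1, while two identical names return 0. The comparison is case-sensitive, so a name that differs only in case is not flagged.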

ofajardo commented 10 months ago

Here is another file with a similar issue. This file has apparently been created using SPSS (not the DLLs as in the previous example). Here the variable whose name has been truncated is XC0DAB1_1 (truncated to XC0DAB1); it is the variable in position 85 (counting from 1). Again, PSPP reads the variable correctly.

original report

CRO_MX.zip

zenelba commented 10 months ago

> Here is another file with a similar issue. This file has apparently been created using SPSS (not the DLLs as in the previous example). Here the variable whose name has been truncated is XC0DAB1_1 (truncated to XC0DAB1); it is the variable in position 85 (counting from 1). Again, PSPP reads the variable correctly.

Yes - I opened it in SPSS and saved it again.

mtr commented 9 months ago

I am experiencing an error that seems related to this.

I am sorry, but I cannot share the (customer's) data file, and I haven't had the time to generate a synthesized example file that triggers the bug. However, I have been able to narrow down the issue a little:

  1. I write a file using pyreadstat (which has 487 columns and 530348 rows), let's call it file_broken.sav. Some of the columns/variables have names with Norwegian letters, like “æ”, “ø”, and “å”. When writing that file, the input column to the pyreadstat.write_sav() function is named "forn_1" with lowercase letters (checked in-memory with debugger).
  2. When file_broken.sav has been written to disk, the column has (automatically) been renamed to "FORN_1" in uppercase (this is strange). I have checked this by reading the file with both readstat and pyreadstat.
  3. If I open file_broken.sav with the SPSS program, the variable is shown as "forn_1" in lowercase, and if I save the exact same file from SPSS as file_ok.sav, the variable on disk is no longer in uppercase.

To see whether the error is caused by pyreadstat or ReadStat, I tried the following, using freshly compiled (C) readstat and extract_metadata binaries:

  1. Extract the OK metadata:
    ./extract_metadata file_ok.sav file_ok-metadata.json
  2. Verify that only "forn_1" and not "FORN_1" is present in file_ok-metadata.json. Hence, reading the file and writing the metadata separately both seem to work.
  3. (First [failed] attempt) Create a CSV version of the datafile:
    ./readstat file_ok.sav file_ok.csv
    Converted 489 variables and 88013 rows in 4.49 seconds
    Error processing file_ok.sav: Unable to convert string to the requested encoding (invalid byte sequence)
  4. (Second attempt) Create a CSV version of the datafile by manually renaming from "FORN_1" -> "forn_1":
    ./readstat file_broken.sav file_broken.csv
    sed 's/"FORN_1"/"forn_1"/g' file_broken.csv > file_ok.csv
  5. Isolate writing of the file using readstat (the C version) by combining data and metadata into new file:
    ./readstat file_ok.csv file_ok-metadata.json output.sav
  6. Extracting the new metadata:
    ./extract_metadata output.sav output-metadata.json
  7. Opening the new output-metadata.json shows that the variable is now named FORN_1.
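
One possible place to look (an assumption on my part, not something confirmed in this thread): according to the PSPP documentation of the system file format, a .sav file stores a short name of at most 8 bytes for each variable, plus an optional "long variable names" record containing tab-separated SHORT=long pairs. If the long name for a variable is missing from that record, or is not matched when reading, a reader would presumably fall back to the short name, which could look like an uppercased or truncated name. The sketch below only illustrates that lookup; `lookup_long_name` is a hypothetical helper and is not how ReadStat itself is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a ReadStat function).
 * `record` is a tab-separated list of SHORT=long pairs, as described for the
 * long variable names record. Looks up `short_name` (case-sensitive) and
 * copies the matching long name into `out`, truncating to fit `out_size`.
 * Returns 1 if a mapping was found. Otherwise copies `short_name` into `out`
 * as a fallback and returns 0. */
int lookup_long_name(const char *record, const char *short_name,
                     char *out, size_t out_size) {
    assert(record != NULL && short_name != NULL);
    assert(out != NULL && out_size > 0);

    size_t key_len = strlen(short_name);
    const char *p = record;

    while (*p != '\0') {
        /* Find the end of the current SHORT=long pair. */
        const char *end = strchr(p, '\t');
        if (end == NULL) {
            end = p + strlen(p);
        }

        /* Split the pair at the first '=' and compare the key. */
        const char *eq = memchr(p, '=', (size_t)(end - p));
        if (eq != NULL && (size_t)(eq - p) == key_len &&
            strncmp(p, short_name, key_len) == 0) {
            size_t val_len = (size_t)(end - eq - 1);
            if (val_len >= out_size) {
                val_len = out_size - 1;
            }
            memcpy(out, eq + 1, val_len);
            out[val_len] = '\0';
            return 1;
        }

        /* Skip the tab separator, if any, and continue with the next pair. */
        p = (*end == '\t') ? end + 1 : end;
    }

    /* No mapping found: fall back to the short name. */
    strncpy(out, short_name, out_size - 1);
    out[out_size - 1] = '\0';
    return 0;
}
```

For a record string such as "FORN_1=forn_1", looking up "FORN_1" yields "forn_1", whereas a short name with no entry comes back unchanged. Comparing the pairs actually present in file_broken.sav and file_ok.sav might show whether the mapping is lost on write or on read.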

When I wrote this I was surprised by the error during my first attempt at converting data from .sav to .csv. I guess I will inspect the data file around row 88013.

I am sorry that I cannot provide a reproducible error report, but thought that this might shed some light on where to look for the cause of this bug.