WizardMac / ReadStat

Command-line tool (+ C library) for converting SAS, Stata, and SPSS files 💾
MIT License

Error 5 on certain sas7bdat files #20

Closed: quantbabies closed this issue 9 years ago

quantbabies commented 9 years ago

The library gives an error 5 for certain (very large) sas7bdat files. I haven't been able to pin down the problem myself. Files can be read with https://pypi.python.org/pypi/sas7bdat

Any tips on the kinds of issues I should be thinking about?

evanmiller commented 9 years ago

Are the files compressed? Currently we don't support BINARY compression.

quantbabies commented 9 years ago

Here's the header as reported by the Python sas7bdat package:

col_count_p1: 10
col_count_p2: 0
column_count: 10
compression: None
creator: None
creator_proc: CONNECT
date_created: 2009-10-10 13:16:16.041934
date_modified: 2009-10-10 13:16:16.041934
endianess: little
file_type: DATA
filename: somedata.sas7bdat
header_length: 8192
lcp: 7
lcs: 0
mix_page_row_count: 87
name: somedata.sas7bdat
os_name: x86_64
os_type: 2.6.18-92.1.22.e
page_count: 2987773
page_length: 8192
platform: unix
row_count: 433226881
row_length: 56
sas_release: 9.0201M0
server_type: Linux
u64: True

evanmiller commented 9 years ago

Do you have a (smaller) file you can send me that demonstrates the problem? Since it's a general parse error I have no idea what the problem might be. (Use readstat_error_message to convert the error code to a string.)

quantbabies commented 9 years ago

I wish I could share. If you are interested, I have dug in a bit. I apologize for the lack of exact line numbers; I've made a few changes to the code.

One file reads fine until near the end of the file, at which point it sets retval = READSTAT_ERROR_PARSE just after the declaration of sas_parse_catalog_page, near line 700 of readstat_sas.c.

The second file doesn't read at all. It sets retval = READSTAT_ERROR_PARSE near line 860 of readstat_sas.c:

    if (len > 0 && compression != SAS_COMPRESSION_TRUNC) {
        if (offset > page_size || offset + len > page_size ||
                offset < off+24+subheader_count*lshp) {
            retval = READSTAT_ERROR_PARSE;
            printf("Foobar 19");
            /*
            goto cleanup;
            */

At this point compression = 64, but the file is not compressed (at least according to the python package, which does not call either of its decompression algorithms when reading the file).

Of course I get a seg fault when I try to let it run through here.
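For readers following along, the range check quoted above can be sketched as a standalone function. This is only an illustration under assumptions, not ReadStat's code: the function name `subheader_in_bounds` and the parameter `table_end` (standing in for the value of `off+24+subheader_count*lshp`) are hypothetical, and every quantity is widened to 64 bits so the comparison itself cannot wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical standalone version of the range check from the snippet.
 * Returns true when the subheader [offset, offset + len) lies inside the
 * page and does not start before table_end, where table_end stands for
 * the value of off+24+subheader_count*lshp in the original code. */
bool subheader_in_bounds(uint64_t offset, uint64_t len,
                         uint64_t page_size, uint64_t table_end) {
    if (offset > page_size)
        return false;               /* starts past the end of the page */
    if (len > page_size - offset)
        return false;               /* same as offset + len > page_size,
                                       written without an addition that
                                       could wrap */
    if (offset < table_end)
        return false;               /* starts before table_end */
    return true;
}
```

Skipping the `goto cleanup` lets the reader continue with an offset that failed this check, so a later read at that offset can land outside the page buffer, which would be consistent with the seg fault.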

I'm wondering if one of the 16-bit integers is wrapping around somewhere.

evanmiller commented 9 years ago

I received similar reports from other users and was able to track down the issue. An integer overflow was producing a negative offset in the file, which caused things to go haywire at around the 2GB mark. This commit ought to fix it:

https://github.com/WizardMac/ReadStat/commit/1bae419a277c5f81aa84427e68a17b406a62d263
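To illustrate the class of bug described here (this is not the code from the linked commit), the sketch below computes a page's byte offset two ways: as a 32-bit signed variable would end up holding it, and in 64 bits. The function names are hypothetical. The 8192-byte header and page lengths are taken from the header dump earlier in the thread. The 32-bit result is derived through unsigned arithmetic so the example does not rely on undefined signed overflow.

```c
#include <stdint.h>

/* Offset of a page as a 32-bit signed variable would hold it after
 * wrapping. The sum is computed in 64 bits, truncated modulo 2^32
 * (well-defined for unsigned types), then reinterpreted as a
 * two's-complement value. */
int64_t page_offset_wrapped32(uint32_t header_length, uint32_t page_size,
                              uint32_t page_index) {
    uint32_t u = (uint32_t)((uint64_t)header_length +
                            (uint64_t)page_index * page_size);
    return u > (uint32_t)INT32_MAX ? (int64_t)u - 4294967296LL
                                   : (int64_t)u;
}

/* The same offset computed entirely in 64 bits. */
int64_t page_offset_64(uint32_t header_length, uint32_t page_size,
                       uint32_t page_index) {
    return (int64_t)header_length + (int64_t)page_index * (int64_t)page_size;
}
```

With 8192-byte pages after an 8192-byte header, the two results agree up to page index 262142. At index 262143 the true offset is 2147483648 (exactly 2 GB) while the wrapped value is -2147483648, which matches the "haywire at around the 2GB mark" symptom. The file in this thread has 2987773 pages, so nearly all of it lies beyond that point.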

evanmiller commented 9 years ago

Are you using the latest code? That code snippet looks out of date. I've made additional fixes in the last month which may resolve the issue for you.