Roche / pyreadstat

Python package to read sas, spss and stata files into pandas data frames. It is a wrapper for the C library readstat.

reading .sav file with pyreadstat.read_file_multiprocessing() fails if first record is SPSS Date Type #109

Closed jttaggart closed 3 years ago

jttaggart commented 3 years ago

Please read the README, particularly the known limitations section!

If the first record is in SPSS Date format, pyreadstat.read_file_multiprocessing() fails.

To Reproduce See attached files.

File example See attached files.

Setup Information:
Platform: Windows 10 64 bit
Python Version: 3.7
Using venv

pyread_multi.zip

ofajardo commented 3 years ago

Thanks. Apparently the zip contains only the Python script. Would you be so kind as to include the SPSS file as well?

jttaggart commented 3 years ago

Ooops…

ofajardo commented 3 years ago

hi, just in case I still can't access the spss file ...

jttaggart commented 3 years ago

Hi,

I’ll upload on github.

JT

ofajardo commented 3 years ago

Yes, please do upload the file, I cannot reproduce the issue

jttaggart commented 3 years ago

I uploaded again yesterday.

ofajardo commented 3 years ago

I can't see it.

jttaggart commented 3 years ago

I just attached… .zip file

Let me know if you get it.

ofajardo commented 3 years ago

No, I don't see any zip file. I think attaching from email doesn't work; you have to attach from the GitHub webpage.

jttaggart commented 3 years ago

Doing that now…

jttaggart commented 3 years ago

Uploading files... pyread_multi.zip

jttaggart commented 3 years ago

I accidentally closed!

ofajardo commented 3 years ago

No prob. Now I can see the zip. Thanks!

jttaggart commented 3 years ago

Thank God!!!

ofajardo commented 3 years ago

hi

I cannot replicate the issue on Ubuntu. That is, this runs fine:

import pandas as pd
import pyreadstat

#savFile = "pyread_multi.sav" # this is also fine
savFile = "pyread_multi_1.sav"

df, meta = pyreadstat.read_sav(savFile, user_missing=False)
print("normal is good!")
df2, meta2 = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav,
                                                  savFile,
                                                  num_processes=1,
                                                  user_missing=False)

print("multi is good")

I'll try on windows.

Please copy and paste here the full stacktrace you get, and please make sure you are using the latest version of pyreadstat (1.0.8)

ofajardo commented 3 years ago

OK, I do see an error on windows:

RuntimeError: 
            Attempt to start a new process before the current process
            has finished its bootstrapping phase.
            This probably means that you are on Windows and you have
            forgotten to use the proper idiom in the main module:
                if __name__ == '__main__':
                    freeze_support()
                    ...
            The "freeze_support()" line can be omitted if the program
            is not going to be frozen to produce a Windows executable.

The way to solve this is explained in the README, in the "Notes for windows" section: basically, you have to wrap your code in an if __name__ == "__main__": block:

import pandas as pd
import pyreadstat

if __name__ == "__main__":
    #savFile = "pyread_multi.sav"
    savFile = "pyread_multi_1.sav"

    df, meta = pyreadstat.read_sav(savFile, user_missing=False)
    print("normal is good!")
    df2, meta2 = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav,
                                                      savFile,
                                                      num_processes=1,
                                                      user_missing=False)

    print("multi is good")

also see here

So that should solve the issue. Is it good?

I am a little bit intrigued by you saying that it fails if the first record is a date. Does that mean that it works for you if the type is not a date? If yes can you provide an example file?

jttaggart commented 3 years ago

Hi Otto,

Yes, I’m using 1.0.8

I’ll send you my complete application (there are a lot of imports).

I used multiprocessing in order to have “spinner” to stop users bashing away at the keyboard when reading file.

SPSSedit_v3.0.py

When application opens click on Open button and select spss.sav

Yes, I’m running on Windows 10 64 bit.

pyreadstat.read_file_multiprocessing()

is on Line 1596

What is very strange is that both files I sent you, which are subsets of the attached spss.sav (_1 has a date as the 1st variable), now read fine. It’s definitely failing on spss.sav. I tried several times to be absolutely sure. I used to work in support myself many decades ago.

I’ll let you know when I have uploaded to github.

Files uploaded. pyreadstat_multi.zip

Regards,

JT

ofajardo commented 3 years ago

I don't see any file.

I still don't understand what your error is (since you have not copy-pasted it here). If your error message is what I showed before, then it's a general Python limitation on Windows and you have to solve it as I indicated or find some workaround, as it is not a pyreadstat issue but goes well beyond this library.

If it is something else, please copy and paste the full stacktrace and submit a reproducible example. Execute your script in the command line; don't use any IDE or double-click on it.

Regarding the example, please make it minimal ... I won't debug a full application but only code regarding pyreadstat.

And if possible stop replying from email, it generates a very verbose thread that is difficult to read, and in addition your files are not attached. Write directly into the webpage.

jttaggart commented 3 years ago

I have used the attached file spss.sav. With your script it is failing. pyread_multi.zip

normal is good
Traceback (most recent call last):
  File "C:/Users/johntaggart/Documents/PycharmProjects/DataEditor/pyread_multi.py", line 12, in <module>
    user_missing=False)
  File "pyreadstat\pyreadstat.pyx", line 685, in pyreadstat.pyreadstat.read_file_multiprocessing
Exception: The number of rows of the file cannot be determined

Process finished with exit code 1

ofajardo commented 3 years ago

thanks a lot for the error message and the example, now it is very clear to me what is happening:

In order to do the multiprocessing I have to divide the data into chunks, and to accomplish that I first need to know how many rows the data has. For that purpose I read the file's metadata (which is in the header of the file):

import pyreadstat

df, meta = pyreadstat.read_sav("spss.sav", metadata_only=True)
print(meta.number_rows)
if meta.number_rows is None:
    print("Number of rows could not be recovered!")

In the case of this particular file the number of rows comes back as None, which means the underlying C library was not able to extract the information, or the information is missing from the header of the file. That's quite surprising because it is the first time I have seen that happen for a SAV file (it is the rule for SPSS POR files or SAS XPORT files).

What is the source of this file? Was it produced by SPSS or by some other tool? Is it very old, or very recent?

I am afraid that at the moment this cannot be fixed. The only thing left would be to determine whether the information about the number of rows is actually there or not. If it is not, there is really nothing to do. If it is there, it means that the C library was not able to read it, and we can submit a ticket to the C library maintainers to see if they can fix it in the future.

Regarding your application, I would suggest one of two things: either try the multiprocessing and, if it fails, fall back to the normal way; or check the number of rows as I showed before and, if it can be determined (it is not None), proceed to multiprocessing, otherwise use the slow way.
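The second suggestion (check the row count first, then pick a reader) could be sketched roughly as below. This is only an illustration, not part of pyreadstat: the helper name read_sav_fast_or_slow and its backend parameter are hypothetical, and backend exists only so the helper can be exercised with a stand-in object when pyreadstat is not installed.

```python
def read_sav_fast_or_slow(path, num_processes=4, backend=None, **kwargs):
    """Use multiprocessing only when the SAV header reports a row count.

    `backend` is a hypothetical hook for testing; by default the real
    pyreadstat module is used.
    """
    if backend is None:
        import pyreadstat as backend  # lazy import, see note above

    # Reading only the metadata is cheap and tells us whether the row
    # count is available in the file header.
    _, meta = backend.read_sav(path, metadata_only=True)

    if meta.number_rows is None:
        # No row count: the file cannot be split into chunks, so fall
        # back to the ordinary single-process reader.
        return backend.read_sav(path, **kwargs)

    return backend.read_file_multiprocessing(
        backend.read_sav, path, num_processes=num_processes, **kwargs
    )
```

On Windows the call itself still has to sit under if __name__ == "__main__":, for example df, meta = read_sav_fast_or_slow("spss.sav", user_missing=False).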

ofajardo commented 3 years ago

I'm really not an expert in SAV binary file formats, but taking a look into the data: if I write the data with pyreadstat, in the resulting SAV file I can clearly see a 450 for the number of rows (or a 450 in a place that probably refers to the number of rows; I am not 100% sure I am looking at the right place), while in the original file I see a -1 in the same place, meaning the number of rows could not be determined at the moment of writing the file.

If my analysis is correct, then it means the number of rows cannot be read from the header, and therefore multiprocessing will not work on this file.

jttaggart commented 3 years ago

It was generated by an online survey application. I will let you know when I find out which one.

ofajardo commented 3 years ago

In the header of the file this can be read, which I assume is the generating application:

SPSS DATA FILE Java Writer (c) Qualtrics 0.1.0

While taking another file generated by SPSS I can read:

IBM SPSS Statistics 64-bit MS Windows 25.0.0.0

and in this other file I can also see the correct number of rows in the place I am suspecting.

So my conclusion is that the application that generated this SPSS file did not record the number of rows in the header of the file and therefore you cannot use it with multiprocessing.
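For anyone who wants to repeat this check without a hex editor, here is a minimal sketch that reads the product name and the case count straight from the start of a .sav file. The byte offsets (product name at bytes 4-63, layout code at byte 64, case count at byte 80, with -1 meaning unknown) are taken from the PSPP description of the system file format, not from anything confirmed in this thread, so treat them as an assumption. The function name read_sav_header is made up for the example.

```python
import struct


def read_sav_header(raw):
    """Return (product_name, ncases or None) from the first bytes of a .sav file.

    Offsets follow the PSPP system-file documentation (an assumption here).
    """
    if len(raw) < 84 or raw[:4] not in (b"$FL2", b"$FL3"):
        raise ValueError("does not look like an SPSS system file header")

    # 60-byte, space-padded product string right after the magic bytes.
    product = raw[4:64].decode("ascii", errors="replace").strip()

    # The layout code is 2 or 3 when read with the correct byte order,
    # which is how the endianness of the file is detected.
    for endian in ("<", ">"):
        (layout_code,) = struct.unpack_from(endian + "i", raw, 64)
        if layout_code in (2, 3):
            break
    else:
        raise ValueError("cannot determine byte order")

    # Number of cases; writers that do not know it store -1.
    (ncases,) = struct.unpack_from(endian + "i", raw, 80)
    return product, (None if ncases == -1 else ncases)
```

Usage would be something like: with open("spss.sav", "rb") as f: print(read_sav_header(f.read(176))). A None in the second position corresponds to the -1 described above.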

jttaggart commented 3 years ago

I thought it might be Qualtrics! Based on what you just told me, I did a Save As from SPSS and it worked! I'll get in touch with Qualtrics support.

ofajardo commented 3 years ago

Great!

jttaggart commented 3 years ago

Thank you so much for your help.

I will let you know what, if anything, I hear from Qualtrics.

In the meantime, I will see if I have any other .sav files from them.

Thanks again.

ofajardo commented 3 years ago

Ok, I will close this because there is nothing I can do to fix the issue, but feel free to add more comments if you get more information.

jttaggart commented 3 years ago

Will do.

jttaggart commented 3 years ago

I implemented your try/except suggestion. Works perfectly. I have read lots of Qualtrics files with no problems.
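For readers landing here later, a try/except fallback along these lines might look like the sketch below. It is not the code from the application discussed in this thread. The helper name and the backend parameter are hypothetical (backend only allows testing with a stand-in for pyreadstat), and matching on the message text is an assumption based on the traceback above, where pyreadstat raised a plain Exception.

```python
def read_sav_try_multiprocessing(path, num_processes=4, backend=None, **kwargs):
    """Try the multiprocessing reader first; fall back to read_sav if the
    row count cannot be determined. `backend` is a hypothetical test hook."""
    if backend is None:
        import pyreadstat as backend  # lazy import of the real library

    try:
        return backend.read_file_multiprocessing(
            backend.read_sav, path, num_processes=num_processes, **kwargs
        )
    except Exception as exc:
        # Only swallow the "number of rows ... cannot be determined" case;
        # the message check is brittle and may need adjusting per version.
        if "number of rows" not in str(exc):
            raise
        return backend.read_sav(path, **kwargs)
```

Compared with checking meta.number_rows up front, this avoids a second pass over the header, at the cost of depending on the wording of the error message.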

ofajardo commented 3 years ago

yuhuuu!