Closed jttaggart closed 3 years ago
Thanks. Apparently the zip contains only the Python script. Would you be so kind as to include the SPSS file as well?
Ooops…
Hi, just in case: I still can't access the SPSS file ...
Hi,
I’ll upload on github.
JT
Yes, please do upload the file, I cannot reproduce the issue
I uploaded again yesterday.
I can't see it.
I just attached… .zip file
Let me know if you get it.
No, I don't see any zip file. I think attaching from email doesn't work; you have to attach from the GitHub webpage.
Doing that now…
Uploading files... pyread_multi.zip
I accidentally closed!
No prob. Now I can see the zip. Thanks!
Thank God!!!
hi
I cannot replicate the issue on Ubuntu. Meaning, this runs fine:
import pandas as pd
import pyreadstat

#savFile = "pyread_multi.sav"  # this is also fine
savFile = "pyread_multi_1.sav"

df, meta = pyreadstat.read_sav(savFile, user_missing=False)
print("normal is good!")

df2, meta2 = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav,
                                                  savFile,
                                                  num_processes=1,
                                                  user_missing=False)
print("multi is good")
I'll try on windows.
Please copy and paste here the full stack trace you get, and please make sure you are using the latest version of pyreadstat (1.0.8).
OK, I do see an error on windows:
RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce a Windows executable.
The way to solve this is explained in the README in the "Notes for windows" section: basically you have to wrap your code in an if __name__ == "__main__": block:
import pandas as pd
import pyreadstat

if __name__ == "__main__":
    #savFile = "pyread_multi.sav"
    savFile = "pyread_multi_1.sav"

    df, meta = pyreadstat.read_sav(savFile, user_missing=False)
    print("normal is good!")

    df2, meta2 = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav,
                                                      savFile,
                                                      num_processes=1,
                                                      user_missing=False)
    print("multi is good")
also see here
So that should solve the issue. Is it good?
I am a little bit intrigued by your saying that it fails if the first record is a date. Does that mean it works for you if the type is not a date? If yes, can you provide an example file?
Hi Otto,
Yes, I'm using 1.0.8.
I’ll send you my complete application (there are a lot of imports).
I used multiprocessing in order to show a "spinner" to stop users from bashing away at the keyboard while the file is being read.
SPSSedit_v3.0.py
When application opens click on Open button and select spss.sav
Yes, I’m running on Windows 10 64 bit.
pyreadstat.read_file_multiprocessing()
is on Line 1596
What is very strange is that it is now reading both files I sent you, which are subsets of the attached spss.sav (_1 has the date as the 1st variable). It's definitely failing on spss.sav. I tried several times to be absolutely sure. I used to work in support myself many decades ago.
I’ll let you know when I have uploaded to github.
Files uploaded. pyreadstat_multi.zip
Regards,
JT
I don't see any file.
I still don't understand what your error is (since you have not copy-pasted it in here). If your error message is what I showed before, then it's a general Python limitation on Windows and you have to solve it as I indicated or find some workaround, as it is not a pyreadstat issue but goes far beyond this library.
If it is something else, please copy and paste the full stack trace and submit a reproducible example. Execute your script in the command line; don't use any IDE or double-click on it.
Regarding the example, please make it minimal ... I won't debug a full application, only code regarding pyreadstat.
And if possible, stop replying by email; it generates a very verbose thread that is difficult to read, and in addition your files are not attached. Write directly on the webpage.
I have used the attached file spss.sav. With your script it is failing.
pyread_multi.zip
normal is good
Traceback (most recent call last):
File "C:/Users/johntaggart/Documents/PycharmProjects/DataEditor/pyread_multi.py", line 12, in
Process finished with exit code 1
Thanks a lot for the error message and the example; now it is very clear to me what is happening:
In order to do the multiprocessing I have to divide the data into chunks, and to accomplish that I first need to know how many rows the data has. For that purpose I read the file's metadata (which is in the header of the file):
import pyreadstat
df, meta = pyreadstat.read_sav("spss.sav", metadata_only=True)
print(meta.number_rows)
if meta.number_rows is None:
print("Number of rows could not be recovered!")
In the case of this particular file the number of rows comes back as None, which means the underlying C library was not able to extract the information, or the information is missing from the header of the file. That's quite surprising, because it is the first time I have seen that happen for a SAV file (it is the rule for SPSS POR files or SAS XPORT files).
What is the source of this file? Was it produced by SPSS or by some other tool? Is it very old, or very recent?
I am afraid that at the moment this cannot be fixed. The only thing to do would be to determine whether the information about the number of rows is actually there or not. If it is not, there is really nothing to do. If it is there, it means the C library was not able to read it, and we can submit a ticket to the C library maintainers to see if they can fix it in the future.
Regarding your application, I would suggest you either try the multiprocessing and, if it fails, fall back to the normal way; or check the number of rows as I showed before: if the number of rows can be determined (it is not None) you can proceed with multiprocessing, otherwise you have to use the slow way.
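The try/except fallback could be sketched roughly like this. The reader functions below are stand-ins so the sketch runs without pyreadstat or a SAV file; in the real application "fast" would call pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, path, ...) and "slow" would be pyreadstat.read_sav:

```python
def read_with_fallback(path, fast, slow):
    # Try the multiprocessing reader first; if it raises (for example
    # because the row count is missing from the file header), fall
    # back to the plain single-process reader.
    try:
        return fast(path)
    except Exception:
        return slow(path)

# Stand-in readers so the sketch is self-contained. The "fast" one
# fails the way read_file_multiprocessing does on the problematic
# file; the "slow" one stands in for pyreadstat.read_sav.
def fast_reader(path):
    raise RuntimeError("number of rows could not be determined")

def slow_reader(path):
    return "df", "meta"

df, meta = read_with_fallback("spss.sav", fast_reader, slow_reader)
print(df, meta)  # df meta
```

The same structure works with the number_rows check instead of the exception: read the metadata first and pick the reader based on whether meta.number_rows is None.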
I'm really not an expert in the binary format of SAV files, but taking a look into the data: if I write the data with pyreadstat, in the resulting SAV file I can clearly see a 450 for the number of rows (or a 450 in a place that is probably referring to the number of rows; I am not 100% sure I am looking at the right place), while in the original file I see a -1 in the same place, meaning the number of rows could not be determined at the moment the file was written.
If my analysis is correct, then the number of rows cannot be read from the header, and therefore multiprocessing will not work on this file.
It was generated by an online survey application. I will let you know when I find out which one.
In the header of the file this can be read, which I assume is the generating application:
SPSS DATA FILE Java Writer (c) Qualtrics 0.1.0
While taking another file generated by SPSS I can read:
IBM SPSS Statistics 64-bit MS Windows 25.0.0.0
and in this other file I can also see the correct number of rows in the place I am suspecting.
So my conclusion is that the application that generated this SPSS file did not record the number of rows in the header of the file and therefore you cannot use it with multiprocessing.
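For reference, the two header fields discussed above can be inspected directly: in the documented SPSS system-file layout, a 60-byte writer string starts at byte 4 and an int32 case count sits at byte 80, with -1 meaning "unknown". A minimal sketch, assuming a little-endian file; the synthetic header bytes below are fabricated for illustration, not read from the actual attachment:

```python
import struct

def sav_header_info(raw: bytes):
    """Pull the writer string and case count out of a SAV file's
    fixed-size header (little-endian layout assumed).

    Offsets: bytes 0-3 record type ("$FL2"), bytes 4-63 the
    product/writer string, int32 ncases at byte 80 (-1 when the
    writer did not record the number of rows).
    """
    prod_name = raw[4:64].decode("latin-1").strip()
    ncases = struct.unpack_from("<i", raw, 80)[0]
    return prod_name, ncases

# Synthetic header mimicking the problematic file: the writer string
# names Qualtrics and ncases is left as -1 (row count not recorded).
header = bytearray(176)
header[0:4] = b"$FL2"
header[4:64] = b"SPSS DATA FILE Java Writer (c) Qualtrics 0.1.0".ljust(60)
struct.pack_into("<i", header, 80, -1)

prod, n = sav_header_info(bytes(header))
print(prod)  # SPSS DATA FILE Java Writer (c) Qualtrics 0.1.0
print(n)     # -1 -> multiprocessing cannot chunk this file
```

Running sav_header_info on the first 176 bytes of a real file written by SPSS itself should instead show the actual row count (the 450 mentioned above) in that position.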
I thought it might be Qualtrics! Based on what you just told me, I did a Save As from SPSS and it worked! I'll get in touch with Qualtrics support.
Great!
Thank you so much for your help.
I will let you know what, if anything, I hear from Qualtrics.
In the meantime, I will see if I have any other .sav files from them.
Thanks again.
Ok, I will close this because there is nothing I can do to fix the issue, but feel free to add more comments if you get more information.
Will do.
I implemented your try/except suggestion. Works perfectly. I have read lots of Qualtrics files with no problems.
yuhuuu!
Please read the README, particularly the known limitations section!
If the first record is in SPSS date format, pyreadstat.read_file_multiprocessing() fails.
To Reproduce: see attached files.
File example: see attached files.
Setup Information: Platform: Windows 10 64-bit. Python version: 3.7. Using venv. pyread_multi.zip