Closed ShridharSahu closed 4 years ago
The transfer has already expired, so the file cannot be downloaded. Please upload it again; preferably zip it and drag and drop it in here directly.
Attached is the file to reproduce the issue
There are several encodings that allow opening the file, for example:
import pyreadstat
df, meta = pyreadstat.read_sav("Test.sav", encoding="UTF8")
This one particularly gives: నేను గతంలో వాడిన బ.
Whether what comes out makes sense or not, I cannot tell, since I am not familiar with non-Latin alphabets. You can try the list of encodings provided here https://gist.github.com/hakre/4188459 to see if one of them makes sense to you. A few of them give something in non-Latin alphabets that may make sense.
I have to say that I also tried to open the file in SPSS itself, and it also could not make sense of it: the string becomes just a lot of question marks (????????).
Maybe you can provide feedback if you find the right encoding for this one.
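To automate the trial-and-error over candidate encodings, one could sketch a small helper. This is hypothetical code, not part of pyreadstat; it is shown here at the byte level with `bytes.decode`, but the same loop works with `pyreadstat.read_sav("Test.sav", encoding=enc)` wrapped in a try/except.

```python
def try_encodings(raw, candidates):
    """Return {encoding: decoded text} for each candidate that decodes cleanly."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # the encoding rejected these bytes, or is unknown to Python
            continue
    return results

# the Telugu phrase from this thread, as UTF-8 bytes
raw = "నేను గతంలో వాడిన బ".encode("utf-8")
hits = try_encodings(raw, ["ascii", "utf-8", "latin-1", "koi8_u"])
```

A human still has to inspect which of the successful candidates look meaningful; most single-byte encodings (latin-1, KOI8-U, ...) will decode any byte sequence without error, just into mojibake.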
Tried using your suggested method (the one you suggested on my other Stack Overflow thread) and ran the text through all encoding formats. Below is the output I got.
JAVA->à°¨à±à°¨à± à°à°¤à°à°²à± వాడిన బౠ
KOI8-U->Ю╟╗Ю╠┤Ю╟╗Ю╠│ Ю╟≈Ю╟єЮ╟┌Ю╟╡Ю╠▀ Ю╟╣Ю╟╬Ю╟║Ю╟©Ю╟╗ Ю╟╛Ю╠
KOI8-RU->Ю╟╗Ю╠┤Ю╟╗Ю╠│ Ю╟≈Ю╟єЮ╟┌Ю╟╡Ю╠▀ Ю╟╣Ю╟ЎЮ╟║Ю╟©Ю╟╗ Ю╟╛Ю╠
MACCENTRALEUROPE->ŗį®ŗĪáŗį®ŗĪĀ ŗįóŗį§ŗįāŗį≤ŗĪč ŗįĶŗįĺŗį°ŗįŅŗį® ŗį¨ŗĪ
MACICELAND->ý∞®ý±áý∞®ý±Å ý∞óý∞§ý∞Çý∞≤ý±ã ý∞µý∞æý∞°ý∞øý∞® ý∞¨ý±
MACCROATIAN->–∞®–±á–∞®–±Å –∞ó–∞§–∞Ç–∞≤–±ã –∞µ–∞ž–∞°–∞ø–∞® –∞¨–±
MACROMANIA->‡∞®‡±á‡∞®‡±Å ‡∞ó‡∞§‡∞LJ∞≤‡±ã ‡∞µ‡∞ă‡∞°‡∞ş‡∞® ‡∞¨‡±
MACCYRILLIC->а∞®а±За∞®а±Б а∞Ча∞§а∞Ва∞≤а±Л а∞µа∞Ња∞°а∞ња∞® а∞ђа±
MACUKRAINE->а∞®а±За∞®а±Б а∞Ча∞§а∞Ва∞≤а±Л а∞µа∞Ња∞°а∞ња∞® а∞ђа±
MACGREEK->ύΑ®ύ±΅ύΑ®ύ±¹ ύΑ½ύΑΛύΑ²ύΑ≤ύ±΄ ύΑΒύΑΨύΑΓύΑΩύΑ® ύΑ§ύ±
MACTURKISH->‡∞®‡±á‡∞®‡±Å ‡∞ó‡∞§‡∞LJ∞≤‡±ã ‡∞µ‡∞æ‡∞°‡∞ø‡∞® ‡∞¨‡±
MACTHAI->เฐจเฑเฐจเฑ» เฐเฐคเฐ…เฐฒเฑ เฐตเฐพเฐกเฐฟเฐจ เฐฌเฑ
NEXTSTEP->쮤ì–Ç쮤ì–À ì®Ùì®⁄ì®Á쮆ì–Ë ì®¦ì®¬ì®¡ì®¿ì®¤ ì®‹ì–
GEORGIAN-ACADEMY->ჰ°¨ჰ±‡ჰ°¨ჰ± ჰ°—ჰ°¤ჰ°‚ჰ°²ჰ±‹ ჰ°µჰ°¾ჰ°¡ჰ°¿ჰ°¨ ჰ°¬ჰ±
GEORGIAN-PS->ჭ°¨ჭ±‡ჭ°¨ჭ± ჭ°—ჭ°¤ჭ°‚ჭ°²ჭ±‹ ჭ°µჭ°¾ჭ°¡ჭ°¿ჭ°¨ ჭ°¬ჭ±
CP922->à°¨à±à°¨à± à°à°¤à°à°²à± వాడిన బౠ
CP1046->ـ٠ﺗـ١ﹱـ٠ﺗـ١× ـ٠ﻳـ٠¤ـ٠÷ـ٠٢ـ١─ ـ٠٥ـ٠ﻊـ٠ـ٠؟ـ٠ﺗ ـ٠،ـ١
CP1124->рАЈрБрАЈрБ рАрАЄрАрАВрБ рАЕрАОрАЁрАПрАЈ рАЌрБ
CP1129->à°œà±à°œà± à°à°¤à°à°²à± వాడిజ బౠ
CP737->ω░ρω▒Θω░ρω▒Β ω░Ωω░νω░Γω░▓ω▒Μ ω░╡ω░╛ω░κω░┐ω░ρ ω░υω▒
CP853->Ó░ĤÓ▒çÓ░ĤÓ▒ü Ó░ùÓ░ñÓ░éÓ░▓Ó▒ï Ó░ÁÓ░żÓ░íÓ░┐Ó░Ĥ Ó░ĴÓ▒
CP858->Ó░¿Ó▒çÓ░¿Ó▒ü Ó░ùÓ░ñÓ░éÓ░▓Ó▒ï Ó░ÁÓ░¥Ó░íÓ░┐Ó░¿ Ó░¼Ó▒
CP1125->р░ир▒Зр░ир▒Б р░Чр░др░Вр░▓р▒Л р░╡р░╛р░бр░┐р░и р░мр▒
RISCOS-LATIN1->à°¨à±à°¨à±Ŵ à°–à°¤à°ŵà°²à±⇧ వాడిన à°¬à±
None of them seems to produce the correct text. I copy-pasted the text from SPSS into an editor (Notepad++) and got it as "నేను గతంలో వాడిన బ�". I think the issue is with (�), which according to Wikipedia is the Unicode replacement character.
Do let me know if you are able to share any inputs on this; otherwise you can close the issue. Thanks for helping me out with this and for sharing the list of all possible encodings. It really helped me understand encodings a bit better.
For now I have implemented another failsafe: I loop through all string variables in the SPSS file to find the variables that may cause problems and exclude them using usecols.
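That failsafe could be sketched as follows. `numeric_usecols` is a hypothetical helper, not a pyreadstat function, and the metadata-only read assumes that the metadata itself decodes without error:

```python
def numeric_usecols(variable_types):
    """Given a mapping of variable name -> ReadStat type string
    (e.g. pyreadstat's meta.readstat_variable_types), return the
    names of all non-string variables."""
    return [name for name, vtype in variable_types.items() if vtype != "string"]

# hypothetical usage with pyreadstat:
#   _, meta = pyreadstat.read_sav("Test.sav", metadataonly=True)  # fast, no rows decoded
#   cols = numeric_usecols(meta.readstat_variable_types)
#   df, meta = pyreadstat.read_sav("Test.sav", usecols=cols)

example = {"age": "double", "income": "double", "comment": "string"}
numeric = numeric_usecols(example)  # ["age", "income"]
```

Reading the metadata once up front avoids the slow alternative of trapping a decode error per column.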
Did you try "UTF8"? For me that one seems to work. Interestingly, it is not in the list. The list contains "UTF-8" (with a dash), which does not work, but "UTF8" (no dash) works.
My other suggestion: even if you don't find the right encoding, you can use any encoding that does not produce an error to read the whole file very fast. Then you can exclude the wrong columns by simply inspecting which character columns contain non-Latin characters. This approach will be much faster than excluding columns by trapping errors while reading, since, as you pointed out, speed is an issue.
Did try 'UTF8', but it did not work for me. Below is the error that I got. This encoding did not work for any of my sav files.
pyreadstat._readstat_parser.ReadstatError: File has an unsupported character set
Your other suggestion makes sense, but I am mainly going to use this to read numerical data, so in case of an error with the default encoding I just go ahead and skip all string variables to avoid all the trouble with encodings.
OK, really strange. Sorry to hear it's not working. Still, if you need only the numerical data, you could read the full dataset with an encoding that does not give an error (say, in your case, JAVA), which will be very fast, and then just discard all character columns. That is also very easy: do df.dtypes to get the types, and for those with 'object' dtype, check whether the first non-null element is a string.
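As a sketch of that suggestion (assuming pandas; `drop_string_columns` is a hypothetical helper, not part of pyreadstat):

```python
import pandas as pd

def drop_string_columns(df):
    """Drop columns whose first non-null element is a str,
    i.e. character columns that were read under a wrong encoding."""
    keep = []
    for col in df.columns:
        nonnull = df[col].dropna()
        first = nonnull.iloc[0] if len(nonnull) else None
        if not isinstance(first, str):
            keep.append(col)
    return df[keep]

# toy frame standing in for a dataset read with a wrong-but-working encoding
demo = pd.DataFrame({"score": [1.0, 2.0], "note": ["Ю╟╗", "Ю╠┤"]})
numeric_only = drop_string_columns(demo)  # keeps only the "score" column
```

Checking the first non-null element rather than the dtype alone matters because object-dtype columns can also hold non-string Python objects.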
I think the chardet Python package may help you guys.
@evanmiller thanks for the suggestion!
However, I'm not sure if I got it right, I tried this:
import chardet
h = open("Test.sav", "rb")
c = h.read()
chardet.detect(c)
{'encoding': 'IBM866', 'confidence': 0.3091198887292049, 'language': 'Russian'}
Now, passing that encoding to pyreadstat (and therefore ReadStat) does not give anything good. Is that the way to use it? Or do I need to pass the bytes of only the problematic string? If so, how do I get them?
I also tried this way:
import pyreadstat
import chardet
# read the sav file with an encoding that does not fail
df, meta = pyreadstat.read_sav("Test.sav", encoding="KOI8-U")
# recover the string and put back to bytes
b = df.iloc[0,1].encode("KOI8-U")
chardet.detect(b)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
b.decode("utf-8")
# exactly the same error as seen using pyreadstat without encoding argument
# UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 48-49: unexpected end of data
# However if I leave out the last byte, it looks good:
b[0:48].decode("utf-8")
#నేను గతంలో వాడిన బ
# now I can also look for the string in the whole file, just to confirm
# assuming we still have the variable c in memory from the previous code block
c[c.find(b):c.find(b)+50].decode("utf-8") # same error
c[c.find(b):c.find(b)+48].decode("utf-8") # నేను గతంలో వాడిన బ
# increasing the end offset further (from 51 up to 60) keeps raising errors, so it seems that removing bytes cures the issue but adding them does not
So, it seems that ReadStat is giving me two extra bytes?!
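A client-side workaround for those dangling bytes could look like this. `decode_utf8_trimmed` is a hypothetical helper, not part of pyreadstat or ReadStat; it only trims a truncated sequence at the very end and re-raises any other decode error:

```python
def decode_utf8_trimmed(b):
    """Decode UTF-8 bytes, silently dropping an incomplete multi-byte
    sequence at the tail (the "two extra bytes" seen above)."""
    while b:
        try:
            return b.decode("utf-8")
        except UnicodeDecodeError as e:
            if e.reason != "unexpected end of data":
                raise  # a real decoding problem, not just truncation
            b = b[:e.start]  # cut off the incomplete trailing sequence
    return ""

s = "నేను గతంలో వాడిన బ"
mangled = s.encode("utf-8") + b"\xe0\xb1"  # two dangling bytes of a next character
recovered = decode_utf8_trimmed(mangled)  # equals s again
```

Python's `codecs` incremental decoders offer a similar effect, but this explicit version makes the trimming visible.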
Something I have observed is that this file was generated by SPSS 27; I tried reading it with SPSS 25 and that also did not show the string in clear text. Maybe something has changed in this new SPSS version that ReadStat cannot handle correctly?
It sounds like the UTF-8 stream is being interrupted. UTF-8 is a variable byte encoding – if there is a three-byte character at the end of the string, but only room for the first two bytes in the column field, then there will be an illegal byte sequence.
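That explanation is easy to verify in plain Python with a character from this thread (a sketch; the byte counts are just those of this example character):

```python
full = "బ".encode("utf-8")  # U+0C2C is a three-byte character: b'\xe0\xb0\xac'
truncated = full[:2]        # the column field had room for only two of the bytes

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as err:
    problem = err.reason    # 'unexpected end of data', as in the traceback above

# decoding with errors="replace" turns the stub into U+FFFD,
# the replacement character seen in Notepad++
shown = truncated.decode("utf-8", errors="replace")
```

So a strict decoder raises exactly the error reported in this issue, while a lenient one produces the "�" the reporter saw.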
I think I know what is happening. This isn't an issue with non-UTF-8 encoding, because ReadStat will ignore the last character when it receives EINVAL from iconv. However, when both the input and output encodings are UTF-8, ReadStat skips the iconv conversion (because it thinks no conversion is necessary) and passes the (potentially illegal) byte sequence directly to the client.
So a fix would be to force all character conversions through iconv. This might slow things down slightly with UTF-8 files, but will prevent issues like this in the future.
@ofajardo Try this commit
https://github.com/WizardMac/ReadStat/commit/a8b04663ad399159b8ac710ed629295a40290c65
Sorry, that commit is broken, let me try again
The underlying issue is similar to https://github.com/WizardMac/ReadStat/issues/206 - in fact, the test introduced by the fixing commit is now failing.
@ofajardo Please try the latest commits in the dev branch.
@evanmiller Thanks! Yes, that solves the issue! The Hebrew file is also good now.
@ShridharSahu I have released pyreadstat version 1.0.1, which has the fix for your issue. The new version is available on PyPI and will be in conda in a few hours. Please give it a try and let us know if everything is working now.
@ofajardo and @evanmiller - Thanks for working on this. This is now working.
I am not able to load a sav file when I read it using pyreadstat. Can you please look into this and let me know what is happening?
Below is the code that I am using
It gives the below error: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 48-49: unexpected end of data
Some additional info:
File to reproduce the issue - https://we.tl/t-hYNiVZ7ttC
How did you install pyreadstat? - pip
pyreadstat version - 1.0.0
Platform - Windows 10
Python Version - 3.8
Python Distribution - Plain python
I have 3 questions here. These may seem very amateur, but please do answer them if time permits.