Roche / pyreadstat

Python package to read SAS, SPSS and Stata files into pandas data frames. It is a wrapper for the C library ReadStat.

UnicodeDecodeError: Not able to read sav file #67

Closed ShridharSahu closed 4 years ago

ShridharSahu commented 4 years ago

I am not able to load a sav file when I read it using pyreadstat. Can you please look into this and let me know what is happening.

Below is the code that I am using

import pyreadstat
df, meta = pyreadstat.read_sav('Test.sav')

It gives the below error:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 48-49: unexpected end of data

Some additional info:

- File to reproduce issue: https://we.tl/t-hYNiVZ7ttC
- How did you install pyreadstat? pip
- pyreadstat version: 1.0.0
- Platform: Windows 10
- Python Version: 3.8
- Python Distribution: Plain python

I have 3 questions here. These may seem very amateur, but please do answer them if time permits.

  1. Is this caused by a different Unicode encoding? Can it be handled directly by this module? I know there are options to change the encoding, but I am not sure whether that applies to particular columns or only to entire datasets.
  2. Is there a quick way to identify the Unicode encoding of a text?
  3. Currently I use a list with the usecols option to filter out these bad variables so that I at least get the rest of the data. Is there a better way to do this than my approach?
ofajardo commented 4 years ago

The transfer has already expired, so the file cannot be downloaded. Please upload it again; preferably zip it and drag and drop it in here directly.

ShridharSahu commented 4 years ago

Test.zip

Attached is the file to reproduce the issue

ofajardo commented 4 years ago

There are several encodings that allow opening the file, for example:

import pyreadstat
df, meta = pyreadstat.read_sav("Test.sav", encoding="UTF8")

This one in particular gives: నేను గతంలో వాడిన బ.

Whether what comes out makes sense or not, I have no idea, since I am not familiar with non-Latin alphabets. You can try the list of encodings provided here https://gist.github.com/hakre/4188459 to see if one of them makes sense to you. There are a few that give something in non-Latin alphabets that may make sense.
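A brute-force loop over candidate encodings could be sketched like this. Note this is an illustration using Python's built-in codecs on raw bytes (the byte string below is a hypothetical reconstruction: the Telugu text encoded as UTF-8 with a truncated trailing character); with pyreadstat you would instead pass each candidate name to the encoding= argument, and the names go through iconv, so the spellings may differ:

```python
# Hypothetical reconstruction of the problematic value: valid UTF-8 text
# followed by the first two bytes of a three-byte character.
raw = "నేను గతంలో వాడిన బ".encode("utf-8") + b"\xe0\xb0"

# Try a handful of candidate encodings and show what each one produces.
candidates = ["utf-8", "koi8-u", "cp1251", "mac-cyrillic", "latin-1"]
for enc in candidates:
    try:
        print(f"{enc}: {raw.decode(enc)!r}")
    except UnicodeDecodeError as err:
        print(f"{enc}: failed ({err.reason})")
```

Only strict UTF-8 fails here; single-byte encodings always "succeed" but produce mojibake, which is why the output has to be judged by eye.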

I have to say that I also tried to open the file in SPSS itself, and it could not make sense of it either: the string becomes just a lot of question marks (????????).

Maybe you can provide feedback if you find the right encoding for this one.

ShridharSahu commented 4 years ago

I tried your suggested method (the one from my other Stack Overflow thread) and ran the text through every encoding format. Below is the output I got.

JAVA->నేను గతంలో వాడిన బౠ
KOI8-U->Ю╟╗Ю╠┤Ю╟╗Ю╠│ Ю╟≈Ю╟єЮ╟┌Ю╟╡Ю╠▀ Ю╟╣Ю╟╬Ю╟║Ю╟©Ю╟╗ Ю╟╛Ю╠
KOI8-RU->Ю╟╗Ю╠┤Ю╟╗Ю╠│ Ю╟≈Ю╟єЮ╟┌Ю╟╡Ю╠▀ Ю╟╣Ю╟ЎЮ╟║Ю╟©Ю╟╗ Ю╟╛Ю╠
MACCENTRALEUROPE->ŗį®ŗĪáŗį®ŗĪĀ ŗįóŗį§ŗįāŗį≤ŗĪč ŗįĶŗįĺŗį°ŗįŅŗį® ŗį¨ŗĪ
MACICELAND->ý∞®ý±áý∞®ý±Å ý∞óý∞§ý∞Çý∞≤ý±ã ý∞µý∞æý∞°ý∞øý∞® ý∞¨ý±
MACCROATIAN->–∞®–±á–∞®–±Å –∞ó–∞§–∞Ç–∞≤–±ã –∞µ–∞ž–∞°–∞ø–∞® –∞¨–±
MACROMANIA->‡∞®‡±á‡∞®‡±Å ‡∞ó‡∞§‡∞LJ∞≤‡±ã ‡∞µ‡∞ă‡∞°‡∞ş‡∞® ‡∞¨‡±
MACCYRILLIC->а∞®а±За∞®а±Б а∞Ча∞§а∞Ва∞≤а±Л а∞µа∞Ња∞°а∞ња∞® а∞ђа±
MACUKRAINE->а∞®а±За∞®а±Б а∞Ча∞§а∞Ва∞≤а±Л а∞µа∞Ња∞°а∞ња∞® а∞ђа±
MACGREEK->ύΑ®ύ±΅ύΑ®ύ±¹ ύΑ½ύΑΛύΑ²ύΑ≤ύ±΄ ύΑΒύΑΨύΑΓύΑΩύΑ® ύΑ§ύ±
MACTURKISH->‡∞®‡±á‡∞®‡±Å ‡∞ó‡∞§‡∞LJ∞≤‡±ã ‡∞µ‡∞æ‡∞°‡∞ø‡∞® ‡∞¨‡±
MACTHAI->เฐจเฑเฐจเฑ» เฐเฐคเฐ…เฐฒเฑ เฐตเฐพเฐกเฐฟเฐจ เฐฌเฑ
NEXTSTEP->쮤ì–Ç쮤ì–À ì®Ùì®⁄ì®Á쮆ì–Ë ì®¦ì®¬ì®¡ì®¿ì®¤ ì®‹ì–
GEORGIAN-ACADEMY->ჰ°¨ჰ±‡ჰ°¨ჰ± ჰ°—ჰ°¤ჰ°‚ჰ°²ჰ±‹ ჰ°µჰ°¾ჰ°¡ჰ°¿ჰ°¨ ჰ°¬ჰ±
GEORGIAN-PS->ჭ°¨ჭ±‡ჭ°¨ჭ± ჭ°—ჭ°¤ჭ°‚ჭ°²ჭ±‹ ჭ°µჭ°¾ჭ°¡ჭ°¿ჭ°¨ ჭ°¬ჭ±
CP922->నేను గతంలో వాడిన బౠ
CP1046->ـ٠ﺗـ١ﹱـ٠ﺗـ١× ـ٠ﻳـ٠¤ـ٠÷ـ٠٢ـ١─ ـ٠٥ـ٠ﻊـ٠ـ٠؟ـ٠ﺗ ـ٠،ـ١
CP1124->рАЈрБ‡рАЈрБ рА—рАЄрА‚рАВрБ‹ рАЕрАОрАЁрАПрАЈ рАЌрБ
CP1129->జేజు గతంలో వాడిజ బౠ
CP737->ω░ρω▒Θω░ρω▒Β ω░Ωω░νω░Γω░▓ω▒Μ ω░╡ω░╛ω░κω░┐ω░ρ ω░υω▒
CP853->Ó░ĤÓ▒çÓ░ĤÓ▒ü Ó░ùÓ░ñÓ░éÓ░▓Ó▒ï Ó░ÁÓ░żÓ░íÓ░┐Ó░Ĥ Ó░ĴÓ▒
CP858->Ó░¿Ó▒çÓ░¿Ó▒ü Ó░ùÓ░ñÓ░éÓ░▓Ó▒ï Ó░ÁÓ░¥Ó░íÓ░┐Ó░¿ Ó░¼Ó▒
CP1125->р░ир▒Зр░ир▒Б р░Чр░др░Вр░▓р▒Л р░╡р░╛р░бр░┐р░и р░мр▒
RISCOS-LATIN1->నేనà±Ŵ à°–à°¤à°ŵà°²à±⇧ వాడిన à°¬à±

None of them seems to produce the correct text. I copy-pasted the text from SPSS into an editor (Notepad++) and got "నేను గతంలో వాడిన బ�". I think the issue is with (�), which according to Wikipedia is called the replacement character.

Do let me know if you have any further input; otherwise you can close the issue. Thanks for helping me out with this and for sharing the list of all possible encodings. It helped me understand encodings a bit better.

For now I have implemented another fail-safe: I loop through all string variables in SPSS to find the ones that may cause problems and exclude them using usecols.

ofajardo commented 4 years ago

Did you try "UTF8"? For me that one seems to work. Interestingly, it is not in the list: the list has "UTF-8" (with a dash), which does not work, while "UTF8" (no dash) works.
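As an aside, in Python itself the two spellings are aliases of the same codec, so any difference here must come from iconv, which is what ReadStat uses underneath:

```python
import codecs

# Python normalizes both names to the same codec.
print(codecs.lookup("UTF8").name)   # → utf-8
print(codecs.lookup("UTF-8").name)  # → utf-8
```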

My other suggestion: even if you don't find the right encoding, you can use any encoding that does not produce an error to read the whole file very fast. Then you can exclude the wrong columns by simply inspecting which character columns contain non-Latin characters. This will be much faster than excluding columns by trapping errors while reading, since you pointed out that speed is an issue.

ShridharSahu commented 4 years ago

I did try 'UTF8', but it did not work for me; this encoding did not work for any of my sav files. Below is the error I got:

pyreadstat._readstat_parser.ReadstatError: File has an unsupported character set

Your other suggestion makes sense, but I am mainly going to use this to read numerical data, so in case of an error with the default encoding I just go ahead and skip all string variables to avoid the encoding trouble altogether.

ofajardo commented 4 years ago

OK, really strange. Sorry to hear it's not working. Still, if you need only the numerical data, you could read the full dataset with an encoding that does not give an error (say, in your case, JAVA), which will be very fast, and then just discard all character columns. That is also very easy: do df.dtypes to get the types, and for the columns with 'object' type, check whether the first non-null element is a string.
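A minimal sketch of that filtering step (the data frame here is made up for illustration):

```python
import pandas as pd

def drop_string_columns(df):
    """Drop columns whose first non-null element is a string.

    Numeric columns come through unchanged; 'object' columns are
    inspected as described above via their first non-null value.
    """
    keep = []
    for col in df.columns:
        if df[col].dtype != object:
            keep.append(col)
            continue
        non_null = df[col].dropna()
        if len(non_null) == 0 or not isinstance(non_null.iloc[0], str):
            keep.append(col)
    return df[keep]

# Hypothetical example data frame standing in for the sav contents
df = pd.DataFrame({"age": [30, 40], "name": ["a", "b"], "score": [1.5, None]})
print(drop_string_columns(df).columns.tolist())  # → ['age', 'score']
```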

evanmiller commented 4 years ago

I think the chardet Python package may help you guys

https://chardet.readthedocs.io/en/latest/

ofajardo commented 4 years ago

@evanmiller thanks for the suggestion!

However, I'm not sure I got it right. I tried this:

import chardet

with open("Test.sav", "rb") as h:
    c = h.read()

chardet.detect(c)
# {'encoding': 'IBM866', 'confidence': 0.3091198887292049, 'language': 'Russian'}

Now, passing that encoding to pyreadstat (and therefore ReadStat) doesn't give anything good. Is that the way to use it? Or do I need to pass the bytes of only the problematic string? If so, how do I get them?

I also tried this way:

import pyreadstat
import chardet

# read the sav file with an encoding that does not fail
df, meta = pyreadstat.read_sav("Test.sav", encoding="KOI8-U")
# recover the string and turn it back into bytes
b = df.iloc[0, 1].encode("KOI8-U")
chardet.detect(b)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
b.decode("utf-8")
# exactly the same error as pyreadstat without the encoding argument:
# UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 48-49: unexpected end of data

# However, if I leave out the last byte, it looks good:
b[0:48].decode("utf-8")
# నేను గతంలో వాడిన బ

# I can also look for the string in the whole file, just to confirm
# (assuming we still have the variable c in memory from the previous code block)
c[c.find(b):c.find(b) + 50].decode("utf-8")  # same error
c[c.find(b):c.find(b) + 48].decode("utf-8")  # నేను గతంలో వాడిన బ
# increasing the upper bound (e.g. to 60) also raises errors, so it seems
# that removing bytes cures the issue but adding them does not

So, it seems that ReadStat is giving me two extra bytes?!

One thing I have observed is that this file was generated by SPSS 27. I tried reading it with SPSS 25 and did not get the string in clear text. Maybe something changed in this new SPSS version that ReadStat cannot handle correctly?

evanmiller commented 4 years ago

It sounds like the UTF-8 stream is being interrupted. UTF-8 is a variable byte encoding – if there is a three-byte character at the end of the string, but only room for the first two bytes in the column field, then there will be an illegal byte sequence.
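This is easy to reproduce directly in Python: a Telugu character occupies three bytes in UTF-8, and cutting off the last byte yields exactly the error reported above.

```python
# One Telugu character is a three-byte UTF-8 sequence.
ch = "బ".encode("utf-8")
print(ch)  # → b'\xe0\xb0\xac'

# Keep only the first two bytes, as a too-narrow column field would,
# and strict decoding fails.
try:
    ch[:2].decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)  # → unexpected end of data
```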

I think I know what is happening. This isn't an issue with non-UTF-8 encodings, because there ReadStat ignores the last character when it receives EINVAL from iconv. However, when both the input and output encoding are UTF-8, ReadStat skips the iconv conversion (because it thinks no conversion is necessary) and passes the (potentially illegal) byte sequence directly to the client.

So a fix would be to force all character conversions through iconv. This might slow things down slightly with UTF-8 files, but will prevent issues like this in the future.
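The Python analogue of that lenient handling is a non-strict error handler: decoding the truncated bytes with errors="replace" turns the dangling sequence into a single replacement character (the same "�" the reporter saw in Notepad++) instead of raising. A sketch, again with a reconstructed byte string:

```python
# Valid UTF-8 text plus the first two bytes of a truncated three-byte character.
raw = "నేను గతంలో వాడిన బ".encode("utf-8") + b"\xe0\xb0"

text = raw.decode("utf-8", errors="replace")
print(text)  # → నేను గతంలో వాడిన బ�
```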

evanmiller commented 4 years ago

@ofajardo Try this commit

https://github.com/WizardMac/ReadStat/commit/a8b04663ad399159b8ac710ed629295a40290c65

evanmiller commented 4 years ago

Sorry, that commit is broken; let me try again.

evanmiller commented 4 years ago

The underlying issue is similar to https://github.com/WizardMac/ReadStat/issues/206 - in fact, the test introduced by the fixing commit is now failing.

evanmiller commented 4 years ago

@ofajardo Please try the latest commits in the dev branch.

ofajardo commented 4 years ago

@evanmiller Thanks! Yes, that solves the issue! The Hebrew file is also good.

ofajardo commented 4 years ago

@ShridharSahu I have released pyreadstat version 1.0.1, which contains the fix for your issue. The new version is available on PyPI and will be on conda in a few hours. Please give it a try and let us know if everything works now.

ShridharSahu commented 4 years ago

@ofajardo and @evanmiller - Thanks for working on this. It is now working.