Shapefiles encoded in Latin-1 return a mixture of str and (not decoded) bytes

ignamv commented 7 years ago

Hi. If I load the Buenos Aires streets shapefile with pyshp, the records with accented letters are returned without decoding (as bytes). This is because of an attempt to decode as UTF-8 which ignores exceptions. I have a simple fix by adding an optional "encoding" argument in the Reader constructor. This allows the Reader to correctly process both my shapefile and the one which motivated the faulty try/except block.
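To illustrate the behavior described above, here is a minimal sketch in plain Python. It does not reproduce pyshp's actual code: `decode_lenient` and `decode_with_encoding` are hypothetical helpers that only mimic a UTF-8 attempt that swallows errors, compared with an explicit encoding choice.

```python
def decode_lenient(raw):
    """Hypothetical helper: try UTF-8 and fall back to the raw bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw  # the caller silently receives bytes instead of str


def decode_with_encoding(raw, encoding="utf-8"):
    """Hypothetical helper: decode with a caller-supplied encoding, no fallback."""
    return raw.decode(encoding)


# Two street names as they would be stored in a Latin-1 encoded .dbf file.
plain = "Corrientes".encode("latin-1")
accented = "C\u00f3rdoba".encode("latin-1")  # contains byte 0xF3

# ASCII-only values decode fine, accented ones come back as bytes.
print(type(decode_lenient(plain)).__name__)
print(type(decode_lenient(accented)).__name__)

# With the right encoding, both values come back as text.
print(type(decode_with_encoding(accented, "latin-1")).__name__)
```

Because ASCII-only records are valid UTF-8, only the accented ones hit the fallback, which would explain the mixture of str and bytes.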

Would it be OK if I added a unittest-based test for this? Or would you prefer a more backwards-compatible way?

Thanks, Ignacio

ggm-at-apnic commented 7 years ago

Byte 29 of the .dbf file encodes the codepage. If it's set to the value 0x57, it's codepage 1252. That's ISO-8859-1 or Latin-1, and we could do decode('latin1') safely. Otherwise, it's probably UTF-8.

The .cpg file also specifies the codepage, if there is one.
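As a rough sketch of where these two hints live on disk, the following reads the raw values without interpreting them. The function names are made up for illustration, and the sketch assumes the .cpg file sits next to the .dbf file with a lowercase extension, which may not hold on case-sensitive file systems.

```python
import os


def read_language_driver_id(dbf_path):
    """Return the byte at offset 29 of the .dbf header, or None if the file is too short."""
    with open(dbf_path, "rb") as f:
        header = f.read(32)
    if len(header) < 30:
        return None
    return header[29]  # indexing bytes yields an int in Python 3


def read_cpg(dbf_path):
    """Return the stripped text of the sibling .cpg file, or None if it is absent or empty."""
    cpg_path = os.path.splitext(dbf_path)[0] + ".cpg"  # assumption: lowercase extension
    if not os.path.exists(cpg_path):
        return None
    with open(cpg_path, "r", encoding="ascii", errors="replace") as f:
        return f.read().strip() or None
```

Both helpers only return what is stored; deciding which Python codec a value corresponds to is a separate step.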

micahcochran commented 7 years ago

I wrote some code that identified the DBF codepage and converted it. I'll see if I can force myself to get a PR together.

karimbahgat commented 7 years ago

@ignamv I have now accepted your PR #106; all of this will end up in a new major version anyway.

@ggm-at-apnic that would be great if the encoding can be stored in the dbf file itself, but you say it's only a flag indicating latin1 or utf8? I also remember reading somewhere (though I can't find it right now) that the Shapefile format is based on Dbase III, where bytes 28-31 are reserved for future use. Only for versions IV and later can I find that byte 29 is the "Language Driver ID", though no explanation is given for how to interpret it... Can anyone confirm whether shapefiles are indeed tied to version III?

@micahcochran it would be amazing if you could locate your code for the cpg file, that would be a great addition!

karimbahgat commented 7 years ago

I believe that, now that PR #106 allows specifying the encoding and no longer silently returns bytes upon failure, this issue should be resolved.

ArtiiP commented 7 years ago

> Can anyone confirm whether shapefiles are indeed tied to version III?

It's tied to the dbf file :-) in any version. It's safer not to assume anything, sorry.

Btw, "in the wild" byte 29 can have a really random value :-/ I have seen all combinations. The .cpg file is far more reliable.

micahcochran commented 7 years ago

> Can anyone confirm whether shapefiles are indeed tied to version III?

dBase IV.... I think (https://en.wikipedia.org/wiki/Shapefile)

I used the Shapefile C Library as a basis for my code; GDAL and MapServer use forks of it within their code repositories. I didn't really want to reinvent the wheel too badly.

> btw "on the wild" 29 byte can have really random value :-/ i saw all combination.

That value should correspond to this table: http://shapelib.maptools.org/codepage.html. If there is a .cpg file, it takes precedence.
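The precedence rule could be sketched roughly as follows. This is only an illustration: `resolve_encoding` and `normalize_codec` are hypothetical names, the mapping is a small partial subset written from memory of commonly published language driver tables (please verify against the linked table before relying on it), and the 0x57 entry simply follows the suggestion made earlier in this thread.

```python
import codecs

# Partial, illustrative mapping from language driver ID to a Python codec name.
# Assumption: values should be checked against the shapelib codepage table.
LDID_TO_ENCODING = {
    0x01: "cp437",
    0x02: "cp850",
    0x03: "cp1252",
    0x57: "latin-1",  # as suggested earlier in this thread
    0xC8: "cp1250",
    0xC9: "cp1251",
}


def normalize_codec(name):
    """Return Python's canonical codec name for `name`, or None if Python does not know it."""
    for candidate in (name, "cp" + name):
        try:
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return None


def resolve_encoding(ldid=None, cpg_text=None, default="utf-8"):
    """Pick an encoding: a usable .cpg value wins, then the LDID byte, then the default."""
    if cpg_text:
        codec = normalize_codec(cpg_text.strip())
        if codec:
            return codec
    if ldid in LDID_TO_ENCODING:
        return LDID_TO_ENCODING[ldid]
    return default
```

An explicit encoding passed by the user would presumably still override whatever this returns, since both hints can be wrong or missing.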

karimbahgat commented 7 years ago

Thanks for the link about the dbase version and the list of cpg codepage values, they should prove very useful, @micahcochran!

But now I'm curious. When I was looking over the specs, it actually seemed like PyShp is built around v7 of dbase. Its handling of true/false values uses a space for missing, as in v7, rather than "?" as in v3-5. But more importantly, it reads numeric and float fields as padded strings, as in v7, rather than as actual binary doubles and floats as in v3-5. And since pyshp has worked so nicely for all this time, one could only assume the files must have been v7, right? See the format specs for v7 vs. those for v3-5.
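To make the two conventions concrete, here is a small sketch of what reading such fields could look like. It is not pyshp's actual code; `parse_numeric` and `parse_logical` are hypothetical helpers, and treating both a space and "?" as missing is an assumption made so that either convention is tolerated.

```python
def parse_numeric(raw):
    """Parse a space-padded ASCII number; return None when the field is blank."""
    text = raw.decode("ascii").strip()
    if not text:
        return None
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_logical(raw):
    """Map a one-byte logical field to True, False, or None (missing)."""
    flag = raw.decode("ascii").upper()
    if flag in ("Y", "T"):
        return True
    if flag in ("N", "F"):
        return False
    return None  # covers both " " and "?" as missing markers


print(parse_numeric(b"   12.50"))
print(parse_logical(b" "))
```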

Maybe since the version was never specified in the esri shapefile spec it really could be any dbase version as @ArtiiP suggested? And maybe by chance the majority of shapefiles have simply been written with the newest dbase version?

At any rate, encoding read and write support now works smoothly, along with docs and tests, so that the user is always working with unicode, on both py2 and 3. So the last piece of the puzzle that would really complete pyshp's unicode support is to check the language flag or the cpg file. We can also check for the dbase version flag (which I believe is in the header) just to make sure it's v4 or higher.

ArtiiP commented 7 years ago

Usually the files that I get from different systems are mostly level 5, i.e. 0x0:0x3 (and the header is 32 B long). Esri software (I guess) produces 0x3 as well.

For me, as long as I can force any encoding for any version, it is OK to do some magic in guessing the encoding, ofc only when it is not forced.
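For anyone who wants to inspect these bytes themselves, a sketch along these lines unpacks the fixed 32-byte start of a .dbf header. The function name and dictionary keys are made up for illustration, and reading the low three bits of byte 0 as the version number follows the usual dBase header descriptions rather than anything in the shapefile spec.

```python
import struct


def parse_dbf_header(header):
    """Unpack the fixed 32-byte prefix of a .dbf header into a dict."""
    if len(header) < 32:
        raise ValueError("need at least 32 bytes of header")
    # Byte 0: version flag; bytes 1-3: last update (YY MM DD);
    # bytes 4-7: record count; 8-9: header length; 10-11: record length.
    version, yy, mm, dd, num_records, header_len, record_len = struct.unpack(
        "<4BIHH", header[:12]
    )
    return {
        "version_byte": version,
        "version_number": version & 0x07,  # assumption: low three bits hold the version
        "last_update": (1900 + yy, mm, dd),
        "num_records": num_records,
        "header_length": header_len,
        "record_length": record_len,
        "language_driver_id": header[29],
    }
```

Typical usage would be to read the first 32 bytes of the file in binary mode and pass them in, for example `parse_dbf_header(open(path, "rb").read(32))`, where `path` is whatever .dbf file you want to check.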

GeospatialPython / pyshp

Shapefiles encoded in Latin-1 return a mixture of str and (not decoded) bytes #104