brendonh / pyth

Python text markup and conversion
MIT License

rtf reader: unichr() causes ValueError () #4

Closed joka closed 14 years ago

joka commented 14 years ago

I have an RTF file with strange Unicode strings (I sent you an email).

This causes the RTF reader to raise a ValueError:

* Module pyth.plugins.rtf15.reader, line 93, in read
* Module pyth.plugins.rtf15.reader, line 113, in go
* Module pyth.plugins.rtf15.reader, line 141, in parse
* Module pyth.plugins.rtf15.reader, line 369, in handle
* Module pyth.plugins.rtf15.reader, line 476, in handle_u
ValueError: unichr() arg not in range(0x10000) (narrow Python build) 

The reason is that my Python was built without support for "wide" Unicode characters (http://www.python.org/dev/peps/pep-0261/). However, some exception handling would be nice.
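
For readers who want to check which kind of build they are running, here is a minimal sketch. The helper name `is_narrow_build` is made up for illustration and is not part of pyth. It relies only on `sys.maxunicode`, which is 0xFFFF on narrow builds and 0x10FFFF on wide builds and on every Python 3.3+ interpreter.

```python
import sys

def is_narrow_build():
    """Return True if unichr()/chr() cannot go above U+FFFF on this interpreter."""
    # Hypothetical helper, not part of pyth.
    # Narrow builds report 0xFFFF; wide builds and Python 3.3+ report 0x10FFFF.
    return sys.maxunicode == 0xFFFF

if __name__ == "__main__":
    print("narrow build" if is_narrow_build() else "wide build")
```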

brendonh commented 14 years ago

This is a legitimate bug but I'm not sure what the correct fix is. It's possible to construct a surrogate pair to represent the character, with something like:

struct.pack('<L', 0x10000).decode('utf-32')

This will "work", but may cause other bugs down the line, e.g. with slicing. So maybe we shouldn't do it.
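
To make the surrogate pair idea concrete, here is a rough sketch of both routes: the arithmetic that UTF-16 defines for splitting a code point above U+FFFF, and the struct trick itself. The function names `surrogate_pair` and `code_point_to_text` are invented for this example and are not pyth's API. One assumption worth flagging: the sketch decodes with 'utf-32-le' rather than 'utf-32', because without a byte order mark the plain 'utf-32' codec should fall back to the machine's native byte order, while the '<L' pack format is always little-endian.

```python
import struct

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    # Hypothetical helper, not part of pyth.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    offset = code_point - 0x10000
    # The top 10 bits go into the high surrogate, the low 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def code_point_to_text(code_point):
    """Decode a single code point via UTF-32, as in the struct trick."""
    # The explicit little-endian codec matches the '<L' pack format.
    return struct.pack('<L', code_point).decode('utf-32-le')
```

On a narrow build the string returned by `code_point_to_text` would hold two code units for such a character, which is exactly where the slicing concern comes from.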

Other alternatives include replacing it with '?', or just raising a different exception type. I haven't decided.

joka commented 14 years ago

I like the solution of replacing it with '?' plus a log message.
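
A minimal sketch of what the "replace with '?' and log" option could look like. The wrapper name `safe_unichr` and its `replacement` parameter are assumptions made for illustration and do not describe what pyth actually does. It simply catches the error raised for an out-of-range argument and falls back to a placeholder.

```python
import logging

log = logging.getLogger(__name__)

try:
    _to_char = unichr  # Python 2
except NameError:
    _to_char = chr     # Python 3

def safe_unichr(code_point, replacement=u'?'):
    """Return the character for code_point, or a placeholder if it is out of range."""
    # Hypothetical wrapper, not part of pyth.
    try:
        return _to_char(code_point)
    except (ValueError, OverflowError):
        # Out of range for this build: log it and substitute the placeholder.
        log.warning("Cannot represent code point %r; using %r",
                    code_point, replacement)
        return replacement
```

The trade-off compared with the surrogate pair approach is that the original character is lost, but `len()` and slicing on the result stay predictable.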

brendonh commented 14 years ago

I've tried the struct trick above. It means that plugins should never trust len() of unicode strings, or slice them. But that's probably true anyway.
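
To illustrate why `len()` can mislead once surrogate pairs are in a string, here is a small sketch that counts code points by treating a high surrogate followed by a low surrogate as one character. The helper `count_code_points` is hypothetical and not part of pyth. The example string is built from two explicit surrogate escapes so that it behaves the same way a narrow build would store a character above U+FFFF.

```python
def count_code_points(text):
    """Count characters, treating each UTF-16 surrogate pair as one."""
    # Hypothetical helper, not part of pyth.
    count = 0
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            i += 2  # skip both halves of the pair
        else:
            i += 1
        count += 1
    return count

# Two code units that together encode U+1F600.
pair = u'\ud83d\ude00'
sample = u'a' + pair + u'b'
```

Here `sample` has four code units but `count_code_points(sample)` treats it as three characters, and a slice such as `sample[:2]` would cut the pair in half.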