dopefishh / pympi

A python module for processing ELAN and Praat annotation files
MIT License

Proper error raising on encoding error #12

Closed hadware closed 5 years ago

hadware commented 5 years ago

Hello,

I'm currently working on a pretty inconsistent dataset of TextGrid files (with @Rachine, whom you might have been in contact with). I had trouble with some files because they were encoded in utf-16be (and a few even in iso-8859-1), while most files were encoded in ascii. I had no idea about this inconsistency when I started processing the dataset, and although it is obviously not really your fault, I had some trouble figuring out why some TextGrid files wouldn't open with pympi.

The errors I got weren't even the same from one encoding to the next: for instance, trying to open a utf-16be file gave me an AttributeError, whereas the iso-8859-1 files gave me a UnicodeDecodeError.

It would be nice to raise a proper error when the parsing fails because of encoding errors. I don't know if it's possible, but since I've dug pretty far into the TextGrid parsing function, I could PR a potential fix if you have an idea.

dopefishh commented 5 years ago

Thanks for the feedback! Detecting the encoding of a file is notoriously difficult: there is no reliable method that detects all encodings. If you have a way of making the error message better for a specific encoding, I'm happy to merge a PR. Any improvement is welcome.
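
For illustration only, here is a minimal sketch of how a single, explicit error could be raised instead of an AttributeError or a UnicodeDecodeError depending on the file. It is not pympi's code: `TextGridEncodingError`, `decode_textgrid_bytes` and the list of candidate encodings are hypothetical names and choices made up for this example. It assumes a text-format TextGrid, whose first line starts with `File type`, and uses that header as a sanity check, since some decodings "succeed" while producing garbage.

```python
import codecs


class TextGridEncodingError(ValueError):
    """Hypothetical error type for this sketch; not part of pympi."""


# Byte-order marks that identify an encoding unambiguously.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
)

# Assumed candidates, tried in order when no BOM is present.
# iso-8859-1 goes last because it can decode any byte sequence.
_FALLBACKS = ("utf-8", "utf-16-be", "utf-16-le", "iso-8859-1")


def decode_textgrid_bytes(raw, fallbacks=_FALLBACKS):
    """Return (text, encoding) or raise TextGridEncodingError."""
    candidates = fallbacks
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            candidates = (encoding,)
            break

    tried = []
    for encoding in candidates:
        tried.append(encoding)
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        # A decode can succeed and still be wrong, so check the header
        # that text-format TextGrid files are assumed to start with.
        if text.lstrip().startswith("File type"):
            return text, encoding

    raise TextGridEncodingError(
        "could not decode the data as a text TextGrid; tried: "
        + ", ".join(tried)
    )


if __name__ == "__main__":
    sample = 'File type = "ooTextFile"\nObject class = "TextGrid"\n'
    _, used = decode_textgrid_bytes(sample.encode("utf-16-be"))
    print(used)
```

This only guesses among a handful of encodings, so it does not contradict the point above: it cannot detect every encoding, it can only turn the failure into one predictable exception that names what was tried.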