IDR / idr-utils

Utility scripts for managing IDR submissions
BSD 2-Clause "Simplified" License

Study parser decoding errors #39

Closed sbesson closed 3 years ago

sbesson commented 3 years ago

Because study files are submitted in a variety of encodings, the study parser regularly fails with encoding errors such as:

Traceback (most recent call last):
  File "pyidr/study_parser.py", line 659, in <module>
    parser = main(sys.argv[1:])
  File "pyidr/study_parser.py", line 632, in main
    p = StudyParser(s)
  File "pyidr/study_parser.py", line 113, in __init__
    self._study_lines = f.readlines()
  File "/Users/sbesson/anaconda3/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 7339: invalid start byte

One approach is to unify and enforce the encoding of study files. This PR explores the alternative: it makes the study_parser more lenient by forcing UTF-8 decoding and ignoring decoding errors.
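As an illustration of the lenient approach, here is a minimal sketch that reads a file as UTF-8 while dropping bytes that cannot be decoded. It assumes the change amounts to passing `errors="ignore"` to the built-in `open()`. The helper name `read_study_lines` and the sample file content are hypothetical and are not taken from `pyidr/study_parser.py`.

```python
import os
import tempfile


def read_study_lines(path):
    """Read a study file as UTF-8, silently dropping undecodable bytes.

    Hypothetical helper for illustration; not the actual parser code.
    """
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.readlines()


# Write a small sample file containing the stray byte 0xa8 from the traceback.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"Study Title\tExample \xa8 study\n")
    demo_path = tmp.name

try:
    # With the default errors="strict" this read would raise UnicodeDecodeError.
    demo_lines = read_study_lines(demo_path)
finally:
    os.remove(demo_path)

print(demo_lines)
```

The trade-off is that the offending byte is removed rather than converted, so the original character is lost from the parsed text.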

Tested with idr0072 study file

dominikl commented 3 years ago

Looks good, works fine now with all sorts of characters/symbols! 👍

sbesson commented 3 years ago

Thanks. Merging to bump the submodules. We'll just have to check that special characters are handled properly in future studies.
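One possible way to do that check is to list the bytes that a strict UTF-8 decode would reject, so they can be inspected before they are silently dropped. This is only a sketch: the function name `find_undecodable_bytes` and the sample data are hypothetical, and nothing like it is part of idr-utils as far as this thread shows.

```python
def find_undecodable_bytes(data, encoding="utf-8"):
    """Return (offset, bytes) pairs for each span that fails strict decoding.

    Hypothetical review helper; not part of the study parser.
    """
    problems = []
    offset = 0
    while offset < len(data):
        try:
            data[offset:].decode(encoding)
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice being decoded.
            bad_start = offset + exc.start
            bad_end = max(offset + exc.end, bad_start + 1)
            problems.append((bad_start, data[bad_start:bad_end]))
            offset = bad_end
    return problems


sample = b"abc\xa8def\xffg"
report = find_undecodable_bytes(sample)
for position, raw in report:
    print(f"offset {position}: {raw!r}")
```

Running such a check on a new study file before parsing would show which characters are going to be lost under the lenient decoding.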