dopefishh / pympi

A Python module for processing ELAN and Praat annotation files
MIT License
93 stars, 39 forks

TextGrid: multi-line parsing, escaped quote parsing #55

Closed (myrix closed this 7 months ago)

myrix commented 7 months ago

These are fixes for some problems we encountered while parsing some of our TextGrids:

  1. Line-by-line decoding of multi-byte encodings such as UTF-16 sometimes breaks.
  2. Texts containing newlines cannot be parsed.
  3. Inner quotes escaped by doubling remain doubled after parsing.

This PR (a) replaces line-by-line decoding of text-format TextGrids with whole-file decoding, (b) parses multi-line texts containing any number of newlines by reading further lines until the text is complete, and (c) converts doubled (escaped) quotes inside each text back into single quotes once the text has been read.
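
For readers who want to see the three ideas in isolation, here is a minimal sketch. It is not the code from this PR or pympi's API: the helper names `decode_textgrid`, `find_closing_quote` and `read_quoted_text` are hypothetical, and the BOM-based codec choice is an assumption about how the input files are encoded. Decoding the whole byte string at once matters because splitting UTF-16 bytes on `b"\n"` cuts two-byte characters in half.

```python
import codecs


def decode_textgrid(raw):
    """Decode the whole file at once (hypothetical helper, BOM-based guess)."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")      # the codec consumes the BOM
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    return raw.decode("utf-8")           # assumption: UTF-8 when no BOM


def find_closing_quote(buf):
    """Index of the first quote that is not part of a doubled pair, or None."""
    i = 0
    while i < len(buf):
        if buf[i] == '"':
            if buf[i + 1:i + 2] == '"':
                i += 2                   # doubled quote: escaped, skip both
                continue
            return i                     # lone quote: end of the text
        i += 1
    return None


def read_quoted_text(lines, start):
    """Read a quoted text that begins on lines[start].

    Returns (text, index of the first line after the text).
    """
    line = lines[start]
    buf = line[line.index('"') + 1:]     # everything after the opening quote
    i = start
    while True:
        end = find_closing_quote(buf)
        if end is not None:
            # Unescape only after the complete text has been collected.
            return buf[:end].replace('""', '"'), i + 1
        i += 1
        if i >= len(lines):
            raise ValueError("unterminated text starting on line %d" % start)
        buf += "\n" + lines[i]           # keep the newline, look at one more line


raw = 'text = "he said ""hi""\nand left"\nxmin = 0\n'.encode("utf-16")
lines = decode_textgrid(raw).split("\n")
text, next_line = read_quoted_text(lines, 0)
print(repr(text))
```

Scanning quotes pairwise from the opening quote is what lets a doubled quote at the end of a line be told apart from the closing quote.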

dopefishh commented 7 months ago

Thank you very much. This may also fix https://github.com/dopefishh/pympi/issues/52.