dopefishh / pympi

A python module for processing ELAN and Praat annotation files
MIT License

Compatibility with ACT R library #56

Open lucientisserand opened 7 months ago

lucientisserand commented 7 months ago

Expected behaviour An ELAN file exported by pympi should open in the Annotated Corpus Toolkit (ACT), or should be formatted like an original ELAN file.

Actual behaviour The exported ELAN file cannot be processed directly by ACT, and it is not formatted like an original ELAN file.

System information

Additional context I work with both Pympi and Oliver Ehmer's Annotated Corpus Toolkit for R (ACT), two great pieces of code for linguists working with ELAN. I noticed that ELAN files exported with pympi (with or without the "pretty" parameter) could not be processed directly by ACT (see below). However, they can be processed once the file has been opened and then saved in ELAN. So I compared pympi's fresh export with the version overwritten by ELAN and found these two issues when importing the pympi file in ACT:

  1. The file is not loaded at all: apparently this error is due to the EAF version given in the xsi:noNamespaceSchemaLocation attribute (3.0 is loaded, 2.8 is not).
  2. If issue 1 is corrected (2.8>3.0), the file is loaded but ACT does not find the time values. It works, however, if the space character before the TIME_SLOT closing tag is removed.
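To check which schema version a given file declares, something like the following sketch may help. It uses only the Python standard library; `declared_schema` is a hypothetical helper name (not part of pympi or ACT), and the sample document is a made-up minimal root element, not a real export.

```python
import xml.etree.ElementTree as ET

# Standard namespace URI for the xsi: prefix.
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def declared_schema(eaf_text):
    """Return the xsi:noNamespaceSchemaLocation value of the root element.

    Hypothetical helper; returns None if the attribute is absent.
    """
    root = ET.fromstring(eaf_text)
    return root.get("{%s}noNamespaceSchemaLocation" % XSI_NS)


# Made-up minimal sample; a real EAF file has many more elements.
sample = (
    '<ANNOTATION_DOCUMENT '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="http://www.mpi.nl/tools/elan/EAFv2.8.xsd" '
    'VERSION="2.8"/>'
)
print(declared_schema(sample))
```

Running this on a fresh pympi export and on the same file re-saved in ELAN should show whether the declared schema differs between the two.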

Workaround found If I bulk replace the version number (2.8>3.0) and bulk remove the space character before every closing XML tag, the file is successfully processed by ACT. Since the original ELAN files are not formatted this way, I thought it was more a "pympi" issue than an "ACT" issue. So maybe some slight export modifications would be welcome in pympi?
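A minimal Python sketch of this workaround, for anyone blocked in the meantime. It rests on two assumptions that are not confirmed in this thread: that the version appears in the file as the string "EAFv2.8.xsd" inside the schema location, and that the offending space is the one before "/>" in self-closing tags such as TIME_SLOT. `patch_eaf_text` and `patch_eaf_file` are hypothetical helper names, not part of pympi.

```python
from pathlib import Path


def patch_eaf_text(text):
    """Apply the two textual replacements described above (hypothetical helper)."""
    # Assumption: the schema version appears as "EAFv2.8.xsd" in the
    # xsi:noNamespaceSchemaLocation attribute.
    text = text.replace("EAFv2.8.xsd", "EAFv3.0.xsd")
    # Assumption: the space to remove is the one before "/>".
    # Caveat: this is a plain text replacement, so an annotation value that
    # literally contains " />" would be altered as well.
    text = text.replace(" />", "/>")
    return text


def patch_eaf_file(src, dst):
    """Read an EAF file, patch it, and write the result to a new path."""
    patched = patch_eaf_text(Path(src).read_text(encoding="utf-8"))
    Path(dst).write_text(patched, encoding="utf-8")


# Made-up minimal sample, not a real pympi export.
sample = (
    '<ANNOTATION_DOCUMENT '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="http://www.mpi.nl/tools/elan/EAFv2.8.xsd">'
    '<TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0" />'
    '</ANNOTATION_DOCUMENT>'
)
patched_sample = patch_eaf_text(sample)
```

This only relabels the file as 3.0; it does not check that the content actually conforms to the 3.0 schema.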

Thank you for your work, Lucien

dopefishh commented 7 months ago

Failing on extra spacing before closing XML tags suggests a fragile XML parser on ACT's side. However, I'm not opposed to generating stricter XML without this spacing, as it doesn't change the semantics of the file. Increasing the version can be done, but we have to make sure that the generated file really is 3.0 compliant. Since the major version is increased, I assume there are some backwards-incompatible changes between 2.x and 3.x. The specification can be found here: https://www.mpi.nl/tools/elan/EAF_Annotation_Format_3.0_and_ELAN.pdf In a previous issue we found that it probably is compatible though (https://github.com/dopefishh/pympi/issues/29).

So in short, yes please, I would be happy to accept merge requests for this.

lucientisserand commented 7 months ago

Totally agree. When I have time I will look into the differences between 2.8 and 3.0 before trying to propose something (still a beginner in Python but learning by doing). I may also suggest that ACT handle the space character, since it is valid XML syntax. In the meantime I hope some people may find the workaround useful if they are blocked.