dopefishh / pympi

A python module for processing ELAN and Praat annotation files
MIT License

Script working fine until I save file in ELAN 5.9 and EAF file gets corrupted #27

Closed macramole closed 3 years ago

macramole commented 4 years ago

Hi, I'm trying to add some tiers and non-overlapping segments to my EAF file.

I'm using the following code:

    import pympi

    eaf = pympi.Eaf(fullPath)
    # if the tier already exists, nothing happens
    eaf.add_tier("code")
    eaf.add_tier("code_num")
    eaf.add_tier("on_off")
    eaf.add_tier("context")
    eaf.add_tier("note")

    i = 0
    for segmento in row["Tiempos en milisegundos"].split(" "):
        segmento = segmento.split("-")
        timeFrom = int(segmento[0])
        timeTo = int(segmento[1])

        eaf.add_annotation("code", timeFrom, timeTo, value="")
        eaf.add_annotation("code_num", timeFrom, timeTo, value=str(i))
        eaf.add_annotation("on_off", timeFrom, timeTo, value=f"{timeFrom}_{timeTo}")
        eaf.add_annotation("context", timeFrom - 120000, timeTo + 60000, value=" ")
        eaf.add_annotation("note", timeFrom - 120000, timeTo + 60000, value="RandomSampling para variation sets con ACLEW")
        eaf.to_file(f"{targetDir}/{filename}")

        i += 1

The EAF files are created and I can open them with ELAN 5.9. I can see selected segments and everything seems to be working fine.

The problem is when I add a new segment from ELAN and save, the file gets corrupted and cannot be opened any more.

Examining the EAF file I can see that for instance these lines:

    <TIER TIER_ID="code" LINGUISTIC_TYPE_REF="dependency">
        <ANNOTATION>
            <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3327" TIME_SLOT_REF1="ts6653" TIME_SLOT_REF2="ts6654">
                <ANNOTATION_VALUE />
            </ALIGNABLE_ANNOTATION>
        </ANNOTATION>

become:

    <TIER LINGUISTIC_TYPE_REF="dependency" TIER_ID="code">
        <ANNOTATION>
            <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3332" TIME_SLOT_REF1="" TIME_SLOT_REF2="">
                <ANNOTATION_VALUE></ANNOTATION_VALUE>
            </ALIGNABLE_ANNOTATION>
        </ANNOTATION>

TIME_SLOT_REF1 and TIME_SLOT_REF2 are empty! :(
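For anyone who wants to check a batch of files for this symptom, here is a minimal sketch using only the Python standard library. The function name and the sample document are made up for illustration; the element and attribute names are the ones visible in the snippet above.

```python
import xml.etree.ElementTree as ET


def find_empty_time_refs(eaf_xml):
    """Return (tier_id, annotation_id) pairs whose time slot refs are empty or missing."""
    root = ET.fromstring(eaf_xml)
    broken = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            ref1 = ann.get("TIME_SLOT_REF1", "")
            ref2 = ann.get("TIME_SLOT_REF2", "")
            if not ref1 or not ref2:
                broken.append((tier.get("TIER_ID"), ann.get("ANNOTATION_ID")))
    return broken


# Illustrative document, reduced to the corrupted annotation shown above.
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER LINGUISTIC_TYPE_REF="dependency" TIER_ID="code">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3332" TIME_SLOT_REF1="" TIME_SLOT_REF2="">
        <ANNOTATION_VALUE></ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

print(find_empty_time_refs(SAMPLE))
```

To check a real file, read its contents and pass the text to `find_empty_time_refs`; an empty list means no annotation lost its time slot references.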

The original EAF files were created using chat2elan from the CLAN project. Opening and editing these EAF files in ELAN 5.9 works just fine.

System information

sarpu commented 4 years ago

What happens when you validate the EAF file from ELAN (File -> Validate EAF File)?

dopefishh commented 4 years ago

It might be that ELAN changed the EAF format for newer versions. The XML schema should be declared in the file header.
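A quick way to see which schema a file declares is to read the root element's attributes. This is a rough sketch, assuming the root ANNOTATION_DOCUMENT element carries a FORMAT attribute and an xsi:noNamespaceSchemaLocation attribute, as EAF files typically do; the function name and the sample header are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace that the "xsi:" prefix is bound to in XML Schema instance documents.
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def eaf_schema_info(eaf_xml):
    """Return the format version and schema location declared on the root element."""
    root = ET.fromstring(eaf_xml)
    return {
        "format": root.get("FORMAT"),
        # ElementTree exposes namespaced attributes as "{namespace}name".
        "schema": root.get(f"{{{XSI_NS}}}noNamespaceSchemaLocation"),
    }


# Illustrative header only; a real file has child elements as well.
SAMPLE_HEADER = (
    '<ANNOTATION_DOCUMENT FORMAT="3.0" VERSION="3.0" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="http://www.mpi.nl/tools/elan/EAFv3.0.xsd"/>'
)

print(eaf_schema_info(SAMPLE_HEADER))
```

If either value comes back as `None`, the file does not declare it in the way this sketch assumes.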

macramole commented 4 years ago

@sarpu these errors might be relevant:

    ERROR: tier "code" has no parent tier but has stereotype CONSTRAINT "Symbolic_Association" defined in its linguistic type "dependency"
    ERROR: the tier "code" contains 15 alignable annotations not consistent with tier stereotype "Symbolic_Association"
    Checking tier: code_num
    ERROR: tier "code_num" has no parent tier but has stereotype CONSTRAINT "Symbolic_Association" defined in its linguistic type "dependency"
    ERROR: the tier "code_num" contains 15 alignable annotations not consistent with tier stereotype "Symbolic_Association"
    Checking tier: on_off
    ERROR: tier "on_off" has no parent tier but has stereotype CONSTRAINT "Symbolic_Association" defined in its linguistic type "dependency"
    ERROR: the tier "on_off" contains 15 alignable annotations not consistent with tier stereotype "Symbolic_Association"
    Checking tier: context
    ERROR: tier "context" has no parent tier but has stereotype CONSTRAINT "Symbolic_Association" defined in its linguistic type "dependency"
    ERROR: the tier "context" contains 15 alignable annotations not consistent with tier stereotype "Symbolic_Association"
    Checking tier: note
    ERROR: tier "note" has no parent tier but has stereotype CONSTRAINT "Symbolic_Association" defined in its linguistic type "dependency"
    ERROR: the tier "note" contains 15 alignable annotations not consistent with tier stereotype "Symbolic_Association"
    There are tier-type/tier-hierarchy inconsistencies. Please refer to the EAF format documentation:

macramole commented 4 years ago

@dopefishh yes, I've used this script with ELAN 5.1 and it used to work. Our annotators can't use that version anymore because of a Java version problem

dopefishh commented 4 years ago

It seems that they indeed upgraded to XML schema version 3.0; a warning is probably emitted when reading these files. The changes in the new schema need to be implemented in pympi. I could definitely use help with this.

sarpu commented 4 years ago

@dopefishh I can take a shot at this. Is there a sample 3.0 file similar to the ones for 2.7 and 2.8 under examples?

dopefishh commented 4 years ago

@sarpu: Thanks, I'll be happy to accept a PR

The old schema is available here: http://www.mpi.nl/tools/elan/EAFv2.8.xsd

The new schema is available here: http://www.mpi.nl/tools/elan/EAFv3.0.xsd

A human-readable explanation of the new schema is available here: https://www.mpi.nl/tools/elan/EAF_Annotation_Format_3.0_and_ELAN.pdf

The old schema's human-readable explanation is available here: https://www.mpi.nl/tools/elan/EAF_Annotation_Format_2.8_and_ELAN.pdf

sarpu commented 4 years ago

So I am digging around the code and I don't think this issue is related to the EAF 3.0 update. In the add_tier function, the code picks the first available linguistic type if one is not specified. @macramole doesn't specify a linguistic type when adding a tier, so the code picks dependency:
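To make the fallback concrete, here is a simplified model of the behaviour described above. It is not pympi's code: the function name, the "default-lt" placeholder, and the "utterance" type are hypothetical, and treating "first available" as "first in sorted order" is an assumption.

```python
def pick_linguistic_type(requested, linguistic_types):
    """Return `requested` if it is defined, else fall back to the first defined type.

    Assumption: "first" means first in sorted order of the type names.
    """
    if requested in linguistic_types:
        return requested
    return sorted(linguistic_types)[0]


# Hypothetical set of types in a file; only "dependency" comes from the thread.
types = {
    "utterance": {"CONSTRAINTS": None},
    "dependency": {"CONSTRAINTS": "Symbolic_Association"},
}

# The caller did not name an existing type, so the fallback kicks in.
print(pick_linguistic_type("default-lt", types))
# An existing type is used as requested.
print(pick_linguistic_type("utterance", types))
```

Under these assumptions the first call lands on dependency simply because of its name, which is how a tier can silently end up with a constrained type.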

https://github.com/dopefishh/pympi/blob/ad5e52b15979b09ea43df5e25dcf1c5b280e99fb/pympi/Elan.py#L385-L386

The code tier, through its automatically picked linguistic type dependency, has the constraint "Symbolic Association", which, under both the EAF 2.8 and 3.0 standards, means that the tier can only contain reference annotations. But @macramole is using the add_annotation function, which adds an aligned annotation rather than the reference annotation that the tier's linguistic type requires. That is invalid under both 2.8 and 3.0, and it is probably why ELAN deletes the time slot references.
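The mismatch can also be spotted directly in the XML without opening ELAN. The following is a rough standard-library sketch, assuming linguistic types are declared as LINGUISTIC_TYPE elements with LINGUISTIC_TYPE_ID and CONSTRAINTS attributes and that child tiers carry a PARENT_REF attribute; the function name and sample document are illustrative, and the check is far less thorough than ELAN's validator.

```python
import xml.etree.ElementTree as ET


def find_stereotype_conflicts(eaf_xml, ref_only=("Symbolic_Association",)):
    """List tiers whose linguistic type requires reference annotations but is misused."""
    root = ET.fromstring(eaf_xml)
    # Map each linguistic type name to its constraint (None when unconstrained).
    constraints = {
        lt.get("LINGUISTIC_TYPE_ID"): lt.get("CONSTRAINTS")
        for lt in root.iter("LINGUISTIC_TYPE")
    }
    problems = []
    for tier in root.iter("TIER"):
        stereotype = constraints.get(tier.get("LINGUISTIC_TYPE_REF"))
        if stereotype not in ref_only:
            continue
        tier_id = tier.get("TIER_ID")
        if tier.get("PARENT_REF") is None:
            problems.append(f'tier "{tier_id}" has no parent tier')
        aligned = len(list(tier.iter("ALIGNABLE_ANNOTATION")))
        if aligned:
            problems.append(f'tier "{tier_id}" contains {aligned} alignable annotations')
    return problems


# Illustrative document: one top-level tier using a constrained type.
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="code" LINGUISTIC_TYPE_REF="dependency">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE />
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="dependency" CONSTRAINTS="Symbolic_Association" />
</ANNOTATION_DOCUMENT>"""

for problem in find_stereotype_conflicts(SAMPLE):
    print(problem)
```

Other constraints that require reference annotations could be added to `ref_only`; only Symbolic_Association is confirmed by this thread.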

All of this to say: I think the code should simply raise an exception when an aligned annotation is added to a tier whose linguistic type has a constraint that requires reference annotations (the two cannot be mixed in a tier). I will submit a PR to that effect, since I looked over both 2.8 and 3.0 and the changes between them don't seem likely to cause something like this. What do you think, @dopefishh?
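The kind of guard I have in mind could look roughly like this. It is a standalone sketch, not the actual PR: the exception class, the function name, and the constant are hypothetical, and only Symbolic_Association is listed because that is the constraint seen in this issue.

```python
# Constraints assumed to allow reference annotations only (hypothetical constant).
REF_ONLY_CONSTRAINTS = {"Symbolic_Association"}


class AlignedAnnotationNotAllowed(ValueError):
    """Raised when an aligned annotation is added to a reference-only tier."""


def check_aligned_allowed(tier_id, constraint):
    """Raise if the tier's linguistic type constraint forbids aligned annotations."""
    if constraint in REF_ONLY_CONSTRAINTS:
        raise AlignedAnnotationNotAllowed(
            f'tier "{tier_id}" has constraint "{constraint}" and only accepts '
            "reference annotations"
        )


# An unconstrained tier passes silently.
check_aligned_allowed("code", None)

# A constrained tier is rejected before anything is written to the file.
try:
    check_aligned_allowed("code", "Symbolic_Association")
except AlignedAnnotationNotAllowed as err:
    print(err)
```

Failing early like this would surface the problem in the script instead of in a file that ELAN later rewrites with empty time slot references.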

And @macramole, if you specify a linguistic type that allows for aligned annotations in your call to the add_tier function, I believe the code should work.

dopefishh commented 4 years ago

@macramole Can you verify that it works now with the merged MR?

macramole commented 4 years ago

Yes, I will report back here. What should I put as "linguistic type" so that it works the way I want?

sarpu commented 4 years ago

First create a linguistic type with eaf.add_linguistic_type('custom') (custom is an arbitrary name I picked; it can be anything you want). Then change each add_tier call to pass that type, e.g. eaf.add_tier('code', ling='custom'), using whatever name you chose in place of custom.
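Put together, the change might look like the helper below. This is only a sketch: add_aligned_tiers is a hypothetical helper name, and RecordingEaf is a stand-in that merely records calls so the snippet runs without pympi installed. With pympi you would pass the object returned by pympi.Eaf(fullPath) instead.

```python
def add_aligned_tiers(eaf, tier_ids, ling="custom"):
    """Create linguistic type `ling`, then add every tier in `tier_ids` with it."""
    eaf.add_linguistic_type(ling)
    for tier_id in tier_ids:
        eaf.add_tier(tier_id, ling=ling)


class RecordingEaf:
    """Hypothetical stand-in for pympi.Eaf that only records the calls it receives."""

    def __init__(self):
        self.linguistic_types = []
        self.tiers = {}

    def add_linguistic_type(self, lingtype):
        self.linguistic_types.append(lingtype)

    def add_tier(self, tier_id, ling=None):
        self.tiers[tier_id] = ling


demo = RecordingEaf()
add_aligned_tiers(demo, ["code", "code_num", "on_off", "context", "note"])
print(demo.tiers)
```

The add_annotation calls from the original script can then stay as they are, since the tiers no longer inherit the constrained dependency type.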

sarpu commented 3 years ago

@macramole Did you manage to resolve the issue?