LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

Add EAF controlled vocabulary to metadata #344

Open William-N-Havard opened 2 years ago

William-N-Havard commented 2 years ago

Is your feature request related to a problem? Please describe.

EAF tiers can be assigned a controlled vocabulary: a fixed set of labels, defined by the creator of the EAF file, that the annotators use during the annotation campaign. This ensures that the annotators do not add custom labels, either intentionally or by mistake.

First, when importing annotations belonging to a new type of tier (see issue #343), it would be good to check that every annotation uses a label defined in the controlled vocabulary (it's better to be safe than sorry!).

Second, it would be nice to also import the description of each label of the controlled vocabulary and store it somewhere. This description is stored directly in the EAF file. Storing this description would allow users of the data set to understand the meaning of the codes used during the annotation campaign.

<CONTROLLED_VOCABULARY CV_ID="vcm">
    <DESCRIPTION LANG_REF="und">Simplified subset of infant vocal maturity classes (distinguishing between variegated and non-variegated syllables)</DESCRIPTION>
    <CV_ENTRY_ML CVE_ID="cveid_e7300257-f12a-479f-90f0-c2fefbf99a26">
        <CVE_VALUE DESCRIPTION="Crying" LANG_REF="und">Y</CVE_VALUE>
    </CV_ENTRY_ML>
    <CV_ENTRY_ML CVE_ID="cveid_ae00bfde-d4bb-499e-8c63-81c4459f5b8a">
        <CVE_VALUE DESCRIPTION="Laughing" LANG_REF="und">L</CVE_VALUE>
    </CV_ENTRY_ML>
    <CV_ENTRY_ML CVE_ID="cveid_df01bf24-04f4-4cff-9bc4-ca92a0ca945f">
        <CVE_VALUE DESCRIPTION="Non-canonical non-variegated syllable(s)" LANG_REF="und">A</CVE_VALUE>
    </CV_ENTRY_ML>
    <CV_ENTRY_ML CVE_ID="cveid_8675a2cf-bb35-476c-a602-8b911eb2a845">
        <CVE_VALUE DESCRIPTION="Non-canonical variegated syllable(s)" LANG_REF="und">P</CVE_VALUE>
    </CV_ENTRY_ML>
    <CV_ENTRY_ML CVE_ID="cveid_f1ad7cdd-4916-4914-a59a-a33d0d7052cc">
        <CVE_VALUE DESCRIPTION="Canonical variegated syllable(s)" LANG_REF="und">V</CVE_VALUE>
    </CV_ENTRY_ML>
    <CV_ENTRY_ML CVE_ID="cveid_09a9bb98-31a9-4afd-9ed7-d4fc7af658a6">
        <CVE_VALUE DESCRIPTION="Canonical non-variegated syllable(s)" LANG_REF="und">W</CVE_VALUE>
    </CV_ENTRY_ML>
    <CV_ENTRY_ML CVE_ID="cveid_ee07af47-c822-4fb3-80d3-d842d80272b7">
        <CVE_VALUE DESCRIPTION="Uncertain" LANG_REF="und">U</CVE_VALUE>
    </CV_ENTRY_ML>
</CONTROLLED_VOCABULARY>
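
For illustration, the labels and their descriptions could be read out of such a block with the standard library alone. This is only a rough sketch: the function name `read_controlled_vocabularies` is made up, the sample is a trimmed copy of the snippet above wrapped in a bare root element, and it only handles the `CV_ENTRY_ML` / `CVE_VALUE` layout shown here (not externally referenced vocabularies).

```python
import xml.etree.ElementTree as ET


def read_controlled_vocabularies(root):
    """Map each CV_ID to a {label: description} dict.

    Hypothetical helper, not part of ChildProject.
    """
    vocabularies = {}
    for cv in root.iter("CONTROLLED_VOCABULARY"):
        entries = {}
        for value in cv.iter("CVE_VALUE"):
            label = (value.text or "").strip()
            # The description is stored as an attribute of CVE_VALUE.
            entries[label] = value.get("DESCRIPTION", "")
        vocabularies[cv.get("CV_ID")] = entries
    return vocabularies


# Trimmed copy of the vocabulary above (two entries only).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <CONTROLLED_VOCABULARY CV_ID="vcm">
    <CV_ENTRY_ML CVE_ID="cveid_e7300257-f12a-479f-90f0-c2fefbf99a26">
      <CVE_VALUE DESCRIPTION="Crying" LANG_REF="und">Y</CVE_VALUE>
    </CV_ENTRY_ML>
    <CV_ENTRY_ML CVE_ID="cveid_ae00bfde-d4bb-499e-8c63-81c4459f5b8a">
      <CVE_VALUE DESCRIPTION="Laughing" LANG_REF="und">L</CVE_VALUE>
    </CV_ENTRY_ML>
  </CONTROLLED_VOCABULARY>
</ANNOTATION_DOCUMENT>"""

vocabularies = read_controlled_vocabularies(ET.fromstring(SAMPLE_EAF))
print(vocabularies)  # {'vcm': {'Y': 'Crying', 'L': 'Laughing'}}
```

A mapping like this could then be written wherever the descriptions end up being stored.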

Describe the solution you'd like

Check the controlled vocabulary when importing an EAF file, and add the descriptions of the controlled vocabulary labels to the metadata.
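
The check itself might look roughly like the sketch below. It assumes the usual EAF layout, where a tier points to a linguistic type through `LINGUISTIC_TYPE_REF` and the linguistic type points to a vocabulary through `CONTROLLED_VOCABULARY_REF`. The function name `find_invalid_labels`, the tier name `vcm@CHI` and the sample file are made up for the example; this is not how ChildProject currently imports EAF files.

```python
import xml.etree.ElementTree as ET


def find_invalid_labels(root):
    """Return (tier_id, label) pairs whose label is not in the tier's vocabulary.

    Hypothetical helper, not part of ChildProject.
    """
    allowed = {
        cv.get("CV_ID"): {(v.text or "").strip() for v in cv.iter("CVE_VALUE")}
        for cv in root.iter("CONTROLLED_VOCABULARY")
    }
    type_to_cv = {
        lt.get("LINGUISTIC_TYPE_ID"): lt.get("CONTROLLED_VOCABULARY_REF")
        for lt in root.iter("LINGUISTIC_TYPE")
    }
    invalid = []
    for tier in root.iter("TIER"):
        cv_id = type_to_cv.get(tier.get("LINGUISTIC_TYPE_REF"))
        if cv_id not in allowed:
            continue  # free-text tier, or vocabulary not defined in this file
        for value in tier.iter("ANNOTATION_VALUE"):
            label = (value.text or "").strip()
            if label not in allowed[cv_id]:
                invalid.append((tier.get("TIER_ID"), label))
    return invalid


# Made-up minimal file: one tier with a valid label (Y) and an invalid one (X).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="vcm@CHI" LINGUISTIC_TYPE_REF="VCM">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>Y</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>X</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="VCM" CONTROLLED_VOCABULARY_REF="vcm" TIME_ALIGNABLE="true"/>
  <CONTROLLED_VOCABULARY CV_ID="vcm">
    <CV_ENTRY_ML CVE_ID="cveid_e7300257-f12a-479f-90f0-c2fefbf99a26">
      <CVE_VALUE DESCRIPTION="Crying" LANG_REF="und">Y</CVE_VALUE>
    </CV_ENTRY_ML>
  </CONTROLLED_VOCABULARY>
</ANNOTATION_DOCUMENT>"""

invalid = find_invalid_labels(ET.fromstring(SAMPLE_EAF))
print(invalid)  # [('vcm@CHI', 'X')]
```

Whether an invalid label should abort the import or only raise a warning is left open here; note that this sketch also reports empty annotations on a vocabulary-constrained tier.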

marianne-m commented 2 years ago

For the second part, where do you think we should store the description?

William-N-Havard commented 2 years ago

Good question! I'm not sure where it would be best to store them. I see two options:

where EAF is the name of the directory containing the EAF files for which we want to store the controlled vocabularies (there can be more than one in a single EAF file). I usually prefer to have all the metadata stored in the same place, so I'd personally go for the second option.