
ACTER Annotated Corpora for Term Extraction Research, version 1.5

ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).

Readme structure:

  1. General
  2. Abbreviations
  3. Data Structure
  4. Annotations
  5. Additional Information
  6. Updates
  7. Error Reporting
  8. License

1. General

2. Abbreviations

Languages and domains:

  - en = English, fr = French, nl = Dutch
  - corp = corruption, equi = dressage (equitation), htfl = heart failure, wind = wind energy

Annotation labels:

  - Specific Terms, Common Terms, OOD (out-of-domain) Terms, and Named Entities (see the tables in section 5.3)

3. Data Structure

ACTER
├── README.md
├── sources.txt
│
├── en
│   ├── corp
│   │   ├── annotated
│   │   │   ├── annotations
│   │   │   │   ├── sequential_annotations
│   │   │   │   │   ├── io_annotations
│   │   │   │   │   │   ├── with_named_entities
│   │   │   │   │   │   │   ├── corp_en_01_seq_terms_nes.tsv
│   │   │   │   │   │   │   ├── corp_en_02_seq_terms_nes.tsv
│   │   │   │   │   │   │   └── ...
│   │   │   │   │   │   │
│   │   │   │   │   │   └── without_named_entities
│   │   │   │   │   │       ├── corp_en_01_seq_terms.tsv
│   │   │   │   │   │       ├── corp_en_02_seq_terms.tsv
│   │   │   │   │   │       └── ...
│   │   │   │   │   │   
│   │   │   │   │   └── iob_annotations (equivalent to io_annotations)
│   │   │   │   │
│   │   │   │   └── unique_annotation_lists
│   │   │   │       ├── corp_en_terms.tsv
│   │   │   │       ├── corp_en_terms_nes.tsv
│   │   │   │       ├── corp_en_tokenised_terms.tsv
│   │   │   │       └── corp_en_tokenised_terms_nes.tsv
│   │   │   │
│   │   │   ├── texts
│   │   │   └── texts_tokenised
│   │   │ 
│   │   └── unannotated_texts
│   │       ├── corp_en_03.txt
│   │       ├── corp_en_13.txt
│   │       └── ...
│   │
│   ├── equi (equivalent to "corp")
│   │
│   ├── htfl (equivalent to "corp")
│   │
│   └── wind (equivalent to "corp")
│
├── fr (equivalent to "en")
└── nl (equivalent to "en")
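As a quick illustration of how the sequential annotation files might be loaded: the sketch below assumes one tab-separated token/label pair per line, with blank lines between sentences (a common convention for such files; the exact layout should be verified against the data itself).

```python
# Minimal sketch of a loader for the sequential annotation files.
# Assumes (not verified here) one "token<TAB>label" pair per line,
# with blank lines separating sentences.
def load_sequential_annotations(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:  # blank line = sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, label = line.split("\t")
            current.append((token, label))
    if current:  # flush the last sentence
        sentences.append(current)
    return sentences
```

Each sentence is returned as a list of (token, label) pairs, which maps directly onto the IO/IOB schemes described in section 4.2.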

4. Annotations

4.1 General

The annotations are provided in simple UTF-8 encoded plain text files. No lemmatisation was performed.

4.2 Sequential annotations

4.2.1 Reference

For an in-depth review of how the sequential labels were obtained and how they relate to the list-versions of the annotations, please check:

Rigouts Terryn, A., Hoste, V., & Lefever, E. (2022). Tagging Terms in Text: A Supervised Sequential Labelling Approach to Automatic Term Extraction. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 28(1). https://doi.org/10.1075/term.21010.rig

4.2.2 General

4.2.3 IOB versus IO

IOB (Inside, Outside, Beginning): the first token of any annotation is labelled "B" and each subsequent token of the same annotation is labelled "I". Tokens that are not part of any annotation are labelled "O".

IO (Inside, Outside): same as IOB but with no distinction between the first and subsequent tokens of an annotation.

Impact: binary labelling (IO) is easier to model, so it typically yields higher F1-scores, but it loses detail when annotations are adjacent. For instance, if "diabetic patients" occurs and both "diabetic" and "patients" are annotated separately, while "diabetic patients" itself is not annotated as a term, this can be encoded accurately with IOB labels ("diabetic[B] patients[B]"). With the binary IO scheme, it becomes "diabetic[I] patients[I]", which is indistinguishable from "diabetic patients" being annotated as a single term rather than as two separate ones.

For a more detailed analysis of the difference, see the paper cited in 4.2.1.
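The information loss can be demonstrated in a few lines of Python (an illustration of the two labelling schemes, not code from the dataset):

```python
# Collapsing IOB labels to IO: "B" and "I" both become "I".
def iob_to_io(labels):
    return ["I" if label in ("B", "I") else "O" for label in labels]

# Two adjacent single-token terms ("diabetic" and "patients") are
# distinguishable in IOB but not after conversion to IO.
two_terms = ["B", "B"]  # "diabetic" and "patients" annotated separately
one_term = ["B", "I"]   # "diabetic patients" annotated as one term

assert iob_to_io(two_terms) == iob_to_io(one_term) == ["I", "I"]
```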

4.3 Unique annotation lists

4.3.1 General

4.3.2 Labels

More details on the annotation labels are provided in the main publication accompanying this dataset.

Overview with examples in the domain of heart failure:

4.3.3 Tokenised annotations

Tokenised annotations have a space between each token and are mostly identical to the original annotations, except that they include only those annotations that can be mapped to complete tokens. An annotation that never aligns with token boundaries is excluded. The differences are minor (see also 5.3 Number of annotations per corpus), but it is important to state which of the two versions of the data is used.
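The alignment condition can be illustrated with a small, hypothetical check on space-separated token strings (not the scripts actually used to build the dataset):

```python
# Does a tokenised annotation align with complete tokens of a tokenised
# sentence? Both arguments are plain space-separated token strings.
def maps_to_complete_tokens(annotation, sentence):
    ann_tokens = annotation.split()
    sent_tokens = sentence.split()
    n = len(ann_tokens)
    # True if the annotation's token sequence occurs anywhere in the sentence
    return any(sent_tokens[i:i + n] == ann_tokens
               for i in range(len(sent_tokens) - n + 1))

sentence = "patients with heart - failure"
assert maps_to_complete_tokens("heart - failure", sentence)
# A hyphenated form that the tokeniser split apart never aligns:
assert not maps_to_complete_tokens("heart-failure", sentence)
```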

5. Additional Information

5.1 Websites

5.2 Publications

The dataset has been updated since the publication of the two papers above. Those papers also discuss aspects of the data that have not yet been made available, such as cross-lingual annotations and information on the spans of the annotations.

5.3 Number of annotations per corpus

5.3.1 Explanation of differences in numbers

5.3.2 Original annotations, with Named Entities

path: language/domain/annotated/annotations/unique_annotation_lists/domain_language_terms_nes.tsv

18,928 Annotations

| Domain | Language | Specific Terms | Common Terms | OOD Terms | Named Entities | Total |
|--------|----------|----------------|--------------|-----------|----------------|-------|
| corp | en | 278 | 642 | 6 | 247 | 1173 |
| corp | fr | 298 | 675 | 5 | 229 | 1207 |
| corp | nl | 310 | 730 | 6 | 249 | 1295 |
| equi | en | 777 | 309 | 69 | 420 | 1575 |
| equi | fr | 701 | 234 | 26 | 220 | 1181 |
| equi | nl | 1021 | 330 | 41 | 152 | 1544 |
| htfl | en | 1883 | 319 | 157 | 222 | 2581 |
| htfl | fr | 1684 | 487 | 57 | 146 | 2374 |
| htfl | nl | 1559 | 449 | 66 | 180 | 2254 |
| wind | en | 781 | 296 | 14 | 440 | 1531 |
| wind | fr | 444 | 308 | 21 | 195 | 968 |
| wind | nl | 577 | 342 | 21 | 305 | 1245 |
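Per-label counts like those in the table above can be recomputed from the unique annotation lists. The sketch below assumes (without having verified the files) one tab-separated term/label pair per line:

```python
from collections import Counter

# Count how many annotations carry each label in a unique annotation list,
# assuming one "term<TAB>label" entry per line (an assumption of this sketch).
def count_labels(path):
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                term, label = line.rstrip("\n").split("\t")
                counts[label] += 1
    return counts
```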

5.3.3 Original annotations, without Named Entities

path: language/domain/annotated/annotations/unique_annotation_lists/domain_language_terms.tsv

15,929 Annotations

| Domain | Language | Specific Terms | Common Terms | OOD Terms | Total |
|--------|----------|----------------|--------------|-----------|-------|
| corp | en | 278 | 643 | 6 | 927 |
| corp | fr | 298 | 676 | 5 | 979 |
| corp | nl | 310 | 731 | 6 | 1047 |
| equi | en | 777 | 309 | 69 | 1155 |
| equi | fr | 701 | 234 | 26 | 961 |
| equi | nl | 1022 | 330 | 41 | 1393 |
| htfl | en | 1884 | 319 | 158 | 2361 |
| htfl | fr | 1684 | 487 | 57 | 2228 |
| htfl | nl | 1559 | 449 | 66 | 2074 |
| wind | en | 781 | 296 | 14 | 1091 |
| wind | fr | 444 | 308 | 21 | 773 |
| wind | nl | 577 | 342 | 21 | 940 |

5.3.4 Tokenised annotations, with Named Entities

path: language/domain/annotated/annotations/unique_annotation_lists/domain_language_tokenised_terms_nes.tsv

18,797 Annotations

| Domain | Language | Specific Terms | Common Terms | OOD Terms | Named Entities | Total |
|--------|----------|----------------|--------------|-----------|----------------|-------|
| corp | en | 278 | 641 | 6 | 247 | 1172 |
| corp | fr | 298 | 675 | 5 | 229 | 1207 |
| corp | nl | 308 | 726 | 6 | 249 | 1287 |
| equi | en | 769 | 309 | 68 | 420 | 1561 |
| equi | fr | 697 | 234 | 26 | 220 | 1176 |
| equi | nl | 1020 | 329 | 41 | 152 | 1541 |
| htfl | en | 1864 | 316 | 157 | 222 | 2556 |
| htfl | fr | 1671 | 486 | 57 | 146 | 2357 |
| htfl | nl | 1535 | 447 | 65 | 180 | 2215 |
| wind | en | 784 | 295 | 13 | 440 | 1529 |
| wind | fr | 443 | 308 | 21 | 195 | 967 |
| wind | nl | 571 | 338 | 21 | 305 | 1229 |

5.3.5 Tokenised annotations, without Named Entities

path: language/domain/annotated/annotations/unique_annotation_lists/domain_language_tokenised_terms.tsv

15,834 Annotations

| Domain | Language | Specific Terms | Common Terms | OOD Terms | Total |
|--------|----------|----------------|--------------|-----------|-------|
| corp | en | 278 | 642 | 6 | 926 |
| corp | fr | 298 | 676 | 5 | 979 |
| corp | nl | 308 | 727 | 6 | 1041 |
| equi | en | 769 | 309 | 68 | 1146 |
| equi | fr | 697 | 234 | 26 | 957 |
| equi | nl | 1021 | 329 | 41 | 1391 |
| htfl | en | 1865 | 316 | 158 | 2339 |
| htfl | fr | 1671 | 486 | 57 | 2214 |
| htfl | nl | 1535 | 447 | 65 | 2047 |
| wind | en | 784 | 295 | 13 | 1092 |
| wind | fr | 443 | 308 | 21 | 772 |
| wind | nl | 571 | 338 | 21 | 930 |

5.4 Corpus counts (only annotated parts of corpus)

| Domain | Language | # files | # sentences | # tokens (excl. EOS) | # tokens (incl. EOS) |
|--------|----------|---------|-------------|----------------------|----------------------|
| corp | en | 12 | 2002 | 52,847 | 54,849 |
| corp | fr | 12 | 1977 | 61,107 | 63,084 |
| corp | nl | 12 | 1988 | 54,233 | 56,221 |
| equi | en | 34 | 3090 | 61,293 | 64,383 |
| equi | fr | 78 | 2809 | 63,870 | 66,679 |
| equi | nl | 65 | 3669 | 60,119 | 63,788 |
| htfl | en | 190 | 2432 | 57,899 | 60,331 |
| htfl | fr | 210 | 2177 | 57,204 | 59,381 |
| htfl | nl | 174 | 2880 | 57,846 | 60,726 |
| wind | en | 5 | 6638 | 64,404 | 71,042 |
| wind | fr | 2 | 4770 | 69,759 | 74,529 |
| wind | nl | 8 | 3356 | 58,684 | 62,040 |

5.6 Normalisation

The following normalisation procedures are applied to all available versions of the data:

  1. Unicode normalisation (NFC) with the "unicodedata" Python package, to avoid encoding issues

    import unicodedata

    normalised_text = unicodedata.normalize("NFC", text_string_to_normalise)
  2. Make sure all dashes and quotes use the same characters

    dashes = ["-", "−", "‐"]
    double_quotes = ['"', '“', '”', '„']
    single_quotes = ["'", "`", "´", "’", "‘"]
    
    # fix two-character quotes
    for double_quote in [',,', "''", "‘’", "’’"]:
        if double_quote in text_string_to_normalise:
            text_string_to_normalise = text_string_to_normalise.replace(double_quote, '"')
    
    # fix single-character dashes and quotes
    normalised_text = ""
    for char in text_string_to_normalise:
        if char in dashes:
            normalised_text += "-"
        elif char in double_quotes:
            normalised_text += '"'
        elif char in single_quotes:
            normalised_text += "'"
        else:
            normalised_text += char
  3. Replace a specific accented capital I ("İ") which is not handled well by lowercasing

    normalised_text = text_string_to_normalise.replace("İ", "I")
  4. Remove very specific and rare special characters which cause problems with the Transformers library

    problem_chars = ["", "", "", "", "œ"]
    for problem_char in problem_chars:
        text_string_to_normalise = text_string_to_normalise.replace(problem_char, "")
    normalised_text = text_string_to_normalise
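Taken together, the steps above can be sketched as a single function. This is an illustration, not the project's actual preprocessing script; the problem characters from step 4 that cannot be rendered in this README are represented here by "œ" only.

```python
import unicodedata

# Character sets mirroring the lists in step 2 above.
DASHES = {"-", "−", "‐"}
DOUBLE_QUOTES = {'"', "“", "”", "„"}
SINGLE_QUOTES = {"'", "`", "´", "’", "‘"}
# Placeholder for step 4: the other rare characters are not renderable here.
PROBLEM_CHARS = ["œ"]

def normalise(text):
    # step 1: NFC Unicode normalisation
    text = unicodedata.normalize("NFC", text)
    # step 2a: two-character quotes become a single double quote
    for double_quote in [",,", "''", "‘’", "’’"]:
        text = text.replace(double_quote, '"')
    # step 2b: unify single-character dashes and quotes
    chars = []
    for char in text:
        if char in DASHES:
            chars.append("-")
        elif char in DOUBLE_QUOTES:
            chars.append('"')
        elif char in SINGLE_QUOTES:
            chars.append("'")
        else:
            chars.append(char)
    text = "".join(chars)
    # step 3: dotted capital I
    text = text.replace("İ", "I")
    # step 4: strip rare problem characters
    for problem_char in PROBLEM_CHARS:
        text = text.replace(problem_char, "")
    return text
```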

6. Updates

Changes version 1.0 > version 1.1

Changes version 1.1 > version 1.2

Changes version 1.2 > version 1.3

Changes version 1.3 > version 1.4

Changes version 1.4 > version 1.5

Few changes to the annotations themselves, but a major update to how the annotations are presented:

7. Error Reporting

The ACTER dataset is an ongoing project, so we are always looking to improve the data. Any questions or issues regarding this dataset may be reported via the GitHub repository at https://github.com/AylaRT/ACTER and will be addressed as soon as possible.

8. License

The data can be freely used and adapted for non-commercial purposes, provided the above-mentioned paper is cited and any changes made to the data are clearly stated.