ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).
Readme structure:
Languages and domains:
Annotation labels:
ACTER
├── README.md
├── sources.txt
│
├── en
│ ├── corp
│ │ ├── annotated
│ │ │ ├── annotations
│ │ │ │ ├── sequential_annotations
│ │ │ │ │ ├── io_annotations
│ │ │ │ │ │ ├── with_named_entities
│ │ │ │ │ │ │ ├── corp_en_01_seq_terms_nes.tsv
│ │ │ │ │ │ │ ├── corp_en_02_seq_terms_nes.tsv
│ │ │ │ │ │ │ └── ...
│ │ │ │ │ │ │
│ │ │ │ │ │ └── without_named_entities
│ │ │ │ │ │ ├── corp_en_01_seq_terms.tsv
│ │ │ │ │ │ ├── corp_en_02_seq_terms.tsv
│ │ │ │ │ │ └── ...
│ │ │ │ │ │
│ │ │ │ │ └── iob_annotations (equivalent to io_annotations)
│ │ │ │ │
│ │ │ │ └── unique_annotation_lists
│ │ │ │ ├── corp_en_terms.tsv
│ │ │ │ ├── corp_en_terms_nes.tsv
│ │ │ │ ├── corp_en_tokenised_terms.tsv
│ │ │ │ └── corp_en_tokenised_terms_nes.tsv
│ │ │ │
│ │ │ ├── texts
│ │ │ └── texts_tokenised
│ │ │
│ │ └── unannotated_texts
│ │ ├── corp_en_03.txt
│ │ ├── corp_en_13.txt
│ │ └── ...
│ │
│ ├── equi (equivalent to "corp")
│ │
│ ├── htfl (equivalent to "corp")
│ │
│ └── wind (equivalent to "corp")
│
├── fr (equivalent to "en")
└── nl (equivalent to "en")
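Because the layout is identical for every language and domain, file locations can be built programmatically. The following is a minimal sketch, assuming the dataset was downloaded to a local directory (`root`); the function names and parameters are illustrative and not part of the dataset.

```python
from pathlib import Path

LANGUAGES = ("en", "fr", "nl")
DOMAINS = ("corp", "equi", "htfl", "wind")


def _check(language, domain):
    """Reject language or domain codes that do not occur in the tree above."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")


def sequential_annotation_dir(root, language, domain, scheme="iob", with_nes=True):
    """Directory with the sequential annotation files for one corpus."""
    _check(language, domain)
    if scheme not in ("io", "iob"):
        raise ValueError(f"unknown scheme: {scheme!r}")
    ne_dir = "with_named_entities" if with_nes else "without_named_entities"
    return (Path(root) / language / domain / "annotated" / "annotations"
            / "sequential_annotations" / f"{scheme}_annotations" / ne_dir)


def unique_annotation_list(root, language, domain, with_nes=True, tokenised=False):
    """Path of one of the four unique annotation lists for one corpus."""
    _check(language, domain)
    name = f"{domain}_{language}"
    if tokenised:
        name += "_tokenised"
    name += "_terms"
    if with_nes:
        name += "_nes"
    return (Path(root) / language / domain / "annotated" / "annotations"
            / "unique_annotation_lists" / f"{name}.tsv")
```

For example, `unique_annotation_list("ACTER", "en", "corp", tokenised=True)` points to `corp_en_tokenised_terms_nes.tsv` inside `ACTER/en/corp/annotated/annotations/unique_annotation_lists`.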
README.md, sources.txt
At the first level, there are two files with information about the dataset: the current README.md file and sources.txt, which mentions the sources of all texts in the dataset.
languages and language/domains
At the first level, there is also one directory per language with an identical structure of subdirectories and files for each language. At the second level, there are four directories, i.e., one per domain, each with an identical structure of subdirectories and files. The corpora in each domain are comparable per language (i.e., similar size, topic, style). Only the corruption (corp) corpus is parallel, i.e., the texts in the different languages are translations.
language/domain/unannotated_texts
Per domain, there are annotated and unannotated texts. For the unannotated texts, only the original (normalised) texts themselves are offered as .txt-files.
language/domain/annotated
For the annotated texts, many types of information are available, ordered in subdirectories.
language/domain/annotated/annotations
The annotations can be found here, ordered in subdirectories for different formats of the data.
language/domain/annotated/texts and language/domain/annotated/texts_tokenised
The texts of the annotated corpora can be found here, with the original (normalised) texts and the (normalised) tokenised texts in different directories. The texts were tokenised with LeTs PreProcess*, with one sentence per line and spaces between all tokens.
language/domain/annotated/annotations/sequential_annotations
Sequential annotations always have one token per line, followed by a tab and a sequential label (more info in next section). There are empty lines between sentences.
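The format described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function name is hypothetical and the sample lines are invented for illustration.

```python
def read_sequential_annotations(lines):
    """Group (token, label) pairs into sentences.

    `lines` is any iterable of strings, e.g. an open file handle.
    Empty lines mark sentence boundaries.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Sentence boundary: store the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last tab so that the label is always the final field.
        token, label = line.rsplit("\t", 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


# Invented sample in the described format (token, tab, label).
sample = ["heart\tB\n", "failure\tI\n", "is\tO\n", "\n", "patients\tB\n"]
sentences = read_sequential_annotations(sample)
```

With a real file, the same function can be called on a handle opened with `open(path, encoding="utf-8")`.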
language/domain/annotated/annotations/unique_annotation_lists
Lists of all unique annotations (lowercased, unlemmatised) for the entire corpus (language-domain), with one annotation per line, followed by a tab and its label (Specific_Term, Common_Term, OOD_Term, or Named Entity).
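A list in this format can be loaded into a dictionary, and counting the labels gives a per-label overview of the corpus. The sketch below assumes exactly one tab per line; the function name and the sample lines are illustrative only.

```python
from collections import Counter


def read_unique_annotations(lines):
    """Map each annotation to its label; `lines` is an iterable of strings."""
    annotations = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines, e.g. at the end of the file
        annotation, label = line.rsplit("\t", 1)
        annotations[annotation] = label
    return annotations


# Invented sample lines in the described format (annotation, tab, label).
sample = [
    "heart failure\tSpecific_Term\n",
    "patient\tCommon_Term\n",
    "p-value\tOOD_Term\n",
]
annotations = read_unique_annotations(sample)
label_counts = Counter(annotations.values())
```

Since the files are UTF-8 encoded, a real list would be opened with `open(path, encoding="utf-8")` before being passed to the function.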
The annotations are provided in simple UTF-8 encoded plain text files. No lemmatisation was performed.
For an in-depth review of how the sequential labels were obtained and how they relate to the list-versions of the annotations, please check:
Rigouts Terryn, A., Hoste, V., & Lefever, E. (2022). Tagging Terms in Text: A Supervised Sequential Labelling Approach to Automatic Term Extraction. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 28(1). https://doi.org/10.1075/term.21010.rig
IOB (Inside, Outside, Beginning): the first token of any annotation gets labelled "B" and each subsequent token of the same annotation gets labelled "I". Tokens that are not part of any annotation are "O".
IO (Inside, Outside): same as IOB but with no distinction between the first and subsequent tokens of an annotation.
Impact: binary labelling (IO) is easier to model, so it generally obtains higher F1-scores, but it loses some detail when annotations are adjacent. For instance, if "diabetic patients" occurs and both "diabetic" and "patients" are annotated separately, but "diabetic patients" is not annotated as a term, then this can be accurately encoded with IOB labels ("diabetic[B] patients[B]"). With the binary IO scheme, this becomes "diabetic[I] patients[I]", which is identical to the encoding that would result if "diabetic patients" were annotated as a single term instead of as two separate ones.
For a more detailed analysis of the difference, see the paper cited in 4.2.1.
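The loss of detail can be made concrete with a small sketch that converts IOB labels to IO labels and decodes both into token spans. The helper names are hypothetical, and the decoding is one straightforward reading of the two schemes rather than the procedure used to create the dataset.

```python
def iob_to_io(labels):
    """Collapse the B/I distinction: every "B" becomes "I"."""
    return ["I" if label == "B" else label for label in labels]


def decode_spans(labels):
    """Return (start, end) token spans, end exclusive, from IOB or IO labels."""
    spans, start = [], None
    for index, label in enumerate(labels):
        if label == "B" or (label == "I" and start is None):
            # A "B", or an "I" that follows an "O", opens a new span.
            if start is not None:
                spans.append((start, index))
            start = index
        elif label == "O" and start is not None:
            spans.append((start, index))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


tokens = ["diabetic", "patients"]
iob = ["B", "B"]                       # two separate one-token annotations
io = iob_to_io(iob)                    # ["I", "I"]
print(decode_spans(iob))               # [(0, 1), (1, 2)]
print(decode_spans(io))                # [(0, 2)]
```

The IOB labels recover the two separate annotations, whereas the IO labels can only be read as one two-token annotation.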
More details on the annotation labels are provided in the main publication accompanying this dataset.
Overview with examples in the domain of heart failure:
Tokenised annotations have a space between each token and are mostly identical to the original annotations, except that they only include those annotations that can be mapped to complete tokens. When an annotation never aligns with token boundaries, it is not included. The differences are minor (see also 5.3 Number of annotations per corpus), but it is important to mention which of the two versions of the data is used.
The dataset has been updated since the publication of the former two papers. These papers also discuss aspects of the data which have not been made available yet, such as cross-lingual annotations and information on the span of the annotations.
path: language/domain/annotated/annotations/unique_annotation_lists/domain_language_terms_nes.tsv
18,928 Annotations
Domain | Language | Specific Terms | Common Terms | OOD Terms | Named Entities | Total |
---|---|---|---|---|---|---|
corp | en | 278 | 642 | 6 | 247 | 1173 |
corp | fr | 298 | 675 | 5 | 229 | 1207 |
corp | nl | 310 | 730 | 6 | 249 | 1295 |
equi | en | 777 | 309 | 69 | 420 | 1575 |
equi | fr | 701 | 234 | 26 | 220 | 1181 |
equi | nl | 1021 | 330 | 41 | 152 | 1544 |
htfl | en | 1883 | 319 | 157 | 222 | 2581 |
htfl | fr | 1684 | 487 | 57 | 146 | 2374 |
htfl | nl | 1559 | 449 | 66 | 180 | 2254 |
wind | en | 781 | 296 | 14 | 440 | 1531 |
wind | fr | 444 | 308 | 21 | 195 | 968 |
wind | nl | 577 | 342 | 21 | 305 | 1245 |
path: language/domain/annotated/annotations/unique_annotation_lists/domain_language_terms.tsv
15,929 Annotations
Domain | Language | Specific Terms | Common Terms | OOD Terms | Total |
---|---|---|---|---|---|
corp | en | 278 | 643 | 6 | 927 |
corp | fr | 298 | 676 | 5 | 979 |
corp | nl | 310 | 731 | 6 | 1047 |
equi | en | 777 | 309 | 69 | 1155 |
equi | fr | 701 | 234 | 26 | 961 |
equi | nl | 1022 | 330 | 41 | 1393 |
htfl | en | 1884 | 319 | 158 | 2361 |
htfl | fr | 1684 | 487 | 57 | 2228 |
htfl | nl | 1559 | 449 | 66 | 2074 |
wind | en | 781 | 296 | 14 | 1091 |
wind | fr | 444 | 308 | 21 | 773 |
wind | nl | 577 | 342 | 21 | 940 |
path: language/domain/annotated/annotations/unique_annotation_lists/domain_language_tokenised_terms_nes.tsv
18,797 Annotations
Domain | Language | Specific Terms | Common Terms | OOD Terms | Named Entities | Total |
---|---|---|---|---|---|---|
corp | en | 278 | 641 | 6 | 247 | 1172 |
corp | fr | 298 | 675 | 5 | 229 | 1207 |
corp | nl | 308 | 726 | 6 | 249 | 1287 |
equi | en | 769 | 309 | 68 | 420 | 1561 |
equi | fr | 697 | 234 | 26 | 220 | 1176 |
equi | nl | 1020 | 329 | 41 | 152 | 1541 |
htfl | en | 1864 | 316 | 157 | 222 | 2556 |
htfl | fr | 1671 | 486 | 57 | 146 | 2357 |
htfl | nl | 1535 | 447 | 65 | 180 | 2215 |
wind | en | 784 | 295 | 13 | 440 | 1529 |
wind | fr | 443 | 308 | 21 | 195 | 967 |
wind | nl | 571 | 338 | 21 | 305 | 1229 |
path: language/domain/annotated/annotations/unique_annotation_lists/domain_language_tokenised_terms.tsv
15,834 Annotations
Domain | Language | Specific Terms | Common Terms | OOD Terms | Total |
---|---|---|---|---|---|
corp | en | 278 | 642 | 6 | 926 |
corp | fr | 298 | 676 | 5 | 979 |
corp | nl | 308 | 727 | 6 | 1041 |
equi | en | 769 | 309 | 68 | 1146 |
equi | fr | 697 | 234 | 26 | 957 |
equi | nl | 1021 | 329 | 41 | 1391 |
htfl | en | 1865 | 316 | 158 | 2339 |
htfl | fr | 1671 | 486 | 57 | 2214 |
htfl | nl | 1535 | 447 | 65 | 2047 |
wind | en | 784 | 295 | 13 | 1092 |
wind | fr | 443 | 308 | 21 | 772 |
wind | nl | 571 | 338 | 21 | 930 |
Domain | Language | # files | # sentences | # tokens (excl. EOS) | # tokens (incl. EOS) |
---|---|---|---|---|---|
corp | en | 12 | 2002 | 52,847 | 54,849 |
corp | fr | 12 | 1977 | 61,107 | 63,084 |
corp | nl | 12 | 1988 | 54,233 | 56,221 |
equi | en | 34 | 3090 | 61,293 | 64,383 |
equi | fr | 78 | 2809 | 63,870 | 66,679 |
equi | nl | 65 | 3669 | 60,119 | 63,788 |
htfl | en | 190 | 2432 | 57,899 | 60,331 |
htfl | fr | 210 | 2177 | 57,204 | 59,381 |
htfl | nl | 174 | 2880 | 57,846 | 60,726 |
wind | en | 5 | 6638 | 64,404 | 71,042 |
wind | fr | 2 | 4770 | 69,759 | 74,529 |
wind | nl | 8 | 3356 | 58,684 | 62,040 |
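The two token columns appear to differ by exactly one end-of-sentence (EOS) marker per sentence. This is an inference from the figures, not a documented rule; the short check below confirms it for three rows copied from the table above (the function name is illustrative).

```python
# (sentences, tokens excl. EOS, tokens incl. EOS), copied from the table above
ROWS = {
    ("corp", "en"): (2002, 52_847, 54_849),
    ("equi", "fr"): (2809, 63_870, 66_679),
    ("wind", "nl"): (3356, 58_684, 62_040),
}


def has_one_eos_per_sentence(sentences, tokens_excl, tokens_incl):
    """True if the inclusive count equals the exclusive count plus one EOS per sentence."""
    return tokens_excl + sentences == tokens_incl


for (domain, language), row in ROWS.items():
    print(domain, language, has_one_eos_per_sentence(*row))
```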
The following normalisation procedures are applied to all available versions of the data:
Unicode normalisation (NFC) with the "unicodedata" Python package, to avoid encoding issues
import unicodedata
normalised_text = unicodedata.normalize("NFC", text_string_to_normalise)
Make sure all dashes and quotes use the same characters
dashes = ["-", "−", "‐"]
double_quotes = ['"', '“', '”', '„', "„", "„"]
single_quotes = ["'", "`", "´", "’", "‘", "’"]
# fix double character quotes
for double_quote in [',,', "''", "''", "‘’", "’’"]:
    if double_quote in text_string_to_normalise:
        text_string_to_normalise = text_string_to_normalise.replace(double_quote, '"')
# fix single character dashes and quotes
normalised_text = ""
for char in text_string_to_normalise:
    if char in dashes:
        normalised_text += "-"
    elif char in double_quotes:
        normalised_text += '"'
    elif char in single_quotes:
        normalised_text += "'"
    else:
        normalised_text += char
Replace the capital I with a dot above (İ), which is not handled well by lowercasing
normalised_text = text_string_to_normalise.replace("İ", "I")
Remove very specific and rare special characters which cause problems with the Transformers library
problem_chars = ["", "", "", "", ""]
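# start from the input and remove the characters one by one, keeping earlier removals
normalised_text = text_string_to_normalise
for problem_char in problem_chars:
    normalised_text = normalised_text.replace(problem_char, "")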
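The steps above can be combined into a single function. The following is a sketch under stated assumptions, not the exact script used for the dataset: it uses `str.translate` instead of the character-by-character loop, the function and constant names are hypothetical, and the rare special characters are passed in by the caller (`problem_chars`) rather than hard-coded.

```python
import unicodedata

# Character sets taken from the lists above (duplicates removed).
DASHES = "-−‐"
DOUBLE_QUOTES = '"“”„'
SINGLE_QUOTES = "'`´’‘"
TWO_CHAR_QUOTES = (",,", "''", "‘’", "’’")

# One translation table for all single-character replacements.
_CHAR_MAP = str.maketrans({
    **{char: "-" for char in DASHES},
    **{char: '"' for char in DOUBLE_QUOTES},
    **{char: "'" for char in SINGLE_QUOTES},
})


def normalise(text, problem_chars=()):
    """Apply the normalisation steps in the order listed above."""
    text = unicodedata.normalize("NFC", text)
    # Two-character quotes first, so they are not turned into two single quotes.
    for pair in TWO_CHAR_QUOTES:
        text = text.replace(pair, '"')
    text = text.translate(_CHAR_MAP)
    # U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
    text = text.replace("\u0130", "I")
    for char in problem_chars:
        if char:  # ignore empty strings
            text = text.replace(char, "")
    return text
```

For example, `normalise("“wind” − ‘energy’")` returns `"wind" - 'energy'` with plain ASCII quotes and hyphen.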
Not many changes to the actual annotations, but a major update to how the annotations are presented:
The ACTER dataset is an ongoing project, so we are always looking to improve the data. Any questions or issues regarding this dataset may be reported via the GitHub repository at https://github.com/AylaRT/ACTER and will be addressed as soon as possible.
The data can be freely used and adapted for non-commercial purposes, provided the above-mentioned paper is cited and any changes made to the data are clearly stated.