ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch) and 4 domains (corruption, dressage, heart failure, and wind energy).
Thank you for this great work. For the sake of consistency, I recalculated the number of tokens (excl. EOS) in your corpora. All values match except the token count for the HTFL English dataset, where I get 60858 tokens. I used the following code:
```python
import spacy
from spacy.tokenizer import Tokenizer

# In a notebook, download the models first:
# !python -m spacy download en_core_web_sm
# !python -m spacy download fr_core_news_sm
# !python -m spacy download nl_core_news_sm

...

if lang == "en":
    nlp = spacy.load("en_core_web_sm")
elif lang == "fr":
    nlp = spacy.load("fr_core_news_sm")
else:
    nlp = spacy.load("nl_core_news_sm")

# Replace the default tokenizer with a whitespace-only one,
# so the pre-tokenised texts are not split any further
nlp.tokenizer = Tokenizer(nlp.vocab)

# Count tokens in each corpus
t = 0
for file in texts_dir:
    with open(root_dir + "annotated/texts_tokenised/" + file) as text_file:
        t += len(nlp(text_file.read()))
print(str(t) + " tokens in " + domain + " " + lang)
```
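For anyone without spaCy at hand: since the counting above runs on the pre-tokenised texts with a bare whitespace tokenizer, the token count can be approximated with plain string splitting. A minimal sketch (the `count_tokens` helper and the sample sentence are hypothetical, for illustration only):

```python
def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens in a pre-tokenised text,
    approximating spaCy's Tokenizer(nlp.vocab), which splits on
    whitespace only and leaves pre-tokenised words intact."""
    return len(text.split())

# In a pre-tokenised corpus, punctuation is already a separate token
sample = "Heart failure is a clinical syndrome ."
print(count_tokens(sample))  # 7
```

Running this per file and summing the results should reproduce the totals reported above, up to any differences in how the pre-tokenised files separate tokens.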