AylaRT / ACTER

ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).

Number of Tokens in HTFL English corpus #1

Closed. Gumano closed this issue 2 years ago.

Gumano commented 2 years ago

Thank you for this great work. For the sake of consistency, I decided to recalculate the number of tokens (excluding EOS) in your corpora. All values match except the token count for the HTFL English dataset, where I get 60,858 tokens. I used the following code:

```python
import spacy
from spacy.tokenizer import Tokenizer  # needed for the whitespace tokenizer below

!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm
!python -m spacy download nl_core_news_sm
```

...

```python
if lang == "en":
    nlp = spacy.load("en_core_web_sm")
elif lang == "fr":
    nlp = spacy.load("fr_core_news_sm")
else:
    nlp = spacy.load("nl_core_news_sm")

# Count tokens in each corpus.
# Plain whitespace tokenizer, since the corpus texts are already tokenised.
nlp.tokenizer = Tokenizer(nlp.vocab)

t = 0
for file in texts_dir:
    with open(root_dir + "annotated/texts_tokenised/" + file) as text_file:
        doc = text_file.read()
        tokenized_doc = nlp(doc)
        t = t + len(tokenized_doc)

print(str(t) + " tokens in " + domain + " " + lang)
```
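A minimal sketch of how `root_dir`, `texts_dir`, `domain`, and `lang` could be set up for the loop above; the paths are placeholders assuming a per-language/per-domain folder layout, not the confirmed repository structure:

```python
import os

# Hypothetical setup; adjust to wherever the ACTER data lives locally.
lang = "en"
domain = "htfl"
root_dir = "ACTER/" + lang + "/" + domain + "/"  # assumed layout
texts_dir = os.listdir(root_dir + "annotated/texts_tokenised/")  # corpus file names
```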

Gumano commented 2 years ago

I'm sorry, I found a mistake: the HTFL folder was uploaded incorrectly, so some files were uploaded twice.
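Duplicates like that can be caught automatically, for instance by hashing file contents. A minimal sketch, assuming the texts sit under `annotated/texts_tokenised/` as in the snippet above (`texts_path` is a placeholder):

```python
import hashlib
import os
from collections import defaultdict

texts_path = root_dir + "annotated/texts_tokenised/"  # hypothetical location, as above
seen = defaultdict(list)  # content hash -> file names with that content

for name in sorted(os.listdir(texts_path)):
    with open(os.path.join(texts_path, name), "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    seen[digest].append(name)

for digest, names in seen.items():
    if len(names) > 1:
        print("duplicate content:", ", ".join(names))
```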

AylaRT commented 2 years ago

Thank you for looking into it anyway! Do not hesitate to let me know should you have any other questions.