AylaRT / ACTER

ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).

New labeling regimes for ACTER datasets. #3

Open honghanhh opened 10 months ago

honghanhh commented 10 months ago

Hi @AylaRT, Thank you for contributing the ACTER corpora, which are very valuable for term extraction.

While working on the datasets, we discovered that current token classifiers trained with the BIO annotation regime do not perform well on nested terms. We would therefore like to propose a new annotation regime in which single-word nested terms are also annotated.
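
To illustrate the limitation, here is a minimal sketch of flat BIO labeling. It is not the code from the linked repository, and the sentence, the term list, and the helper names `bio_tags` and `decode_spans` are hypothetical, chosen only for illustration. It assumes that overlapping terms are resolved by keeping the longest match, which means shorter terms nested inside a longer one cannot be recovered from the tags.

```python
def bio_tags(tokens, terms):
    """Assign flat BIO tags, preferring the longest matching term (assumed policy)."""
    term_tuples = sorted((tuple(t.split()) for t in terms), key=len, reverse=True)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for term in term_tuples:
            n = len(term)
            if tuple(tokens[i:i + n]) == term:
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
                i += n  # skip past the matched term, so nested terms are never tagged
                break
        else:
            i += 1  # no term starts at this token
    return tags


def decode_spans(tokens, tags):
    """Rebuild term strings from a flat BIO tag sequence."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hypothetical example, not taken from the ACTER annotations.
tokens = "patients with chronic heart failure".split()
terms = ["chronic heart failure", "heart failure", "heart"]

tags = bio_tags(tokens, terms)
recovered = decode_spans(tokens, tags)
missed = [t for t in terms if t not in recovered]

print(tags)       # ['O', 'O', 'B', 'I', 'I']
print(recovered)  # ['chronic heart failure']
print(missed)     # ['heart failure', 'heart']
```

With one tag per token, only the outermost term survives decoding; the nested terms end up in `missed`.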

Please take a look at the new annotation, which can be seen via this link: https://github.com/honghanhh/nobi_annotation_regime

It would be nice if we could integrate our proposals as the next version of the corpora. Please let us know if you need any further information in advance.

Thanks a lot. Kind regards, Hanh

AylaRT commented 10 months ago

Hi @honghanhh,

Thank you for the kind message and the potential improvement for the dataset! I will definitely add the information for the next version. I cannot guarantee that this will happen very soon because of time constraints, but I will keep you posted.

kind regards, Ayla