clarinsi / babushka-bench

Benchmarking NLP tools on Slovene, Croatian and Serbian

Macedonian dataset #4

Open · stefan-it opened this issue 3 years ago

stefan-it commented 3 years ago

Hi @nljubesi ,

as far as I understand this commit message:

https://github.com/clarinsi/babushka-bench/commit/841c47d5630e1a55cf21659874c5e3af9575b0a6#diff-fd8b5fda8a45abe08c7b3247d4abb7b1395dd3bf6008738388f42ff052bef9fe

The Macedonian dataset comes from the MULTEXT-East "1984" data, but I still have some questions 😅

Many thanks,

Stefan :heart:

nljubesi commented 3 years ago

@stefan-it, hi.

We recently managed to disambiguate the Macedonian "1984" corpus at the level of morphosyntax, but it is still not officially published, as some improvements to the annotation are being made as we speak (and they are not progressing very fast).

We were eager to start experimenting with this notoriously under-resourced language and also wanted to add at least basic support for it to our CLASSLA pipeline, so we performed a train:dev:test split of the preliminary data here on babushka-bench.
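For reference, a minimal sketch of how such a sentence-level split could be reproduced on a CoNLL-U file (an illustrative 80:10:10 split; the file names and ratios are placeholders, not the exact procedure used on babushka-bench):

```python
# Illustrative sentence-level 80:10:10 split of a CoNLL-U file.
# File names and ratios are placeholders, not the actual babushka-bench setup.

def read_conllu_sentences(path):
    """Return a list of sentence blocks (each block is a list of lines)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                current.append(line)
            elif current:
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences

def write_conllu(path, sentences):
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.writelines(sentence)
            f.write("\n")

sentences = read_conllu_sentences("mk_1984.conllu")  # hypothetical input file
n = len(sentences)
write_conllu("train.conllu", sentences[: int(0.8 * n)])
write_conllu("dev.conllu", sentences[int(0.8 * n) : int(0.9 * n)])
write_conllu("test.conllu", sentences[int(0.9 * n) :])
```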

As far as I know, the corpus still does not contain any NE annotations. We would like to add some for Macedonian in general (not sure whether the 1984 corpus is the best fit for this task), inter alia to add Macedonian NER support to CLASSLA. If you happen to know people interested in the task, we have decent annotation guidelines from other South Slavic languages and quite probably funding available as well. I would not mind hearing about your wider motivation regarding Macedonian, as we are eager to improve support for it on all levels. We will also do so in the MaCoCu project, which starts this June (crawling the top-level domains of various South-Eastern European countries, Turkey included, curating / selecting the data, and building pre-trained language models).

Nice work with the dbmdz models, btw; we use them primarily for processing German data. We recently published BERTić, in case you ever need to process Croatian, Serbian etc.

Nikola

stefan-it commented 3 years ago

Hi Nikola,

thanks for your detailed answers!

Sorry for my misunderstanding, the dataset of course has no NE annotations 😅 But talking about NE, you may have noticed that the recent spacy version comes with a (better) support for Macedonian, including a trained model for NER. The author sent me that dataset (see https://twitter.com/_inesmontani/status/1356280197746606099). They plan to release it publicly, so maybe it could also be integrated here for benchmarking.
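For reference, a minimal sketch of how that spaCy pipeline can be tried out (this assumes the `mk_core_news_sm` package has been installed via `python -m spacy download mk_core_news_sm`; whether an NER component is included may depend on the spaCy release):

```python
# Running spaCy's Macedonian pipeline; assumes mk_core_news_sm is installed.
# The included components (e.g. NER) may vary between spaCy releases.
import spacy

nlp = spacy.load("mk_core_news_sm")
doc = nlp("Скопје е главниот град на Северна Македонија.")

for ent in doc.ents:
    print(ent.text, ent.label_)
```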

My colleague and I are working on Macedonian-focused LMs, so we're primarily looking for datasets for our evaluations. E.g. WikiANN as a silver standard is OK, but better datasets are badly needed :)

I just had a look at the BERTić model; the results are looking really good! Have you considered working on an ELECTRA model as well :thinking: For monolingual models, I could clearly see a performance boost (I did a lot of ELECTRA pre-training for our DBMDZ models recently). However, when I tried to train multilingual ELECTRA models (same languages as mBERT), the performance was not really good, so I'm not sure whether the same would hold for 4 languages :thinking:
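Just to illustrate what I mean by the ELECTRA objective, here is a minimal sketch of the replaced-token-detection discriminator using a public English checkpoint (`google/electra-small-discriminator` is used purely for illustration here):

```python
# Replaced-token detection with an ELECTRA discriminator checkpoint.
# The checkpoint "google/electra-small-discriminator" is only an illustration.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

model_id = "google/electra-small-discriminator"
discriminator = ElectraForPreTraining.from_pretrained(model_id)
tokenizer = ElectraTokenizerFast.from_pretrained(model_id)

# A sentence with one token manually "replaced" to mimic the generator's output.
fake_sentence = "The quick brown fox fake over the lazy dog"
inputs = tokenizer(fake_sentence, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits

# Positive logits mark tokens the discriminator considers replaced.
tokens = tokenizer.tokenize(fake_sentence, add_special_tokens=True)
predictions = (logits > 0).long().squeeze().tolist()
print(list(zip(tokens, predictions)))
```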

nljubesi commented 3 years ago

Stefan, hi.

Busy period. I just contacted Borijan. I will motivate him to publish the dataset if it is CC-BY-SA. No need for all the e-mail writing. :-)

What textual data are you using for building the Macedonian LM? I have a ~320M-token crawl of the .mk domain (used for building these static embeddings: https://www.clarin.si/repository/xmlui/handle/11356/1359), if you can profit from that.
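If it helps, a minimal sketch of loading such static embeddings with gensim (this assumes a word2vec-style textual format; the file name is a placeholder, not the actual CLARIN.SI distribution name):

```python
# Loading static word embeddings with gensim.
# Assumes a word2vec-style text format; "embed.mk.txt" is a placeholder name.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embed.mk.txt", binary=False)
print(vectors.most_similar("Скопје", topn=5))
```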

For evaluating the model, I guess part-of-speech tagging on our dataset might be a proper way to go? I would be very interested in hearing details of what exactly you are doing for Macedonian, so we can coordinate efforts as much as possible. We can also switch to e-mail (nikola tod ljubesic ta jsi tod si).
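As a rough illustration, such a part-of-speech evaluation could be scored like this (a minimal sketch comparing the UPOS column of a predicted CoNLL-U file against the gold file; the file names are placeholders and the files are assumed to be token-aligned):

```python
# UPOS accuracy between a gold and a predicted CoNLL-U file.
# File names are placeholders; both files are assumed to be token-aligned.

def upos_tags(path):
    """Yield the UPOS tag (4th column) of every regular token line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip() and not line.startswith("#"):
                cols = line.rstrip("\n").split("\t")
                if cols[0].isdigit():  # skip multi-word tokens and empty nodes
                    yield cols[3]

gold = list(upos_tags("test.conllu"))
pred = list(upos_tags("test.pred.conllu"))
assert len(gold) == len(pred), "token mismatch between gold and predicted files"
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"UPOS accuracy: {accuracy:.4f}")
```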

Regarding BERTić, these are officially four languages, but purely linguistically speaking they are variants of a single pluricentric language. In other words, I did not actually do any multilingual pre-training via ELECTRA. Good to know that your results were not that good, in case the need for multilingual training ever arises!