Add support for new MasakhaNER v2 dataset

stefan-it commented 2 years ago

Hi,

MasakhaNER v2 was recently accepted at EMNLP 2022 and the new dataset is already available online here.

Preprint is available here.

It should be relatively easy to add this dataset.

The existing v1 dataset class takes the following arguments:

https://github.com/flairNLP/flair/blob/8d27a383810455cb45d650e7fe2003384780ae84/flair/datasets/sequence_labeling.py#L2550-L2557

I think we can simply add a `version` parameter that defaults to v1 to ensure backward compatibility?

Then version-dependent logic, such as the available languages and GitHub folder paths, could be added.
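
A minimal sketch of that idea, kept independent of flair so it runs on its own. The function name `resolve_masakhane_paths`, the language subsets, and the folder names are placeholders I made up for illustration; the real lists and paths would have to come from the MasakhaNER repository:

```python
from typing import Dict, List, Union

# Placeholder values only (assumptions, not the real dataset layout):
# each version maps to its own language list and its own data folder.
SUPPORTED_LANGUAGES: Dict[str, List[str]] = {
    "v1": ["hau", "luo", "yor"],
    "v2": ["hau", "yor", "zul"],
}
DATA_FOLDERS: Dict[str, str] = {
    "v1": "masakhane-ner/v1-data",  # hypothetical folder name
    "v2": "masakhane-ner/v2-data",  # hypothetical folder name
}


def resolve_masakhane_paths(
    languages: Union[str, List[str]],
    version: str = "v1",  # defaulting to v1 keeps existing callers working
) -> Dict[str, str]:
    """Map each requested language to its data folder for the given version."""
    if version not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported version: {version!r}")

    available = SUPPORTED_LANGUAGES[version]
    if languages == "all":
        selected = list(available)
    elif isinstance(languages, str):
        selected = [languages]
    else:
        selected = list(languages)

    # Reject languages that do not exist in the chosen version.
    unsupported = [lang for lang in selected if lang not in available]
    if unsupported:
        raise ValueError(f"Languages {unsupported} are not available in {version}")

    return {lang: f"{DATA_FOLDERS[version]}/{lang}" for lang in selected}
```

Existing calls that do not pass `version` keep resolving against the v1 tables, while `version="v2"` switches both the language check and the folder in one place.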

stefan-it commented 2 years ago

This approach would avoid adding a new dataset called e.g. `NER_MASAKHANE_v2`, which would duplicate code. What do you think, @alanakbik? :thinking:

stefan-it commented 2 years ago

@dadelani told me that one language (Luo, v2) is currently missing due to license issues. So 19/20 languages can be used.

Table 1 of the preprint also includes the number of sentences per split for each language, so it should be relatively easy to also include unit tests.
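
Such a test could look roughly like the sketch below. It assumes the data files are CoNLL-style, with one token per line and blank lines between sentences. The helper names are hypothetical, and the toy splits and counts are invented for the example; they are not the numbers from Table 1:

```python
from typing import Dict, List


def count_sentences(conll_text: str) -> int:
    """Count sentences in CoNLL-style text (blank lines separate sentences)."""
    count = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if line.strip():
            if not in_sentence:
                count += 1
                in_sentence = True
        else:
            in_sentence = False
    return count


def check_split_sizes(splits: Dict[str, str], expected: Dict[str, int]) -> List[str]:
    """Return one message per split whose sentence count differs from `expected`."""
    mismatches = []
    for name, want in expected.items():
        got = count_sentences(splits[name])
        if got != want:
            mismatches.append(f"{name}: expected {want}, got {got}")
    return mismatches


# Toy data for illustration only, not taken from the dataset.
TOY_SPLITS = {
    "train": "Kano B-LOC\nis O\n\nAbuja B-LOC\n",
    "dev": "Lagos B-LOC\n",
    "test": "Accra B-LOC\n\nhere O\n",
}
TOY_EXPECTED = {"train": 2, "dev": 1, "test": 2}

assert check_split_sizes(TOY_SPLITS, TOY_EXPECTED) == []
```

In a real test, the expected dictionary would be filled per language from Table 1 and the splits would come from the loaded corpus.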

alanakbik commented 2 years ago

@stefan-it that's great! Yes, I would reuse the old object and add a version parameter. Since v2 will probably be the standard moving forward, perhaps it should be set to v2 by default?

flairNLP / flair

Add support for new MasakhaNER v2 dataset #2971