Closed stefan-it closed 1 year ago
This approach would avoid adding a new dataset (e.g. NER_MASAKHANE_v2) that duplicates code. What do you think, @alanakbik? :thinking:
@dadelani told me that one language (Luo, v2) is currently missing due to license issues. So 19/20 languages can be used.
Table 1 of the preprint also includes the number of sentences per split for each language, so it should be relatively easy to also include unit tests.
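A rough sketch of how such a unit test could compare split sizes, for illustration only. The helper name `split_size_mismatches` is hypothetical, and the numbers in the usage example are placeholders, not values taken from Table 1:

```python
from typing import Dict, List


def split_size_mismatches(actual: Dict[str, int], expected: Dict[str, int]) -> List[str]:
    """Return one message per split whose sentence count differs from the expected one."""
    problems = []
    for split, expected_count in expected.items():
        found = actual.get(split)
        if found != expected_count:
            problems.append(f"{split}: expected {expected_count}, found {found}")
    return problems


# Placeholder counts for one language; real values would come from Table 1.
expected_sizes = {"train": 100, "dev": 10, "test": 20}
loaded_sizes = {"train": 100, "dev": 10, "test": 20}
assert split_size_mismatches(loaded_sizes, expected_sizes) == []
```

In an actual test the `loaded_sizes` dictionary would be filled from the loaded corpus for each language and version.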
@stefan-it that's great! Yes, I would reuse the old object and add a version parameter. Since v2 will probably be the standard moving forward, perhaps it should be the default?
Hi,
MasakhaNER v2 was recently accepted at EMNLP 2022 and the new dataset is already available online here.
The preprint is available here.
It should be relatively easy to add this dataset.
The existing v1 implementation takes the following arguments:
https://github.com/flairNLP/flair/blob/8d27a383810455cb45d650e7fe2003384780ae84/flair/datasets/sequence_labeling.py#L2550-L2557
I think we can simply add a `version` variable and set its default to `v1` to ensure backward compatibility? Then version-dependent logic, such as the available languages and GitHub folder paths, could be added.
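To make the idea concrete, here is a minimal standalone sketch of version-dependent logic. It is not the actual flair implementation: the names `SUPPORTED_LANGUAGES`, `DATA_FOLDERS`, `resolve_languages` and `data_folder` are hypothetical, the language sets are small illustrative subsets rather than the full lists, and the folder strings are placeholders instead of the real GitHub paths.

```python
from typing import Dict, List, Set, Union

# Illustrative subsets only (assumption); the full lists come from the dataset repositories.
SUPPORTED_LANGUAGES: Dict[str, Set[str]] = {
    "v1": {"hau", "luo", "yor"},
    "v2": {"hau", "yor", "zul"},
}

# Placeholder folder names (assumption), not the real GitHub paths.
DATA_FOLDERS: Dict[str, str] = {
    "v1": "<v1-data-folder>",
    "v2": "<v2-data-folder>",
}


def resolve_languages(languages: Union[str, List[str]] = "all", version: str = "v1") -> List[str]:
    """Validate the requested languages against the ones available for a version."""
    if version not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unknown version '{version}', expected one of {sorted(SUPPORTED_LANGUAGES)}")
    available = SUPPORTED_LANGUAGES[version]
    if languages == "all":
        return sorted(available)
    if isinstance(languages, str):
        languages = [languages]
    unknown = [language for language in languages if language not in available]
    if unknown:
        raise ValueError(f"Languages {unknown} are not available in version '{version}'")
    return list(languages)


def data_folder(language: str, version: str = "v1") -> str:
    """Build the version-specific folder for one language."""
    return f"{DATA_FOLDERS[version]}/{language}"
```

With `version="v1"` as the default, existing calls keep their current behaviour, while something like `resolve_languages("zul", version="v2")` opts into the new data.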