flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

[Feature]: Add support for MultiCoNER v2 Dataset #3352

Open stefan-it opened 11 months ago

stefan-it commented 11 months ago

Problem statement

Hi,

there's a new EMNLP 2023 paper that introduces version 2 of the MultiCoNER dataset.

MultiCoNER v2 should also be supported in Flair :hugs:

Solution

The dataset is hosted on the Hugging Face Hub:

https://huggingface.co/datasets/MultiCoNER/multiconer_v2/tree/main

The train, development and test files can also be accessed there, e.g. see the files for German:

https://huggingface.co/datasets/MultiCoNER/multiconer_v2/tree/main/DE-German
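For downloading, the `tree/main` browser links above would need to be turned into direct file links. A minimal sketch, assuming the Hub's usual `resolve/main` URL pattern for raw files; the helper name and the file name `de_train.conll` are only illustrative assumptions and should be checked against the actual folder listing:

```python
# Repository id taken from the dataset link above.
HF_REPO = "MultiCoNER/multiconer_v2"


def multiconer_v2_file_url(folder: str, filename: str) -> str:
    """Build a direct-download URL for one file in the dataset repository.

    `folder` is a language folder such as "DE-German" (from the link above).
    `filename` is a hypothetical file name; verify it on the Hub first.
    """
    return f"https://huggingface.co/datasets/{HF_REPO}/resolve/main/{folder}/{filename}"


# Hypothetical example: the file name is an assumption, not confirmed here.
url = multiconer_v2_file_url("DE-German", "de_train.conll")
print(url)
```

A URL built this way could then be handed to whatever download helper the corpus class already uses.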

Additional Context

We should discuss whether we can extend the existing NER_MULTI_CONER implementation and add a version parameter to it:

https://github.com/flairNLP/flair/blob/ed53c42ec2e8d8abbd07acd7f6b531945ac45606/flair/datasets/sequence_labeling.py#L3048C7-L3055

class NER_MULTI_CONER(MultiFileColumnCorpus):
    def __init__(
        self,
        task: str = "multi",
        version: str = "v1",
        base_path: Optional[Union[str, Path]] = None,
        in_memory: bool = True,
        **corpusargs,
    ) -> None:

The version parameter would then default to "v1" to ensure backward compatibility :thinking:
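To make the idea a bit more concrete, here is a rough sketch of how the version argument could be validated and mapped to a separate data folder per version. This is not the existing Flair code: the helper name, the folder names and the `"datasets"` default root are all hypothetical and only meant for discussion.

```python
from pathlib import Path
from typing import Optional, Union

# Versions the corpus class would accept (assumption for this sketch).
SUPPORTED_VERSIONS = ("v1", "v2")


def resolve_multiconer_folder(
    version: str = "v1",
    base_path: Optional[Union[str, Path]] = None,
) -> Path:
    """Return the folder that holds the data for the requested version.

    Defaulting to "v1" keeps existing calls without a version argument
    pointing at the same folder as before.
    """
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(
            f"Unknown MultiCoNER version '{version}', expected one of {SUPPORTED_VERSIONS}"
        )
    # Hypothetical default root; the real class would use its own cache path.
    root = Path(base_path) if base_path is not None else Path("datasets")
    # Hypothetical folder names: v1 keeps the unsuffixed name, v2 gets a suffix.
    folder = "ner_multi_coner" if version == "v1" else f"ner_multi_coner_{version}"
    return root / folder


print(resolve_multiconer_folder())              # v1 folder (default)
print(resolve_multiconer_folder(version="v2"))  # separate v2 folder
```

Keeping the two versions in separate folders would avoid mixing the v1 and v2 files for the same language, and an unknown version fails early with a clear error.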

alanakbik commented 11 months ago

I agree @stefan-it - that would be great to add!