flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Support for NoisyNER dataset #3463

Open teresaloeffelhardt opened 1 month ago

teresaloeffelhardt commented 1 month ago

Hi,

This PR adds support for the NoisyNER dataset as proposed in this paper and released in this repo:

NoisyNER is a dataset for the evaluation of methods to handle noisy labels when training machine learning models. It is from the NLP/Information Extraction domain and was created through a realistic distant supervision technique. Some highlights and interesting aspects of the data are:

  • Seven sets of labels with differing noise patterns to evaluate different noise levels on the same instances
  • Full parallel clean labels available to compute upper performance bounds or study scenarios where a small amount of gold-standard data can be leveraged
  • Skewed label distribution (typical for Named Entity Recognition tasks)
  • For some label sets: noise level higher than the true label probability
  • Sequential dependencies between the labels
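Because every noisy label set comes with parallel clean labels, the noise level of a label set can be measured directly by comparing the two token by token. The following is a minimal sketch of that comparison, not code from this PR: the function names `noise_rate` and `per_label_noise` are hypothetical, and the label sequences are made-up toy data rather than samples from NoisyNER.

```python
from collections import Counter


def noise_rate(clean, noisy):
    """Fraction of tokens whose noisy label differs from the clean label."""
    if len(clean) != len(noisy):
        raise ValueError("clean and noisy label sequences must be parallel")
    if not clean:
        return 0.0
    mismatches = sum(c != n for c, n in zip(clean, noisy))
    return mismatches / len(clean)


def per_label_noise(clean, noisy):
    """For each clean label, the fraction of its tokens that were mislabeled."""
    if len(clean) != len(noisy):
        raise ValueError("clean and noisy label sequences must be parallel")
    totals = Counter(clean)
    wrong = Counter(c for c, n in zip(clean, noisy) if c != n)
    return {label: wrong[label] / totals[label] for label in totals}


# Toy, made-up example (NOT taken from the dataset): one clean label
# sequence and one noisy label sequence over the same eight tokens.
clean = ["O", "B-PER", "I-PER", "O", "O", "B-LOC", "O", "O"]
noisy = ["O", "B-PER", "O", "O", "O", "O", "O", "B-ORG"]

print(noise_rate(clean, noisy))
print(per_label_noise(clean, noisy))
```

Splitting the rate by clean label matters here because of the skewed label distribution: the dominant `O` class can keep the overall noise rate low even when most entity tokens are mislabeled.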