Retrain NER component with gold standard from SUC 3.0?

EmilStenstrom commented 2 years ago

In the NER section of the readme, you write that the entities were automatically annotated by Sparv. That's true for the <ne type=X> tags, but there are also <name type=X> tags, which are manually annotated. Those have some issues too, but I think I've ironed them out in https://github.com/EmilStenstrom/suc_to_iob/ by picking the name tags in some cases and the ne tags in others (see the readme for the algorithm).

Would it be feasible to retrain the NER component with an updated dataset based on suc_to_iob?

Nuccy90 commented 2 years ago

Hi, we're going to make a spaCy project based on this version of SUC and it will be used as a starting point to train official Swedish models. I hope that helps!

EmilStenstrom commented 2 years ago

Hi. I understand the thinking behind using an official project like that. Looking at the actual annotations in that dataset, there are lots of things that make no sense in both the manual and the automated annotations. The only way (again, looking at the actual annotations) I can make sense of that dataset is by picking the manual tags when they make sense, and the automated tags when they make sense. suc_to_iob does that, and will therefore make for better ML models.

What problems do the manual annotations have that can be fixed by selectively using the <ne> tags?

Many countries are manually annotated as "inst" when they should be "place"
"Person" includes titles, so "bror Ture" instead of just "Ture"
"other" is just a mess and can't be used
"animal" and "myth" are basically names, which, according to e.g. the annotator guide for the CoNLL2003 data, should be marked as "person". They are also very rare compared to the other tags. In practice, this just confuses models trained on that data.

They say a model is only as good as the data it's trained on; are you sure you should use a dataset with the above issues?
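To make the fixes listed above concrete, here is a minimal Python sketch of how one manually annotated span could be cleaned up. It is not the suc_to_iob algorithm (see that readme for the real one). The function name `normalize_span`, the label sets, the "LOC" value for the automatic tag, and the capitalization heuristic for titles are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical label sets for illustration; not taken from suc_to_iob.
MERGE_INTO_PERSON = {"animal", "myth"}  # rare name-like classes
DROPPED = {"other"}                     # too noisy to train on


def normalize_span(
    tokens: List[str],
    name_type: str,
    ne_type: Optional[str] = None,
) -> Optional[Tuple[List[str], str]]:
    """Clean one manually annotated span.

    tokens    -- the words covered by the manual <name> tag
    name_type -- the manual label, e.g. "person", "inst", "place"
    ne_type   -- the automatic <ne> label for the same span, if any
                 (assumed here to be "LOC" for locations)
    Returns (tokens, label), or None if the span should be discarded.
    """
    if name_type in DROPPED:
        return None
    if name_type in MERGE_INTO_PERSON:
        name_type = "person"
    # Trust the automatic tag when it says a manual "inst" is a location.
    if name_type == "inst" and ne_type == "LOC":
        name_type = "place"
    if name_type == "person":
        # Assumed heuristic: titles such as "bror" are lowercase, so drop
        # leading lowercase tokens but always keep at least one token.
        start = 0
        while start < len(tokens) - 1 and not tokens[start][:1].isupper():
            start += 1
        tokens = tokens[start:]
    return tokens, name_type
```

For example, `normalize_span(["bror", "Ture"], "person")` keeps only `["Ture"]`, and a span labelled "other" is discarded.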

Nuccy90 commented 2 years ago

We are going to use the simple_tags version in the Huggingface dataset above, so most of the problems that you mention should be fixed. The logic is to use the automatic tags that match the manual tags when possible, without losing the classes that make more sense in the machine annotation.
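As a rough illustration of that logic (not the actual code behind simple_tags, which is not shown in this thread), a per-token reconciliation could look like the Python sketch below. The mapping `MANUAL_TO_AUTO`, the set `AUTO_ONLY`, and both function names are assumptions made for this example.

```python
from typing import List, Optional

# Assumed correspondence between manual <name> types and automatic <ne> types.
MANUAL_TO_AUTO = {
    "person": "PRS",
    "place": "LOC",
    "inst": "ORG",
    "work": "WRK",
    "product": "OBJ",
    "event": "EVN",
}
# Assumed classes that exist only in the automatic annotation and are kept.
AUTO_ONLY = {"TME", "MSR"}


def reconcile(manual: Optional[str], auto: Optional[str]) -> Optional[str]:
    """Return the automatic label if the manual label agrees with it,
    or if it belongs to a class only the automatic annotation has."""
    if manual is not None and MANUAL_TO_AUTO.get(manual) == auto:
        return auto
    if manual is None and auto in AUTO_ONLY:
        return auto
    return None  # disagreement or no entity: treat as outside


def to_iob(labels: List[Optional[str]]) -> List[str]:
    """Convert per-token labels to IOB tags. Adjacent tokens with the
    same label are treated as one entity, which is a simplification."""
    tags = []
    prev = None
    for label in labels:
        if label is None:
            tags.append("O")
        elif label == prev:
            tags.append("I-" + label)
        else:
            tags.append("B-" + label)
        prev = label
    return tags
```

With this sketch, a token tagged "inst" by hand but "LOC" automatically ends up as "O", which may or may not be how the real dataset resolves such conflicts.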

EmilStenstrom commented 2 years ago

> The logic is to use the automatic tags that match the manual tags when possible, without losing the classes that make more sense in the machine annotation.

This seems like a similar approach to the one I use. Good luck!

Kungbib / swedish-spacy

Retrain NER component with gold standard from SUC 3.0? #10