explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

French Named Entity Extraction issues #1906

Closed shawn-mccorkell closed 6 years ago

shawn-mccorkell commented 6 years ago

I was playing around with the default French model using your visualizer for Entity Extraction.

I noticed that it has a hard time extracting the Person entities for French. Organization and Location seem pretty accurate.

Is this normal, or just not working in the visualizer? We do plan to train our own models with our data.

I am trying with this text and I figure the default model should extract Justin Trudeau and Donald Trump.

En plus d'essuyer les tirs nourris de ses adversaires sur les questions d'éthique, Justin Trudeau devra composer avec trois incertitudes tout au long de la session parlementaire qui s'étirera au moins jusqu'à la mi-juin.

Depuis 12 mois, le gouvernement Trudeau multiplie les gestes et les missions aux États-Unis afin de convaincre l'administration de Donald Trump de ne pas déchirer l'Accord de libre-échange nord-américain (ALENA). Mais le président continue de dénoncer cet accord commercial, en vigueur depuis 1994. Il l'a fait encore vendredi durant sa visite éclair à Davos, en Suisse, où avait lieu la semaine dernière le Forum économique mondial. Les nombreuses sorties du président américain ont laissé peu de choix au gouvernement Trudeau.

Visualizer Test

ines commented 6 years ago

The visualizer uses the same model(s) available for download with spaCy, so the output should be identical. The French NER was trained on the WikiNER corpus (see: Learning multilingual named entity recognition from Wikipedia, Nothman et al., 2013), the parser and tagger on the Universal Dependencies Corpus.

If you're working with named entities, training on your own domain-specific data is definitely important. The entities the model recognises strongly depend on the data it was trained on. Texts on Wikipedia contain a lot of entities – but they're also quite different from news article text. The recency of the training data can also have an impact: a corpus from 2010 is considered fairly recent, but a model trained on it will not have seen examples of many entities that are highly relevant today.

"Trump" is also a good example of that. Many English models trained on the standard corpora often tend to tag it as an ORG instead of a PERSON, because the data they were trained on contained more mentions of Trump hotels and other business ventures, and fewer mentions of the person (which is very different from the usage of that entity today).

If you have data available that's close to the texts your model will have to process at runtime, you should be able to update the French model pretty easily, and correct its predictions. You can find examples of this in the training documentation. The examples/training directory also has a range of scripts available that you can modify and run.
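The update loop described above can be sketched roughly as follows. Note that this uses the current spaCy v3 API (`Example.from_dict`, `nlp.update`) rather than the v2 API that was current when this thread was written, and the training sentences, spans, and iteration count are illustrative placeholders only – on real data you would load the pretrained French pipeline instead of a blank one:

```python
import random
import spacy
from spacy.training import Example

# Toy training data: (text, annotations) pairs with character offsets.
# These examples are placeholders, not real training data.
TRAIN_DATA = [
    ("Justin Trudeau rencontre Donald Trump.",
     {"entities": [(0, 14, "PER"), (25, 37, "PER")]}),
    ("Le gouvernement Trudeau défend l'ALENA.",
     {"entities": [(16, 23, "PER")]}),
]

# A blank pipeline keeps the sketch self-contained; to *update* the
# released model you would use spacy.load("fr_core_news_sm") instead.
nlp = spacy.blank("fr")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for itn in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        # Wrap each annotated sentence as a training Example
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(itn, losses)
```

After training, `nlp.to_disk("my_model")` saves the updated pipeline so it can be loaded back with `spacy.load`.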

shawn-mccorkell commented 6 years ago

Thanks for the detailed explanation, very helpful!

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.