oximer closed this issue 6 years ago
Have the same questions here.
If you only need entity detection, spaCy v2.0 has a pretrained multilingual NER model whose training data includes the Portuguese Wikipedia. You can get those results without having to train anything.
Install the model:
python -m spacy download xx_ent_wiki_sm-2.0.0-alpha --direct
Then run the following Python code:
from __future__ import unicode_literals
import spacy
import xx_ent_wiki_sm

nlp = xx_ent_wiki_sm.load()

doc = nlp('Quem é Shaka Khan?')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

doc = nlp('Eu adoro Londres e Berlim')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Results:
(u'Shaka Khan', 7, 17, u'PER')
(u'Londres', 9, 16, u'LOC')
(u'Berlim', 19, 25, u'LOC')
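The two numbers in each result are character offsets into the input string, so text[start:end] gives back the entity text. If you later want to train your own model, annotations are usually written with the same kind of offsets. Here is a minimal sketch, assuming the (text, {'entities': [(start, end, label)]}) layout used by the v2 training examples. The helper name make_example is made up for illustration and is not part of spaCy.

```python
def make_example(text, mentions):
    """Build (text, {'entities': [(start, end, label), ...]}).

    `mentions` is a list of (surface_string, label) pairs, in the order
    they appear in `text`. Offsets are character positions, end-exclusive.
    """
    entities = []
    search_from = 0
    for surface, label in mentions:
        start = text.find(surface, search_from)
        if start == -1:
            raise ValueError('%r not found in %r' % (surface, text))
        end = start + len(surface)
        entities.append((start, end, label))
        search_from = end  # keep looking to the right of the last match
    return (text, {'entities': entities})


# Hypothetical training data, reusing the two sentences from above.
TRAIN_DATA = [
    make_example('Quem é Shaka Khan?', [('Shaka Khan', 'PER')]),
    make_example('Eu adoro Londres e Berlim', [('Londres', 'LOC'), ('Berlim', 'LOC')]),
]

for text, annotations in TRAIN_DATA:
    for start, end, label in annotations['entities']:
        print(text[start:end], start, end, label)
```

The offsets this produces are the same ones shown in the results above, which is a quick sanity check that hand-written annotations line up with the text.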
If I am not mistaken, your training approach won't work without first creating a tokenizer for Portuguese.
Sorry about the messy training examples and docs! I spent the past few days going over all examples, cleaning them up and adding more documentation.
Here's the new training examples directory: https://github.com/explosion/spaCy/tree/develop/examples/training
The current state only works with the spaCy version on develop, which will be released as soon as the new models are done training. The new docs are already in the website directory on develop, but not live yet, since we want to push the new version first.
(Unless there are serious bugs or problems, the upcoming alpha version will probably also be the version we'll promote to the release candidate 🎉 )
I was reading this example and wondering whether I can train a NER model using the Portuguese language support in spaCy 2.0.
I know that we don't have a model for Portuguese yet, but the example above is not really using a model. Instead, it loads a language with a custom pipeline. To be clear, the example uses neither the tagger nor the parser.
So what result should I expect if I run something like this?
Results
First, does this code make any sense? I discussed it with other members of the spaCy community, and many of them told me that they trained a NER model for Portuguese without a formal spaCy model for Portuguese. I guess they used the "simple" tokenizer available for Portuguese, which basically does .split(' ').
Second, how do the tagger and parser affect the NER? Do I really need them if I just want to identify intents and entities?
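To make the .split(' ') guess above concrete, here is a rough pure-Python sketch of why a plain whitespace split might be a problem for entity spans. It does not use spaCy at all. The function names and the regex are my own stand-ins, not the rules of any real tokenizer. The idea is just that an entity annotated by character offsets has to start and end on token boundaries, and punctuation glued to a word breaks that.

```python
import re


def whitespace_tokens(text):
    """Split on single spaces, keeping (token, start, end) character offsets."""
    tokens, pos = [], 0
    for piece in text.split(' '):
        if piece:
            tokens.append((piece, pos, pos + len(piece)))
        pos += len(piece) + 1  # +1 for the space that was consumed
    return tokens


def simple_tokens(text):
    """Separate words from punctuation with a regex (illustrative only)."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def span_aligns(tokens, start, end):
    """True if [start, end) begins and ends exactly on token boundaries."""
    starts = {s for _, s, _ in tokens}
    ends = {e for _, _, e in tokens}
    return start in starts and end in ends


text = 'Quem é Shaka Khan?'
start, end = 7, 17  # character span of 'Shaka Khan'

# With a plain split the last token is 'Khan?', so the span end falls inside it.
print(whitespace_tokens(text)[-1][0], span_aligns(whitespace_tokens(text), start, end))
# With punctuation split off, 'Khan' and '?' are separate tokens and the span aligns.
print(simple_tokens(text)[-2][0], span_aligns(simple_tokens(text), start, end))
```

So even if the tagger and parser are not needed, it seems the tokenizer has to at least split off punctuation for the entity offsets to be usable.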