explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Training a NER for a Language #1159

Closed — oximer closed this issue 6 years ago

oximer commented 7 years ago

I was reading this example and wondering if I can train a NER model using the Portuguese language support in spaCy 2.0.

I know that we don't have a model for Portuguese yet, but the example above is not really using a model. Instead, it is loading a language with a custom pipeline. To be clear, the example isn't using either the tagger or the parser.

So, what result should I expect if I execute something like this?

# coding: utf-8
from __future__ import unicode_literals

import random
from spacy.lang.pt import Portuguese
from spacy.gold import GoldParse, biluo_tags_from_offsets

def main(model_dir=None):
    train_data = [
        ('Quem é Shaka Khan?',
            [(len('Quem é '), len('Quem é Shaka Khan'), 'PERSON')]),
        ('Eu adoro Londres e Berlim',
            [(len('Eu adoro '), len('Eu adoro Londres'), 'LOC'),
             (len('Eu adoro Londres e '), len('Eu adoro Londres e Berlim'), 'LOC')])
    ]
    nlp = Portuguese(pipeline=['tensorizer', 'ner'])

    print('starting')

    def get_data(): return reformat_train_data(nlp.tokenizer, train_data)
    optimizer = nlp.begin_training(get_data)
    for itn in range(100):
        random.shuffle(train_data)
        losses = {}
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_offsets)
            nlp.update([doc], [gold], drop=0.5, sgd=optimizer, losses=losses)
    nlp.to_disk('./trash_model')

def reformat_train_data(tokenizer, examples):
    """Reformat data to match JSON format"""
    print(examples)
    output = []
    for i, (text, entity_offsets) in enumerate(examples):
        doc = tokenizer(text)
        ner_tags = biluo_tags_from_offsets(doc, entity_offsets)
        words = [w.text for w in doc]
        tags = ['-'] * len(doc)
        heads = [0] * len(doc)
        deps = [''] * len(doc)
        sentence = (range(len(doc)), words, tags, heads, deps, ner_tags)
        output.append((text, [(sentence, [])]))
    return output

if __name__ == '__main__':
    main()
# coding: utf-8
from __future__ import unicode_literals

from spacy.lang.pt import Portuguese

nlp = Portuguese(pipeline=['tensorizer', 'ner']).from_disk('./trash_model')

doc = nlp('Quem é Shaka Khan?')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

doc = nlp('Eu adoro Londres e Berlim')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Results:

(u'Shaka Khan', 7, 17, u'PERSON')
(u'Londres', 9, 16, u'LOC')
(u'Berlim', 19, 25, u'LOC')

First, does this code make any sense? I discussed this with other members of the spaCy community, and many of them told me that they had trained a NER for Portuguese without a formal spaCy model for Portuguese. I guess they used the "simple" tokenizer available for Portuguese, basically .split(' ').
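To make the tokenizer question concrete, here is a minimal sketch in plain Python (no spaCy required; the function name biluo_from_offsets and the whitespace tokenization are assumptions for illustration, not spaCy internals) of how character offsets map to BILUO entity tags, which is roughly what biluo_tags_from_offsets computes over real tokens:

```python
# Sketch: map (start, end, label) character offsets onto BILUO tags
# over a naive whitespace tokenization.
def biluo_from_offsets(text, offsets):
    tokens = text.split(' ')
    # Compute the (start, end) character span of each whitespace token.
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # +1 for the single space separator
    tags = ['O'] * len(tokens)
    for start, end, label in offsets:
        # Tokens fully covered by the annotated span.
        covered = [i for i, (s, e) in enumerate(spans) if s >= start and e <= end]
        if not covered:
            continue  # offsets don't align with token boundaries
        if len(covered) == 1:
            tags[covered[0]] = 'U-' + label  # Unit: single-token entity
        else:
            tags[covered[0]] = 'B-' + label  # Begin
            for i in covered[1:-1]:
                tags[i] = 'I-' + label      # In
            tags[covered[-1]] = 'L-' + label  # Last

    return tags

print(biluo_from_offsets('Eu adoro Londres e Berlim',
                         [(9, 16, 'LOC'), (19, 25, 'LOC')]))
# → ['O', 'O', 'U-LOC', 'O', 'U-LOC']
```

The key point is that the entity offsets must line up exactly with token boundaries, which is why the choice of tokenizer matters for training.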

Second, how do the tagger and parser impact the NER? Do I really need them if I just want to identify intents and entities?

vinistig commented 7 years ago

Have the same questions here.

twielfaert commented 7 years ago

If you only need entity detection, spaCy v2.0 has a pretrained multilingual NER model trained on the Portuguese Wikipedia. You can obtain those results without having to train anything.

Install the model:

python -m spacy download xx_ent_wiki_sm-2.0.0-alpha --direct

Then run the following Python code:

from __future__ import unicode_literals

import spacy
import xx_ent_wiki_sm

nlp = xx_ent_wiki_sm.load()

doc = nlp('Quem é Shaka Khan?')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

doc = nlp('Eu adoro Londres e Berlim')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Results:

(u'Shaka Khan', 7, 17, u'PER')
(u'Londres', 9, 16, u'LOC')
(u'Berlim', 19, 25, u'LOC')

If I am not mistaken, your training approach won't work without first creating a tokenizer for Portuguese.
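One way to see why the tokenizer matters (a plain-Python illustration under the assumption of naive whitespace splitting, not what spaCy's rule-based tokenizer actually does) is to check the first training sentence above, where the PERSON annotation ends at character 17:

```python
# Compute (token, start, end) character spans for a naive whitespace split
# of the first training example.
text = 'Quem é Shaka Khan?'
pos, spans = 0, []
for tok in text.split(' '):
    spans.append((tok, pos, pos + len(tok)))
    pos += len(tok) + 1  # +1 for the space separator

print(spans)
# → [('Quem', 0, 4), ('é', 5, 6), ('Shaka', 7, 12), ('Khan?', 13, 18)]
```

Because '?' sticks to 'Khan', the last token spans characters 13-18, so no token boundary matches the annotated entity end at 17 and the tags cannot be assigned. A proper tokenizer that splits off punctuation (like the one shipped with spacy.lang.pt) avoids this misalignment.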

ines commented 6 years ago

Sorry about the messy training examples and docs! I spent the past few days going over all examples, cleaning them up and adding more documentation.

Here's the new training examples directory: https://github.com/explosion/spaCy/tree/develop/examples/training

The current state only works with the spaCy version on develop – which will be released as soon as the new models are done training. The new docs are already in the website directory on develop, but not live yet, since we want to push the new version first.

(Unless there are serious bugs or problems, the upcoming alpha version will probably also be the version we'll promote to the release candidate 🎉 )

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.