explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.3k stars 4.41k forks

Add Khmer language support only for NER in spaCy #1399

Closed nengine closed 7 years ago

nengine commented 7 years ago

I would like to add a new language for Khmer, but since only NER data is available, is it possible to add new language support excluding the tagger and parser?

  1. Should I use version 1 or 2 to train a new language model? I understand version 2 is still in alpha, but I'm curious whether I can already use it to train a new language.
  2. Could you provide an example of the input data for training an NER model? I saw a format like the one below used to train German NER in the examples directory, but the documentation says to use JSON, and I am not sure which one to use.

Thanks a lot.

1   Gleich  O   O
2   darauf  O   O
3   entwirft    O   O
4   er  O   O
5   seine   O   O
6   Selbstdarstellung   O   O
7   "   O   O
8   Ecce    B-OTH   O
9   homo    I-OTH   O
10  "   O   O
11  in  O   O
12  enger   O   O
13  Auseinandersetzung  O   O
14  mit O   O
15  diesem  O   O
16  Bild    O   O
17  Jesu    B-PER   O
18  .   O   O

honnibal commented 7 years ago

Hi! Thanks a lot for your interest. I think in the v2 branch we've finally done enough to make this easy, so I'm very interested to find out your experience.

Should I use version 1 or 2 to train a new language model? I understand version 2 is still in alpha, but I'm curious whether I can already use it to train a new language.

Definitely v2. There are so many steps you don't have to do.

Could you provide an example of the input data for training an NER model? I saw a format like the one below used to train German NER in the examples directory, but the documentation says to use JSON, and I am not sure which one to use.

The easiest thing is to put your data into the .iob format, and use spacy convert. An example of the .iob format:

The|DT|I-MISC Oxford|NNP|I-MISC Companion|NNP|I-MISC to|TO|I-MISC Philosophy|NNP|I-MISC says|VBZ|O ,|,|O "|LQU|O there|EX|O is|VBZ|O no|DT|O single|JJ|O defining|VBG|O position|NN|O that|IN|O all|DT|O anarchists|NNS|O hold|VBP|O ,|,|O and|CC|O those|DT|O considered|VBN|O anarchists|NNS|O at|IN|O best|JJS|O share|NN|O a|DT|O certain|JJ|O family|NN|O resemblance|NN|O .|.|O "|RQU|O
In|IN|O the|DT|O end|NN|O ,|,|O for|IN|O anarchist|JJ|O historian|JJ|O Daniel|NNP|I-PER Guerin|NNP|I-PER "|LQU|O Some|DT|O anarchists|NNS|O are|VBP|O more|RBR|O individualistic|JJ|O than|IN|O social|JJ|O ,|,|O some|DT|O more|JJR|O social|JJ|O than|IN|O individualistic|JJ|O .|.|O

That's one sentence per line, in the format 'word|tag|NER'. If you have your text in documents or paragraphs, you should add one newline in between these units, so that they can be grouped together.

If you don't have tags, you should be able to just use any tag value, e.g. -. You should then train with the flags -P and -T, to disable the tagger and parser during training.
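As a rough illustration of the steps above, here is a minimal Python sketch that turns column-formatted data like the German example into 'word|tag|NER' lines, filling in - where no tag is available. The function names and the assumed column order (index, token, NER tag, remaining columns ignored) are illustrative assumptions, not part of spaCy. Tokens that contain | or whitespace would need extra handling.

```python
def columns_to_iob_line(rows, placeholder_tag="-"):
    """Join (token, ner_tag) pairs into one 'word|tag|NER' sentence line."""
    return " ".join(f"{token}|{placeholder_tag}|{ner}" for token, ner in rows)


def read_column_sentences(text):
    """Yield sentences as lists of (token, ner_tag) pairs.

    Assumes whitespace-separated columns: index, token, NER tag, then any
    further columns, which are ignored. Blank lines separate sentences.
    """
    sentence = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append((parts[1], parts[2]))
    if sentence:
        yield sentence


# Shortened sample in the style of the German data quoted earlier.
sample = """1 Bild O O
2 Jesu B-PER O
3 . O O
"""
iob_lines = [columns_to_iob_line(s) for s in read_column_sentences(sample)]
print("\n".join(iob_lines))
```

Each resulting line could then be written to a .iob file, one sentence per line, before running spacy convert.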

nengine commented 7 years ago

I will have to prepare the NER tags and will post the results. Thanks a lot.

nengine commented 7 years ago

Is it OK to use either the IOB1 or the IOB2 format? Thanks.

honnibal commented 7 years ago

I don't remember which is which, but I think you want the one where all entities begin with B. I notice that's not what's in the snippet above -- I need to double check whether this causes problems!
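For reference, the scheme in which every entity begins with B- is usually called IOB2. In IOB1, B- is only used when an entity directly follows another entity of the same type. The .iob snippet earlier in this thread starts its entities with I-, which looks like the IOB1 convention. A minimal sketch of normalizing tags to IOB2 before conversion follows. The function name is hypothetical, and whether spaCy needs this step at all is not confirmed here.

```python
def iob1_to_iob2(tags):
    """Return a copy of `tags` in which every entity starts with B-."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # An I- tag opens a new entity when it follows O or an entity
            # of a different type, so it becomes B- in IOB2.
            if previous == "O" or previous[2:] != tag[2:]:
                tag = "B-" + tag[2:]
        converted.append(tag)
        previous = tag
    return converted


iob1_tags = ["I-MISC", "I-MISC", "O", "I-PER", "I-PER"]
print(iob1_to_iob2(iob1_tags))
```

Tags that are already IOB2 pass through unchanged, so the function can be applied without first checking which scheme the data uses.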

honnibal commented 7 years ago

Please see here --- the training is much improved in v2, and we've tried to give a lot more guidance about how to make good use of it: https://spacy.io/usage/training

nengine commented 7 years ago

Thank you @honnibal! Please let me know if I can still use 'spacy convert' from .iob files for training in v2.0? The train_ner.py example shows the training data in the format below. Also, is it possible to train NER from scratch without an existing language model in v2.0?

# training data
TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]

ines commented 7 years ago

Please let me know if I can still use 'spacy convert' from .iob files for training in v2.0?

Yes, this still works. See here. This will produce a JSON file to use with the train command. Also see the new guide on training models for more details and examples.

Also, is it possible to train NER from scratch without an existing language model in v2.0?

Yes, you can start off with a blank language. If you've added the language data for Khmer, you can do nlp = spacy.load('km') to create a new Language class. Even if you don't have any language data yet, all you need to do is create a spacy/lang/km/__init__.py setting up the class.

Also see train_new_entity_type.py and train_ner.py for the updated training examples.
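For comparison with the .iob route, the TRAIN_DATA tuples shown in train_ner.py store entities as character offsets into the raw text. A minimal sketch of deriving that shape from token-level IOB2 tags follows. The function name and the sep parameter are illustrative assumptions and not part of spaCy. Since Khmer is normally written without spaces between words, sep="" may be closer to real text than the default single space.

```python
def iob_to_offsets(tokens, tags, sep=" "):
    """Return (text, {'entities': [(start, end, label), ...]}).

    Assumes IOB2 tags and that the text is the tokens joined by `sep`.
    An I- tag that does not continue an open entity of the same label
    is treated like O.
    """
    text = sep.join(tokens)
    entities = []
    position = 0
    current = None  # [start, end, label] of the entity being built
    for token, tag in zip(tokens, tags):
        start, end = position, position + len(token)
        if tag.startswith("B-"):
            if current:
                entities.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end  # extend the open entity
        else:
            if current:
                entities.append(tuple(current))
            current = None
        position = end + len(sep)
    if current:
        entities.append(tuple(current))
    return text, {"entities": entities}


example = iob_to_offsets(
    ["Who", "is", "Shaka", "Khan", "?"],
    ["O", "O", "B-PERSON", "I-PERSON", "O"],
)
print(example)
```

The offsets only line up with the original text if the joined string reproduces it exactly, so it is worth checking text[start:end] for a few entities before training.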

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.