Closed nengine closed 7 years ago
Hi! Thanks a lot for your interest. I think in the v2 branch we've finally done enough to make this easy, so I'm very interested to find out your experience.
Should I use version 1 or 2 to train a new language model? I understand version 2 is still alpha, but I'm curious whether I can already use it to train a new language.
Definitely v2. There are so many steps you don't have to do.
Could you please provide an example of input data for training an NER model? I saw a format like the one below used in the example directory to train German NER, but the documentation says to use the JSON format, and I am not sure which one to use.
The easiest thing is to put your data into the .iob format, and use spacy convert. An example of the .iob format:
The|DT|I-MISC Oxford|NNP|I-MISC Companion|NNP|I-MISC to|TO|I-MISC Philosophy|NNP|I-MISC says|VBZ|O ,|,|O "|LQU|O there|EX|O is|VBZ|O no|DT|O single|JJ|O defining|VBG|O position|NN|O that|IN|O all|DT|O anarchists|NNS|O hold|VBP|O ,|,|O and|CC|O those|DT|O considered|VBN|O anarchists|NNS|O at|IN|O best|JJS|O share|NN|O a|DT|O certain|JJ|O family|NN|O resemblance|NN|O .|.|O "|RQU|O
In|IN|O the|DT|O end|NN|O ,|,|O for|IN|O anarchist|JJ|O historian|JJ|O Daniel|NNP|I-PER Guerin|NNP|I-PER "|LQU|O Some|DT|O anarchists|NNS|O are|VBP|O more|RBR|O individualistic|JJ|O than|IN|O social|JJ|O ,|,|O some|DT|O more|JJR|O social|JJ|O than|IN|O individualistic|JJ|O .|.|O
That's one sentence per line, in the format 'word|tag|NER'. If you have your text in documents or paragraphs, you should add one newline in between these units, so that they can be grouped together.
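For illustration, a line in this format can be read back with a few lines of plain Python. This is only a sketch: parse_iob_line is a hypothetical helper, not part of spaCy, and it assumes tokens are separated by whitespace and that each token carries exactly three '|'-separated fields.

```python
def parse_iob_line(line):
    """Split one sentence line into (word, tag, ner) triples.

    Hypothetical helper for illustration only; not a spaCy function.
    """
    tokens = []
    for item in line.split():
        # Split from the right so a word that itself contains '|' stays intact.
        word, tag, ner = item.rsplit("|", 2)
        tokens.append((word, tag, ner))
    return tokens


sentence = "Daniel|NNP|I-PER Guerin|NNP|I-PER ,|,|O"
for word, tag, ner in parse_iob_line(sentence):
    print(word, tag, ner)
```

Running a check like this over a file before calling spacy convert is a cheap way to catch lines with a missing field.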
If you don't have tags, you should be able to just use any tag value, e.g. '-'. You should then train with the flags -P and -T, to disable the tagger and parser during training.
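If you only have words and entity labels, a line with a placeholder tag could be assembled as in the sketch below. The helper name to_iob_line and its placeholder_tag parameter are made up for this example; only the 'word|tag|NER' layout and the '-' placeholder come from the thread.

```python
def to_iob_line(words, ner_tags, placeholder_tag="-"):
    """Build one 'word|tag|NER' sentence line, filling the tag slot with a placeholder.

    Hypothetical helper for illustration only; not a spaCy function.
    """
    if len(words) != len(ner_tags):
        raise ValueError("words and ner_tags must have the same length")
    return " ".join(
        f"{word}|{placeholder_tag}|{ner}" for word, ner in zip(words, ner_tags)
    )


line = to_iob_line(["Daniel", "Guerin", "spoke"], ["B-PER", "I-PER", "O"])
print(line)
```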
I will have to prepare the NER tags and will post the results. Thanks a lot.
Is it OK to use either the IOB1 or the IOB2 format? Thanks.
I don't remember which is which, but I think you want the one where all entities begin with B. I notice that's not what's in the snippet above -- I need to double check whether this causes problems!
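For reference, the scheme in which every entity begins with a B- tag is IOB2. In IOB1, B- is used only when an entity directly follows another entity of the same type, which is why the snippet above opens entities with I-MISC and I-PER. Whether IOB1 input actually causes problems is left open in this thread, but if you want to normalise your data to IOB2 first, a minimal sketch could look like this (iob1_to_iob2 is a hypothetical helper, not a spaCy function, and it assumes tags are plain strings such as I-PER or O):

```python
def iob1_to_iob2(tags):
    """Rewrite IOB1 tags so that every entity starts with a B- tag (IOB2).

    Hypothetical helper for illustration only; not a spaCy function.
    """
    converted = []
    previous = "O"
    for tag in tags:
        # An I- tag opens a new entity if it follows O or a different label.
        if tag.startswith("I-") and (previous == "O" or previous[2:] != tag[2:]):
            converted.append("B-" + tag[2:])
        else:
            converted.append(tag)
        previous = tag
    return converted


print(iob1_to_iob2(["I-MISC", "I-MISC", "O", "I-PER", "I-PER"]))
```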
Please see here --- the training is much improved in v2, and we've tried to give a lot more guidance about how to make good use of it: https://spacy.io/usage/training
Thank you @honnibal! Please let me know if I can still use 'spacy convert' from .iob files for training in v2.0? The train_ner.py example shows the training data in the format below. Also, is it possible to train NER from scratch without an existing language model in v2.0?
# training data
TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]
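The entities here are (start, end, label) character offsets into the text, so 'Who is Shaka Khan?'[7:17] is 'Shaka Khan'. If your annotations are token-level tags instead, the offsets can be derived as in the sketch below. It is an illustration under stated assumptions: tags_to_offsets is a hypothetical helper, not a spaCy function, and it joins tokens with single spaces, so the resulting text may differ from your original text around punctuation.

```python
def tags_to_offsets(words, ner_tags):
    """Turn token-level IOB tags into (text, {'entities': [(start, end, label)]}).

    Hypothetical helper for illustration only; tokens are joined with single
    spaces, and both IOB1- and IOB2-style tags are accepted.
    """
    entities = []
    pieces = []
    position = 0
    current = None  # [start, end, label] of the entity being built
    for word, tag in zip(words, ner_tags):
        if pieces:
            position += 1  # the single space inserted before this word
        start, end = position, position + len(word)
        prefix, _, label = tag.partition("-")
        if tag == "O":
            if current:
                entities.append(tuple(current))
                current = None
        elif prefix == "I" and current and current[2] == label:
            current[1] = end  # extend the running entity
        else:
            # A B- tag, or an I- tag that opens a new entity (IOB1 style).
            if current:
                entities.append(tuple(current))
            current = [start, end, label]
        pieces.append(word)
        position = end
    if current:
        entities.append(tuple(current))
    return " ".join(pieces), {"entities": entities}


text, annotations = tags_to_offsets(
    ["Who", "is", "Shaka", "Khan", "?"],
    ["O", "O", "B-PERSON", "I-PERSON", "O"],
)
for start, end, label in annotations["entities"]:
    print(text[start:end], label)
```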
Please let me know if I can still use 'spacy convert' from .iob files for training in v2.0?
Yes, this still works. See here. This will produce a JSON file to use with the train command. Also see the new guide on training models for more details and examples.
Also, is it possible to train NER from scratch without an existing language model in v2.0?
Yes, you can start off with a blank language. If you've added the language data for Khmer, you can do nlp = spacy.load('km') to create a new Language class. Even if you don't have any language data yet, all you need to do is create a spacy/lang/km/__init__.py setting up the class.
Also see train_new_entity_type.py and train_ner.py for the updated training examples.
I would like to add a new language for Khmer, but since only NER data is available, is it possible to add new language support that excludes the tagger and parser?
Thanks a lot.
Your Environment