chokkan / crfsuite

CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
http://www.chokkan.org/software/crfsuite/

Non-English language training data for NER #92

Closed uugan closed 7 years ago

uugan commented 7 years ago

Does anyone know how to create a training data set for CRFsuite for named-entity recognition? How do I convert a data set like the one below to the CRFsuite format?

...
Commission NNP I-NP I-ORG
said VBD I-VP O
on IN I-PP O
Thursday NNP I-NP O
it PRP B-NP O
disagreed VBD I-VP O
with IN I-PP O
German JJ I-NP I-MISC
...

usptact commented 7 years ago

Do you know what format CRFsuite accepts? Did you check the docs?

uugan commented 7 years ago

I'm not sure, but I read about it here: http://www.chokkan.org/software/crfsuite/manual.html#idp8853717120 But that format [CoNLL-2000] looks too complicated to create. Is it possible to train on a data set like the one above [CoNLL-2003]?

usptact commented 7 years ago

Yes, you need to write a Python script that extracts the features from your data file. I believe there is such a script somewhere in the CRFsuite repository directories.

If you read the feature file that you linked, do you understand what each block, each line, and each element means?

uugan commented 7 years ago

"If you read the feature file that you linked, do you understand what each block, each line and elements mean?"

Not fully. The manual only shows POS tagging, but I need to train on data for NER (tags such as organisation, person, location, etc.). How do I create that kind of training data? Also, the tag function of this library only works with file input and output. Is it possible to add a function that takes an input string for predicting/testing?

usptact commented 7 years ago

The principle is exactly the same. You need to write a script that outputs features for each token in a sequence (e.g. "current word", "previous word", "next word", etc.). For each token you also need to provide the named-entity label, just as with POS tags.

If you look carefully at the features (column 2 onward), you will notice that they can be a good starting point for your NER model as well.

The manual page you linked is actually pretty good at explaining. Is there something in particular that you didn't understand there?

Take a look at crfsuite/example/ner.py.
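As a rough illustration of the kind of script described above, here is a minimal Python sketch that reads CoNLL-2003-style rows and emits CRFsuite-style blocks. It is not the script shipped in the repository. The function names (`escape`, `read_conll`, `to_crfsuite`) are hypothetical, the column order (word, POS, chunk, NER) is assumed from the sample in the question, and the tab separator and the escaping of `:` and `\` reflect my reading of the CRFsuite manual and should be checked against it.

```python
def escape(value):
    """Escape characters that (as assumed here) are special in CRFsuite attributes."""
    return value.replace("\\", "\\\\").replace(":", "\\:")


def read_conll(lines):
    """Yield sentences as lists of (word, pos, chunk, ner) tuples."""
    sentence = []
    for line in lines:
        parts = line.split()
        if not parts:                      # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if parts[0] == "-DOCSTART-":       # document marker, not a token
            continue
        word, pos, chunk, ner = parts[:4]  # assumed column order
        sentence.append((word, pos, chunk, ner))
    if sentence:                           # last sentence without trailing blank line
        yield sentence


def to_crfsuite(sentence):
    """Return one block: the label first, then attributes, one token per line."""
    rows = []
    for word, pos, _chunk, ner in sentence:
        attributes = ["w[0]=" + escape(word), "pos[0]=" + escape(pos)]
        rows.append("\t".join([ner] + attributes))
    # The extra newline produces the empty line that ends the sequence.
    return "\n".join(rows) + "\n\n"


if __name__ == "__main__":
    sample = [
        "Commission NNP I-NP I-ORG",
        "said VBD I-VP O",
        "",
        "German JJ I-NP I-MISC",
    ]
    for sent in read_conll(sample):
        print(to_crfsuite(sent), end="")
```

The NER tag is used as the label and the POS tag becomes an ordinary attribute; the chunk column is ignored here but could be added the same way.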

usptact commented 7 years ago

Let's say that you have a string like "my name is uugan" and you want "uugan" tagged as PERSON.

Your features file can look something like this:

O w[0]=my w[-1]=NULL w[+1]=name
O w[0]=name w[-1]=my w[+1]=is
O w[0]=is w[-1]=name w[+1]=uugan
B-PERSON w[0]=uugan w[-1]=is w[+1]=NULL

You provide a file containing as many such blocks as you need, delimited by empty lines. Each blank line separates one string from the next.

Look at the features and make sure you understand how I built them. Note that CRFsuite does not care about the actual string values: "w[0]=my" is as different from "w[-1]=NULL" as it is from "w[+1]=is". What counts is the presence or absence of features. It is up to you to come up with features.
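For what it's worth, window features of this kind can be generated mechanically. The following is only a sketch: `window_features` and `feature_lines` are hypothetical helper names, "NULL" is just the placeholder used in this thread for a missing neighbour, and the fields are joined with tabs on the assumption that this is the separator the CRFsuite manual documents.

```python
def window_features(tokens, i):
    """Build current/previous/next word features for the token at index i."""
    prev_word = tokens[i - 1] if i > 0 else "NULL"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "NULL"
    return ["w[0]=" + tokens[i], "w[-1]=" + prev_word, "w[+1]=" + next_word]


def feature_lines(tokens, labels, sep="\t"):
    """Return one line per token: the label followed by its features."""
    return [
        sep.join([label] + window_features(tokens, i))
        for i, label in enumerate(labels)
    ]


if __name__ == "__main__":
    tokens = "my name is uugan".split()
    labels = ["O", "O", "O", "B-PERSON"]
    for line in feature_lines(tokens, labels):
        print(line)
    print()  # empty line that closes the sequence
```

Because each feature is an opaque string to the learner, richer features (prefixes, suffixes, capitalisation flags, wider windows) can be added by appending more strings to the list.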

uugan commented 7 years ago

I've just tested with the win32 binary, and the learn command gets stuck. The training file is below (is it OK without the pos field? In ner.py: Field names of the input data. fields = 'y w pos chk'):

O   w[0]=my w[-1]=NULL w[1]=name __BOS__
O   w[0]=name w[-1]=my w[1]=is
B-PERSON    w[0]=uugan w[-1]=is w[1]=NULL __EOS__

Starting the training (learn):

>crfsuite.exe learn -m crfsuite_train_ner.model crfsuite_train_ner.txt
CRFSuite 0.12  Copyright (c) 2007-2011 Naoaki Okazaki

Start time of the training: 2017-09-02T11:47:13Z

Reading the data set(s)
[1] conll2000\crfsuite_train_ner.txt
0....1....2....3....4....5....6....7....8....9....10

After that line, nothing more is printed to the console. Is this the right data format? The POS-tagging format trained OK, but our NER format seems to be wrong. Adding the pos field to the training file gives the same result. Should the test file use the same format as above?

usptact commented 7 years ago

Everything should be tab-separated, including the features.

uugan commented 7 years ago

-- Data must have an empty line at the end of each sequence! -- That was what went wrong. About the structure of the training and test data: -- each label must be followed by a tab, and the attributes (features) are separated by spaces.

O{TAB}w[-1]=NULL{SPACE}w[0]=my{SPACE}w[1]=name{SPACE}pos[0]=PRP{SPACE}pos[1]=NNP{SPACE}__BOS__
...
After __EOS__ there must be an empty line.
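A small pre-flight check along these lines can catch both problems before running learn. This is a sketch only: `check_training_data` is a hypothetical helper, not part of CRFsuite, and it tests just the two conditions reported in this thread (a TAB after each label, and an empty line after the last sequence).

```python
def check_training_data(text):
    """Return a list of problems found in CRFsuite-style training text."""
    text = text.replace("\r\n", "\n")  # tolerate Windows line endings
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        # Every non-empty line needs a TAB between the label and the attributes.
        if line.strip() and "\t" not in line:
            problems.append("line %d: no TAB after the label" % number)
    # The final sequence must be closed by an empty line.
    if not text.endswith("\n\n"):
        problems.append("last sequence is not followed by an empty line")
    return problems


if __name__ == "__main__":
    good = "O\tw[0]=my\nB-PERSON\tw[0]=uugan\n\n"
    bad = "O w[0]=my\nB-PERSON\tw[0]=uugan\n"
    print(check_training_data(good))
    print(check_training_data(bad))
```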

Thanks for your help!