uugan closed this issue 7 years ago
Do you know what format CRFSuite accepts? Did you check the docs?
I'm not sure. I read the manual at http://www.chokkan.org/software/crfsuite/manual.html#idp8853717120 but creating data in that format [Conll2000] looks too complicated. Is it possible to train on a data set like the one in my question [Conll2003]?
Yes, you need to write a Python script that extracts the features from your data file. I believe there is such a script somewhere in the CRFSuite repo directories.
If you read the feature file that you linked, do you understand what each block, each line, and each element means?
Not fully. The manual only shows POS tagging, but I need to train on data for NER (tags like organisation, person, location, etc.). How do I create that kind of training data? Also, this library's tag function only works with file input and output. Is it possible to add a function that takes an input string for predicting/testing?
The principle is exactly the same. You need to write a script that will output features for each token in a sequence (e.g. "current word", "previous word", "next word" etc). For each token you also need to provide the named entity label, similarly as with POS.
If you look carefully at the features (column 2 onward), you will notice that they can be a good starting point for your NER model as well.
The manual page you linked is actually pretty good at explaining. Is there something in particular that you didn't understand there?
Take a look at crfsuite/example/ner.py.
Let's say that you have a string like "my name is uugan" and you want "uugan" tagged as PERSON.
Your features file can look something like this:
O w[0]=my w[-1]=NULL w[+1]=name
O w[0]=name w[-1]=my w[+1]=is
O w[0]=is w[-1]=name w[+1]=uugan
B-PERSON w[0]=uugan w[-1]=is w[+1]=NULL
You provide a file with as many blocks like this as you need, delimited by empty lines. Every blank line separates one string from the next.
Look at the features and make sure you understand how I built them. Note that CRFSuite does not care about the actual string values: "w[0]=my" is as different from "w[-1]=NULL" as it is from "w[+1]=is". What counts is the presence or absence of each feature. It is up to you to come up with the features.
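In case it helps, here is a minimal Python sketch of how feature lines like the ones above could be generated. It assumes whitespace-tokenized input and writes tab-separated fields; the function names token_features and format_sequence are made up for illustration and are not part of CRFSuite.

```python
def token_features(tokens, i):
    """Return context-window features for the token at position i."""
    prev_word = tokens[i - 1] if i > 0 else "NULL"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "NULL"
    return [
        "w[0]=" + tokens[i],
        "w[-1]=" + prev_word,
        "w[+1]=" + next_word,
    ]


def format_sequence(tokens, labels):
    """Format one sentence as a block: one tab-separated line per token."""
    if len(tokens) != len(labels):
        raise ValueError("each token needs exactly one label")
    lines = []
    for i, label in enumerate(labels):
        lines.append("\t".join([label] + token_features(tokens, i)))
    # End the block with a blank line so the next sentence starts a new block.
    return "\n".join(lines) + "\n\n"


tokens = "my name is uugan".split()
labels = ["O", "O", "O", "B-PERSON"]
print(format_sequence(tokens, labels), end="")
```

Adding more features (suffixes, capitalization, a wider window) would only mean appending more strings to the list returned by token_features.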
I've just tested with the win32 binary, and the learn command gets stuck. Here is the training file (is it OK without the pos array? In ner.py: Field names of the input data. fields = 'y w pos chk'):
O w[0]=my w[-1]=NULL w[1]=name __BOS__
O w[0]=name w[-1]=my w[1]=is
B-PERSON w[0]=uugan w[-1]=is w[1]=NULL __EOS__
Starting the training (learn):
>crfsuite.exe learn -m crfsuite_train_ner.model crfsuite_train_ner.txt
CRFSuite 0.12 Copyright (c) 2007-2011 Naoaki Okazaki
Start time of the training: 2017-09-02T11:47:13Z
Reading the data set(s)
[1] conll2000\crfsuite_train_ner.txt
0....1....2....3....4....5....6....7....8....9....10
But after that line nothing is printed to the console. Is that the right data format? The POS-tagging format trained OK, but our NER format seems to be wrong. Adding the pos array to the training file gives the same result. Should the test file also use the same format as above?
Everything should be tab separated. Including features.
The data must have an empty line at the end of each sequence! That was the reason it went wrong. About the training and test data structure: each label must be followed by a tab, and then the attributes (features) separated by spaces.
O{TAB}w[-1]=NULL{SPACE}w[0]=my{SPACE}w[1]=name{SPACE}pos[0]=PRP{SPACE}pos[1]=NNP{SPACE}__BOS__
...
After __EOS__ there must be an empty line.
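Since the missing empty line was easy to overlook, a small pre-flight check can catch it before running learn. This is only a sketch under the assumptions discussed in this thread (a tab after every label, an empty line after every sequence); find_format_problems is a hypothetical helper, not something shipped with CRFSuite.

```python
def find_format_problems(text):
    """Return a list of human-readable problems found in training text."""
    text = text.replace("\r\n", "\n")  # tolerate Windows line endings
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line == "":
            continue  # an empty line marks the end of a sequence
        if "\t" not in line:
            problems.append(f"line {number}: no tab after the label")
    # A properly terminated last sequence leaves the text ending in "\n\n".
    if text.strip() and not text.endswith("\n\n"):
        problems.append("last sequence is not followed by an empty line")
    return problems


sample = "O\tw[0]=my\t__BOS__\nB-PERSON w[0]=uugan __EOS__\n"
for problem in find_format_problems(sample):
    print(problem)
```

An empty result means only that these two checks passed; it does not guarantee that CRFSuite will accept the file.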
Thanks for your help!
Does anyone know how to create a training data set for CRFSuite for named-entity recognition? How can I convert a data set like the one below to the CRFSuite format?
...
Commission NNP I-NP I-ORG
said VBD I-VP O
on IN I-PP O
Thursday NNP I-NP O
it PRP B-NP O
disagreed VBD I-VP O
with IN I-PP O
German JJ I-NP I-MISC
...
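One possible way to do the conversion, as a rough Python sketch: it assumes each row has exactly four whitespace-separated columns (word, POS tag, chunk tag, NE tag), that blank lines separate sentences, and that -DOCSTART- rows can be skipped. The names read_conll and to_crfsuite, the feature names, and the __COLON__ placeholder are illustrative choices, not CRFSuite conventions I can vouch for. The colon replacement is there because, as far as I understand the manual, a colon inside an attribute separates its name from a weight.

```python
def read_conll(lines):
    """Group 'word POS chunk NE' rows into sentences (lists of 4-field rows)."""
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "-DOCSTART-":
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(fields)
    if sentence:
        yield sentence


def to_crfsuite(sentence):
    """Turn one sentence into a tab-separated block ending in a blank line."""
    rows = []
    last = len(sentence) - 1
    # Unpacking raises ValueError if a row does not have exactly 4 columns.
    for i, (word, pos, chunk, ner) in enumerate(sentence):
        feats = ["w[0]=" + word, "pos[0]=" + pos, "chk[0]=" + chunk]
        if i == 0:
            feats.append("__BOS__")
        if i == last:
            feats.append("__EOS__")
        # Assumption: replace ':' so it is not read as a name/weight separator.
        feats = [f.replace(":", "__COLON__") for f in feats]
        rows.append("\t".join([ner] + feats))
    return "\n".join(rows) + "\n\n"


sample = ["Commission NNP I-NP I-ORG", "said VBD I-VP O", "", "it PRP B-NP O"]
for sent in read_conll(sample):
    print(to_crfsuite(sent), end="")
```

The NE tag from the last column becomes the label in the first field, and the other three columns become features of the current token. Neighboring-word features can be added in the same loop.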