chokkan / crfsuite

CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
http://www.chokkan.org/software/crfsuite/
Other
648 stars 208 forks source link

Hindi Language NER Training format #98

Open dexception opened 6 years ago

dexception commented 6 years ago

I already have tagged data in the following format:

WORD tab POS_TAG tab NER

भारत NNP loc की PSP O पहली QO O महिला NNC O फोटो NN O जर्नलिस्ट NNPC O होमी NNP per सरकार NN O से PSP O अलग JJ O होना VM O पूरी NN O की PSP O सम्भावना NN O है VM O

I have around 260 MB size of data and i want to use CRFSuite. But not sure how to go about it. Can any of you guys explain how many columns do i need to prepare for each sentence.

Thank you.

usptact commented 6 years ago

@dexception Did you read the documentation on the format? Did you check out the simple test file and did you try to run it?

dexception commented 6 years ago

Thanks for replying! Yes i read the documentation regarding training format. I did run the example. I have the following query:

http://www.chokkan.org/software/crfsuite/data_sample.png What i can see in the above example is previous word, word, next word along with POS tagging.

I also want to do Person, Org, Location, Date and Event tagging along with POS tagging. I didn't find anything related to it. :-(

usptact commented 6 years ago

@dexception Yes, this is the right example. Do you understand how it is built?

dexception commented 6 years ago

Since i am a Java Developer. I am using the Java Wrapper for CRFSuite. It already has binaries for different OS in the Library. I am just confused about additional tags for NER. Can you show me how to tag NER entities for a couple of sentences along with POS tags ?

https://github.com/vinhkhuc/jcrfsuite

usptact commented 6 years ago

@dexception The CRFSuite format is different from the file you showed. You need to explicitly craft the features for each token of a sequence. The first element of the line is the label. The other elements are the features for that token. The features are completely free-form as long as you attribute a meaning to it (more on that below). Separate examples are separated by a blank line.

If you look carefully at the CRFSuite sample file, you should notice regularities/pattern in what features are extracted. Hint, current token is encoded with index "0", previous with index "-1" and next with index="+1".

In your example, I assume the first column is the token value, the second is POS tag and the last is NER tag. Is that correct? You need to write a script that will define the features you want CRF to use. For each token you can define the features:

Replace "..." with the respective value from your file.

You need to introduce a special feature for before/next lookups if you are at the boundaries. The very same test example shows an example.

I believe there is a python script in CRFSuite directories that allows you to convert your format file to CRFSuite format.

I suggest re-reading documentation on how data is formatted and look closer to the sample file to understand how it is built.

Remember, CRFSuite does not care about string values nor tries to understand them. There are only two possibilities: (1) two strings are equal, this is a hit and (2) two strings are not equal, this is a miss.

The weight of a feature by default is 1.0. A feature that is absent, it has an implicit weight of 0.0. If the same feature appears more than once for a given token, the weights will be added automatically. You can modify the weight by putting a colon (:) at the end and put a weight immediately after. You shouldn't worry about this too much in the beginning. FYI

dexception commented 6 years ago

@usptact Many thanks for this information. I have tried this approach but the accuracy is awful. I guess CRF it not good enough to be language independent. Will post the details of my analysis when i get accuracy > 75%

usptact commented 6 years ago

@dexception I don't know the language you are trying to tag with CRF but there are many tips & tricks to get the performance 75-85% F1-score without too much of effort.

Are you trying to build a POS tagger? An NER model? What named entities are you trying to predict? If you are predicting custom named entities, their choice is critical.

First, you can build fairly performant model by carefully selecting few thousands training examples (you can get decent performance with few hundreds actually). Diversity of different patterns is the key. Don't simply dump lots of training data, use the training examples that will help the model. Use negative examples (strings with no tags), those are useful to reduce overmatching.

Second, you can significantly improve the model by selecting informative gazetteers. These should make your recall jump quite a bit. Some examples include cities, countries, days of the week, months, people first/last names etc. Anything that groups nicely in a list, is a good candidate for a gazetteer. a good gazetteer is a clean gazetteer. You must write a feature extractor for this. Totally worth it in many cases.

Third, you can use some knowledge graph features or Brown cluster features extracted from a large unlabeled corpus. This should give you extra few F1-score points.

Fourth, if your language is inflectional, you can create some specific features such as prefixes/suffixes up to some length. Then create feature disjunction. Shape features. You might want to look at Stanford NER features set. It is pretty rich. Maybe you will find a feature configuration for your language.

Finally, there is a host of tips & tricks that you will learn once you start to master building CRF models to segment and label natural language. Yes, this is my day job :) I've built and maintained English, Spanish, German and even little bit Chinese NER models.

usptact commented 6 years ago

The tutorial has everything to get you started, including a link to a script that will convert your file to CRFSuite format. http://www.chokkan.org/software/crfsuite/tutorial.html

dexception commented 6 years ago

I already have a POS tagger for Hindi Language and as you wrote earlier i am writing rules. Based on last names to identify people and words like corporation, ltd, school, college, university to identify org. Dates with days of week, months, For events heart attack, audition, interview...already have more than 5000 words for that. I am not sure CRF can beat a Rule based system in terms of accuracy for Hindi Language. I did use Lingpipe CRF again the results were very poor. So i will probably post my findings by next week.

The objective is to have a language independent NER system provided there is already a POS tagger available for it. So if you just add the POS tagger and voila your CRF is ready.

usptact commented 6 years ago

@dexception The beauty of CRF for NER is that you don't need to write the rules explicitly. CRF is pretty good at finding those patterns for you. Rules-based systems are indeed more precise but are also much more tedious and only get more complicated with time, which hinders maintainability.

Good luck with your experiments!

dexception commented 6 years ago

This is \t(tab) separated format. For those who need help in training the following format is working for me:

O   w[0]=$  w[1]=खराब   pos[0]=NN   pos[1]  JJ  __BOS__
O   w[-1]=$ w[0]=खराब   w[1]=खाने   pos[-1]=VMpos[0]=JJ pos[1]  VM
O   w[-1]=खराब  w[0]=खाने   w[1]=का pos[-1]=PSPpos[0]=VM    pos[1]  PSP
O   w[-1]=खाने  w[0]=का w[1]=यह pos[-1]=DEMpos[0]=PSP   pos[1]  DEM
O   w[-1]=का    w[0]=यह w[1]=मामला  pos[-1]=MODpos[0]=DEM   pos[1]  MOD
O   w[-1]=यह    w[0]=मामला  w[1]=चल pos[-1]=VMpos[0]=MOD    pos[1]  VM
O   w[-1]=मामला w[0]=चल w[1]=ही pos[-1]=RPpos[0]=VM pos[1]  RP
O   w[-1]=चल    w[0]=ही w[1]=रहा    pos[-1]=VMpos[0]=RP pos[1]  VM
O   w[-1]=ही    w[0]=रहा    w[1]=था pos[-1]=VMpos[0]=VM pos[1]  VM
O   w[-1]=रहा   w[0]=था w[1]=,  pos[-1]=SYMpos[0]=VM    pos[1]  SYM
O   w[-1]=था    w[0]=,  w[1]=तभी    pos[-1]=PRPpos[0]=SYM   pos[1]  PRP
O   w[-1]=, w[0]=तभी    w[1]=सेना   pos[-1]=NNpos[0]=PRP    pos[1]  NN
org w[-1]=तभी   w[0]=सेना   w[1]=में    pos[-1]=PSPpos[0]=NN    pos[1]  PSP
O   w[-1]=सेना  w[0]=में    w[1]=सहायक  pos[-1]=JJpos[0]=PSP    pos[1]  JJ
O   w[-1]=में   w[0]=सहायक  w[1]=प्रणाली    pos[-1]=NNpos[0]=JJ pos[1]  NN
O   w[-1]=सहायक w[0]=प्रणाली    w[1]=बैटमैन pos[-1]=NNpos[0]=NN pos[1]  NN
O   w[-1]=प्रणाली   w[0]=बैटमैन w[1]=सिस्टम pos[-1]=NNpos[0]=NN pos[1]  NN
O   w[-1]=बैटमैन    w[0]=सिस्टम w[1]=पर pos[-1]=CCpos[0]=NN pos[1]  CC
O   w[-1]=सिस्टम    w[0]=पर w[1]=सवाल   pos[-1]=NNpos[0]=CC pos[1]  NN
O   w[-1]=पर    w[0]=सवाल   w[1]=उठ pos[-1]=VMpos[0]=NN pos[1]  VM
O   w[-1]=सवाल  w[0]=उठ w[1]=गया    pos[-1]=VMpos[0]=VM pos[1]  VM
O   w[-1]=उठ    w[0]=गया    w[1]=.  pos[-1]VM   pos[0]=VM   pos[1]=.
0   w[-1]=गया   w[0]=.  pos[-1]=VM  pos[0]=.    __EOS__
rohanjhaaa commented 6 years ago

@dexception - Do you tell me what is your word feature of CRF for Hindi ? I really need it please help me out

dexception commented 6 years ago

Well the entire dataset is around 11 GB. Training takes around 10-12 days. I allocate about 220 GB RAM.

It's a commercial product so can't reveal more details.

image

usptact commented 6 years ago

@dexception A protip: For very large datasets you can use the l2sgd or the arow algorithm. The latter is very fast and consume not much of RAM! There are two parameters to tune to get the best performance.

The algorithm can be set during the training phase like this: crfsuite learn -a arow

To see the parameters and their description: crfsuite learn -a arow -H

For the AROW algorithm, you want to tune the "variance" and "gamma" parameters. You can set those on a small subset of your data.

If you don't want to load the complete dataset in memory, unfortunately you can't do that with crfsuite. You can learn a sequence tagging model using Vowpal Wabbit using the Learning to Search facilities. The learning is online and there are multiple passes over the data. That is the subject of another discussion though.

For the size of the dataset you have, you may consider using a Deep Learning model like LSTM. You should be able to get a better performance.

dexception commented 6 years ago

@usptact

Java version of the Tensorflow is not stable enough to be put into production. So trying to manage with what is available. In future we will implement our own library but that will take time.