datquocnguyen / RDRPOSTagger

A fast and accurate POS and morphological tagging toolkit (EACL 2014)
http://rdrpostagger.sourceforge.net

Successfully trained the RDRPOSTagger in Tamil #2

Closed AshokR closed 8 years ago

AshokR commented 8 years ago

I am happy to report that, after extensive tweaking of my gold-standard training corpus, I have successfully trained the tagger on a corpus of about 200,000 Tamil words. I used 80% of the corpus for training and 20% for testing. The tagger's output differs from my gold-standard test corpus on about 15% of the tags.

It would be great if you could take a look at my corpus and let me know whether there is anything I can do to improve it.
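For readers who want to reproduce this kind of evaluation, here is a minimal Python sketch of an 80/20 split and a per-token accuracy check. It is not part of RDRPOSTagger. The function names are hypothetical, and the sketch assumes one sentence per line in `word/TAG` form, with the split done at sentence level (the post does not say how the corpus was split).

```python
import random

def split_corpus(sentences, train_fraction=0.8, seed=0):
    """Shuffle sentences and split them into (train, test) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def parse_tagged(line):
    """Turn 'word/TAG word/TAG' into [(word, tag), ...].

    Assumed format: the tag follows the last '/' in each token.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

def accuracy(gold_lines, predicted_lines):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_lines, predicted_lines):
        for (_, gold_tag), (_, predicted_tag) in zip(parse_tagged(gold),
                                                     parse_tagged(predicted)):
            total += 1
            correct += gold_tag == predicted_tag
    return correct / total if total else 0.0
```

An accuracy of about 0.85 from `accuracy` would correspond to the roughly 15% difference reported above.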

datquocnguyen commented 8 years ago

So the result is now about 85% tagging accuracy? One way to improve it is to use a better initial tagger. RDRPOSTagger's internal initial tagger assigns a tag to each word from a lexicon extracted from the training corpus, so it is a weak initial tagger. You can improve the tagging accuracy by using a stronger "external" initial tagger, such as the TnT tagger.

It would be nice to have an empirical study (i.e. a paper) evaluating a range of POS taggers on your corpus :)
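To make the idea of a lexicon-based initial tagger concrete, here is a minimal Python sketch. It is not RDRPOSTagger's actual code: the function names are hypothetical, choosing the most frequent tag per word is an assumption, and so is the single fallback tag for words missing from the lexicon.

```python
from collections import Counter, defaultdict

def build_lexicon(tagged_sentences):
    """Map each word to its most frequent tag in the training corpus.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def initial_tag(words, lexicon, default_tag):
    """Tag each word by lexicon lookup; unseen words get default_tag (assumption)."""
    return [(word, lexicon.get(word, default_tag)) for word in words]
```

Because the lookup ignores context, every occurrence of a word gets the same tag. That is why such a tagger is weak, and why a context-aware external tagger can give a better starting point.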

AshokR commented 8 years ago

On further analysis, I find that about half of the words whose tags differed from my gold-standard test corpus fall within the noun family: either a compound noun is tagged as a noun, or vice versa. This is not a showstopper. Excluding these cases, I see an error rate of only about 8%.
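An error breakdown like this can be sketched in Python as follows. The function names are hypothetical, and the tag names `NN` and `NNC` in the usage note are placeholders, not the tags of the Tamil corpus.

```python
from collections import Counter

def confusion_counts(gold_tags, predicted_tags):
    """Count each (gold, predicted) pair where the two tags disagree."""
    return Counter((gold, predicted)
                   for gold, predicted in zip(gold_tags, predicted_tags)
                   if gold != predicted)

def error_rate(gold_tags, predicted_tags, same_family=frozenset()):
    """Fraction of tokens tagged wrongly.

    Confusions where both tags belong to same_family are not counted as errors.
    """
    total = len(gold_tags)
    errors = sum(1 for gold, predicted in zip(gold_tags, predicted_tags)
                 if gold != predicted
                 and not (gold in same_family and predicted in same_family))
    return errors / total if total else 0.0
```

Calling `error_rate(gold, predicted)` gives the overall rate, while `error_rate(gold, predicted, same_family={"NN", "NNC"})` ignores confusions inside the assumed noun family. `confusion_counts(gold, predicted).most_common()` lists the most frequent confusions first.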

Thanks for your tip. I will look into the TnT tagger for the initial tagging.

And thanks again for making this software available and as open source!

datquocnguyen commented 8 years ago

FYI, the TnT tagger can also be downloaded from http://heartofgold.dfki.de/pkg/components-tnt.tar.gz