Persian Language parser and reranker

BLLIP / bllip-parser

BLLIP reranking parser (also known as the Charniak-Johnson parser, Charniak parser, or Brown reranking parser). See http://pypi.python.org/pypi/bllipparser/ for the Python module.

http://bllip.cs.brown.edu/


Persian Language parser and reranker #22

Closed mohammadsadeghzadeh closed 9 years ago

mohammadsadeghzadeh commented 10 years ago

Is it possible to build a Persian parser and reranker based on bllip-parser? Assuming I have a Persian treebank, what do I have to do?

dmcc commented 10 years ago

Yes, it is possible. Here's a link to the parser training README. Before running trainParser, you'll probably need to change your terms.txt file to match the new phrasal and terminal labels and give them the appropriate code.

Once you've done that, there's a second README for retraining the reranker.
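
As a starting point for the terms.txt step above, here is a minimal sketch that lists the phrasal and preterminal (part-of-speech) labels found in a Penn Treebank style file, so they can be compared against terms.txt. The function name `collect_labels`, the sample tree, and its labels are all made up for illustration. The script does not write terms.txt itself, because the code to assign to each label has to come from the training README.

```python
import re

def collect_labels(tree_text):
    """Return (phrasal_labels, preterminal_labels) found in bracketed trees."""
    # A preterminal looks like "(TAG word)": a label followed by one token.
    preterminals = set(re.findall(r"\(([^\s()]+)\s+[^\s()]+\)", tree_text))
    # A phrasal label is followed directly by another opening bracket.
    phrasal = set(re.findall(r"\(([^\s()]+)\s*(?=\()", tree_text))
    return phrasal, preterminals

# Hypothetical Persian tree; the label inventory is an assumption.
sample = "(S1 (S (NP (PRO من)) (VP (NP (N کتاب) (POSTP را)) (V خواندم))))"
phrasal, preterminals = collect_labels(sample)
print(sorted(phrasal))
print(sorted(preterminals))
```

Running this over the whole treebank gives the full label inventory that terms.txt needs to cover.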

mohammadsadeghzadeh commented 10 years ago

Thank you, dear McClosky. But what about the "featInfo.*" files? Is it necessary to change them? Please also keep in mind that Persian is a right-to-left language, like Arabic.

dmcc commented 10 years ago

It's not necessary to change the featInfo.* files -- it will produce some parsing model with the default ones, it just might not be the best you can do for Persian. It could potentially be helpful to adjust those as they're likely (at least somewhat) optimized for English.

As long as the words are stored in reading order (the first word read comes first in the file), the different text direction shouldn't matter (that is, as the parser reads a stream of tokens, it should see the first spoken word first, etc.).
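
To illustrate the point about token order, here is a small Python check. It assumes the text is ordinary Unicode, which stores right-to-left scripts in logical (reading) order and leaves the right-to-left layout to the display. The sample sentence is only an example and is not taken from any treebank.

```python
import unicodedata

# "I read the book": the first word read aloud is "من".
sentence = "من کتاب را خواندم"
tokens = sentence.split()

# The first token in memory is the first word read, even though
# a terminal or editor draws it at the right edge of the line.
first_token = tokens[0]

# Every token starts with a character of bidi class "AL" (Arabic letter),
# which is what tells a renderer to lay the text out right to left.
bidi_classes = [unicodedata.bidirectional(tok[0]) for tok in tokens]
print(first_token, bidi_classes)
```

If the treebank was produced by a tool that stores text in visual order instead, the tokens would need to be reversed before training.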

Another thing in the parser that might be worth exploring is the unknown word model. The English unknown word model only looks at the last two characters, hyphenation, and capitalization to determine the part of speech, but it might be worth tailoring this to Persian.
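
The sketch below shows in Python the kind of surface features described above. It is not the parser's actual code; the function name `unknown_word_features` and the feature names are invented for illustration. It mainly shows why the capitalization feature carries no information for Persian.

```python
def unknown_word_features(word):
    """Surface cues of the kind described above (illustrative only)."""
    return {
        "suffix2": word[-2:],               # last two characters
        "has_hyphen": "-" in word,          # hyphenation
        "capitalized": word[:1].isupper(),  # initial capital letter
    }

english = unknown_word_features("Re-running")
persian = unknown_word_features("خواندم")

# Persian script has no letter case, so "capitalized" is always False
# and cannot help separate proper nouns from other words.
print(english)
print(persian)
```

A Persian-specific model might replace the capitalization cue with something else, such as longer suffixes, but which features actually help would have to be tested on the treebank.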

mohammadsadeghzadeh commented 10 years ago

I have prepared a Persian treebank in Penn Treebank format. It contains 1028 sentences, which is unfortunately not many. What do you recommend for selecting the training and development corpora? Thank you so much.

dmcc commented 10 years ago

Given the circumstances, I think you might be able to get by with only 100-200 sentences of development, though I haven't actually tried it.

With only 1028 sentences total, the overall parser probably won't work too well. You should try using the small corpus flag (parseIt -s). This will enable some extra smoothing.
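
One way to make such a split is sketched below, assuming the trees have already been loaded into a Python list with one tree per item. The function name `split_treebank`, the default of 150 development sentences (picked from the 100-200 range mentioned above), and the fixed seed are all assumptions for illustration.

```python
import random

def split_treebank(trees, dev_size=150, seed=0):
    """Shuffle the trees and hold out dev_size of them for development."""
    if not 0 < dev_size < len(trees):
        raise ValueError("dev_size must be between 1 and len(trees) - 1")
    shuffled = list(trees)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[dev_size:], shuffled[:dev_size]

# Placeholder strings stand in for the 1028 bracketed trees.
trees = [f"(S1 tree_{i})" for i in range(1028)]
train, dev = split_treebank(trees)
print(len(train), len(dev))
```

Shuffling before splitting keeps the development set from being drawn from a single document or genre, and the fixed seed makes the split reproducible.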

dmcc commented 9 years ago

Closing this issue for now -- please let me know if you have more questions.