Closed mohammadsadeghzadeh closed 9 years ago
Yes, it is possible. Here's a link to the parser training README. Before running trainParser, you'll probably need to change your terms.txt file to match the new phrasal and terminal labels and give them the appropriate code.
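To see which labels would need entries, it can help to list every phrasal and terminal (part-of-speech) label that actually occurs in the treebank. The following is only a rough sketch, not part of bllip-parser: `collect_labels` is a hypothetical helper, it assumes Penn Treebank style bracketing, and it deliberately says nothing about the terms.txt format or codes, which are described in the training README.

```python
import re

def collect_labels(tree_text):
    """Return (phrasal_labels, terminal_labels) found in bracketed trees.

    Hypothetical helper: a node whose children are other nodes is
    counted as phrasal; a node that directly holds a word is counted
    as a terminal (part-of-speech) label.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", tree_text)
    phrasal, terminal = set(), set()
    stack = []  # each entry is [label, has_child_node]
    prev = None
    for tok in tokens:
        if tok == "(":
            if stack:
                stack[-1][1] = True  # the enclosing node has a child node
            stack.append([None, False])
        elif tok == ")":
            if stack:
                label, has_child = stack.pop()
                if label is not None:  # skip unlabeled wrapper brackets
                    (phrasal if has_child else terminal).add(label)
        elif prev == "(":
            stack[-1][0] = tok  # first token after "(" is the node label
        # any other token is a word and is ignored
        prev = tok
    return phrasal, terminal

phrasal, terminal = collect_labels("( (S (NP (PRP I)) (VP (VBD slept))))")
print(sorted(phrasal), sorted(terminal))
```

Running this over the whole Persian treebank gives two label sets to compare against what terms.txt currently lists.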
Once you've done that, there's a second README for retraining the reranker.
Thank you, dear McClosky. But what about the "featInfo.*" files? Is it necessary to change them? Also, please keep in mind that Persian is a right-to-left language, like Arabic.
It's not necessary to change the featInfo.* files. Training will still produce a parsing model with the default ones; it just might not be the best you can do for Persian. Adjusting them could potentially help, since they're likely (at least somewhat) optimized for English.
As long as the words are encoded from left-to-right, the different text direction shouldn't matter (that is, as the parser reads a stream of tokens, it should see the first spoken word first, etc.).
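A quick way to check this on your own data is sketched below. It assumes the treebank text is Unicode stored in logical (spoken) order, which is the normal case; right-to-left layout is applied only when the text is displayed. The example sentence is written with escape sequences so the stored order is unambiguous, and the variable names are only illustrative.

```python
# "man ketab ra khandam" ("I read the book"), one escape per character.
sentence = (
    "\u0645\u0646 "                    # man (the first spoken word)
    "\u06a9\u062a\u0627\u0628 "        # ketab
    "\u0631\u0627 "                    # ra
    "\u062e\u0648\u0627\u0646\u062f\u0645"  # khandam
)

# Splitting on whitespace follows the stored order, not the display order.
tokens = sentence.split()

first_word = tokens[0]
print(len(tokens), [hex(ord(ch)) for ch in first_word])
```

If the first token in the list is the first spoken word, the parser will see the sentence in the order it expects.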
Another thing in the parser that might be worth exploring is the unknown word model. The English unknown word model only looks at the last two characters, hyphenation, and capitalization to determine the part of speech, but it might be worth tailoring this to Persian.
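To make the three cues concrete, here is a small illustration. It is not the parser's actual code; `unknown_word_features` is a hypothetical function that simply computes the signals named above. Note that Persian script has no letter case, so the capitalization cue carries no information there, which is one reason tailoring might pay off.

```python
def unknown_word_features(word):
    """Hypothetical sketch of the cues described for the English model."""
    return {
        "suffix2": word[-2:],                # last two characters
        "has_hyphen": "-" in word,           # hyphenation
        "capitalized": word[:1].isupper(),   # leading capital letter
    }

english = unknown_word_features("Well-known")
persian = unknown_word_features("\u06a9\u062a\u0627\u0628")  # "ketab"
print(english)
print(persian["capitalized"])
```

For Persian, replacements for the capitalization cue (for example, longer suffixes or common prefixes) would be something to experiment with; which ones help is an open question here.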
I prepared a Persian treebank in Penn Treebank format. It contains 1028 sentences; unfortunately, this count is low. What is your guidance for selecting the training and development corpora? Thank you so much.
Given the circumstances, I think you might be able to get by with only 100-200 sentences of development, though I haven't actually tried it.
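One way to carve out such a development set is a fixed-seed random split, sketched below. The dev size of 150 is just an assumed value inside the 100-200 range, and `split_treebank` is a hypothetical helper that takes a list of tree strings; it is not part of the parser's tooling.

```python
import random

def split_treebank(trees, dev_size=150, seed=0):
    """Hold out dev_size randomly chosen trees; keep the rest for training."""
    if not 0 < dev_size < len(trees):
        raise ValueError("dev_size must be between 1 and len(trees) - 1")
    indices = list(range(len(trees)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    dev_indices = set(indices[:dev_size])
    train = [t for i, t in enumerate(trees) if i not in dev_indices]
    dev = [t for i, t in enumerate(trees) if i in dev_indices]
    return train, dev

# Placeholder trees standing in for the 1028 real ones.
trees = [f"(S (X w{i}))" for i in range(1028)]
train, dev = split_treebank(trees)
print(len(train), len(dev))  # 878 150
```

Sampling at random rather than taking the last sentences of the file avoids a development set drawn from a single document or topic, though with a corpus this small the scores will be noisy either way.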
With only 1028 sentences total, the overall parser probably won't work too well. You should try using the small corpus flag (parseIt -s). This will enable some extra smoothing.
Closing this issue for now -- please let me know if you have more questions.
Is it possible to build a Persian parser and reranker based on bllip-parser? Assuming I have a Persian treebank, what do I have to do?