Ambiguity & Word Cluster Classes: POS Tagging

dulocian commented 7 years ago

Hi,

I would like to know whether it is possible to train my own ambiguity class and word cluster models for POS tagging South African languages.

The only options available are the currently included models:

en-ambiguity-classes-simplified.xz
en-ambiguity-classes-simplified-lowercase.xz
en-brown-clusters-simplified-lowercase.xz
en-brown-clusters-twit-lowercase.xz

If it is possible, how would I go about creating them?

Regards

jdchoi77 commented 7 years ago

Brown clusters are relatively easy: you can use any available tool to generate them and then use the following class to convert them into the NLP4J format:

https://github.com/emorynlp/nlp4j/blob/master/cli/src/main/java/edu/emory/mathcs/nlp/bin/BrownClusterExtract.java

An ambiguity class file is a hash map in which each key is a word and each value is the list of possible POS tags for that word. You can likewise save this map as a serialized Java object and compress it in the xz format.
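The description above can be sketched roughly as follows. This is only an illustration under assumptions: the class name AmbiguityClassSketch and its methods are hypothetical, the exact map type NLP4J expects (assumed here to be Map<String, List<String>>) should be checked against the NLP4J source, and the keys may need the same normalization as the chosen field (for example simplified and lowercased). To obtain an .xz file, the output stream would presumably be wrapped in an XZ compressor such as XZOutputStream from the XZ for Java library (org.tukaani.xz); that dependency is left out so the sketch runs on the standard library alone.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of NLP4J.
final class AmbiguityClassSketch {
    /**
     * Builds word -> sorted list of distinct tags from (word, tag) pairs.
     * Each String[] is expected to hold the word at index 0 and the tag at index 1.
     */
    static Map<String, List<String>> build(List<String[]> taggedTokens) {
        Map<String, TreeSet<String>> seen = new HashMap<>();
        for (String[] pair : taggedTokens) {
            seen.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        Map<String, List<String>> result = new HashMap<>();
        for (Map.Entry<String, TreeSet<String>> e : seen.entrySet()) {
            result.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return result;
    }

    /**
     * Serializes the map as a Java object. Assumption: for the .xz format,
     * pass in an XZ-compressing stream that wraps a FileOutputStream.
     */
    static void write(Map<String, List<String>> map, OutputStream out) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(map);
        }
    }
}
```

The (word, tag) pairs would come from the same tagged corpus that is used to train the tagger.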

Please let me know if this makes sense. Thanks.

best,

Jinho

dulocian commented 6 years ago

Hi Jinho,

I made use of Percy Liang's C++ implementation of the Brown hierarchical word clustering algorithm.

Once the clusters were created with Liang's implementation, I converted them using the script specified in your response. I then placed the converted files in nlp4j-english-1.1.2.jar, in the lexica directory alongside the other cluster and ambiguity class files.
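For inspecting the clusters before conversion, a minimal sketch along these lines may help. It assumes the clustering tool writes a paths file with one tab-separated line per word in the order bit string, word, count, which is my understanding of the output of Liang's implementation. The class name BrownPathsSketch is hypothetical, and this is not the logic of BrownClusterExtract.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for inspection only, not part of NLP4J.
final class BrownPathsSketch {
    /** Reads lines of the assumed form "bitstring<TAB>word<TAB>count" into word -> bit string. */
    static Map<String, String> read(Reader in) throws IOException {
        Map<String, String> wordToPath = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\t");
                if (cols.length < 2) continue; // skip blank or malformed lines
                wordToPath.put(cols[1], cols[0]);
            }
        }
        return wordToPath;
    }
}
```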

In the config-decode-pos.xml and config-train-pos.xml files, I adapted the following lexica field:

<word_clusters field="word_form_lowercase">edu/emory/mathcs/nlp/lexica/SA-lang-clusters.xz</word_clusters>

No errors arise when training with these settings; however, the accuracy of the POS tagger model is unchanged compared with the control model, which is trained without the cluster file. I am not sure what could be causing this.

I have also tried each of the possible word cluster fields:

word_form
word_form_lowercase
word_form_undigitalized
word_form_simplified
word_form_simplified_lowercase
word_shape
word_shape_lowercase
orthographic
orthographic_lowercase
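One check that might be worth running here is how many corpus tokens actually match a key in the cluster file once the field's normalization is applied. The sketch below is only an assumption-laden illustration: ClusterCoverageSketch is a hypothetical name, and plain lowercasing stands in for whatever NLP4J really does for word_form_lowercase, which may differ.

```java
import java.util.Collection;
import java.util.Locale;
import java.util.Set;

// Hypothetical diagnostic, not part of NLP4J.
final class ClusterCoverageSketch {
    /**
     * Returns the fraction of corpus tokens whose lowercased form is a cluster key,
     * or 0.0 for an empty corpus. Lowercasing is an assumed stand-in for the
     * normalization behind the word_form_lowercase field.
     */
    static double lowercaseCoverage(Collection<String> corpusTokens, Set<String> clusterKeys) {
        if (corpusTokens.isEmpty()) return 0.0;
        long hits = 0;
        for (String token : corpusTokens) {
            if (clusterKeys.contains(token.toLowerCase(Locale.ROOT))) hits++;
        }
        return (double) hits / corpusTokens.size();
    }
}
```

A coverage near zero would suggest that the keys in the converted file do not line up with the normalized tokens; a high coverage would point the investigation elsewhere.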

Please assist.

Regards, J.
