emorynlp / nlp4j-old

NLP tools developed by Emory University.

Universal Dependencies #26

Closed · benson-basis closed this issue 8 years ago

benson-basis commented 8 years ago

Do you know of any converter from the NLP4J representation to Universal Dependencies?

Or, conversely, do you have any sense of what would happen if someone trained the dependency model on UD input?

jdchoi77 commented 8 years ago

A UD model can easily be trained with NLP4J. Our current dependency format more or less adapts UD; however, there are a few parts of UD that we do not want to follow, so it's not likely we'll make a total transfer to UD.

benson-basis commented 8 years ago

I created a little wrapper around NLPTrain, and then I adapted your sample training XML file. I got a fairly low score.

633320 [main] INFO edu.emory.mathcs.nlp.common.util.BinUtils - 0: Best: 86.00, epoch = 15
633341 [main] INFO edu.emory.mathcs.nlp.common.util.BinUtils - Saving the model

I've pasted the args to NLPTrain and the XML here in case something jumps out at you that looks stupid.

I wonder about the 'feats' column, not to mention the many other possible parameters.

../command/target/appassembler/bin/nlptrain \
    -mode dep -c config-train-dep.xml \
    -t /data/universal-dependencies/ud-treebanks-v1.3/UD_English/en-ud-train.conllu \
    -d /data/universal-dependencies/ud-treebanks-v1.3/UD_English/en-ud-dev.conllu \
    -m umodel.xy 
<!-- dependency parsing -->
<configuration>
    <tsv>
        <column index="1" field="form"/>
        <column index="2" field="lemma"/>
        <column index="3" field="pos"/>
        <column index="5" field="feats"/>
        <column index="6" field="dhead"/>
        <column index="7" field="deprel"/>
    </tsv>

    <lexica>
        <word_clusters field="word_form_simplified_lowercase">edu/emory/mathcs/nlp/lexica/en-brown-clusters-simplified-lowercase.xz</word_clusters>
    </lexica>

    <optimizer>
        <l1_regularization>0.00001</l1_regularization>
        <algorithm>adagrad-mini-batch</algorithm>
        <learning_rate>0.02</learning_rate>
        <feature_cutoff>2</feature_cutoff>
        <lols fixed="2" decaying="0.95"/>
        <batch_size>5</batch_size>
        <max_epoch>20</max_epoch>
        <bias>0</bias>
    </optimizer>

    <reducer>
        <lower_bound>88.91</lower_bound>
        <increment>0.01</increment>
        <iteration>2</iteration>
        <start>0.04</start>
        <range>0.005</range>
    </reducer>

    <feature_template>
        <!-- basic features -->
        <feature f0="i:lemma"/>
        <feature f0="j:lemma"/>
        <feature f0="i:part_of_speech_tag"/>
        <feature f0="j:part_of_speech_tag"/>

        <feature f0="i:part_of_speech_tag" f1="i:lemma"/>
        <feature f0="j:part_of_speech_tag" f1="j:lemma"/>

        <feature f0="i:part_of_speech_tag" f1="j:part_of_speech_tag"/>
        <feature f0="i:part_of_speech_tag" f1="j:lemma"/>
        <feature f0="i:lemma"              f1="j:part_of_speech_tag"/>
        <feature f0="i:lemma"              f1="j:lemma"/>

        <!-- 1-gram features -->
        <feature f0="k-1:lemma"/>
        <feature f0="i-1:lemma"/>
        <feature f0="i+1:lemma"/>
        <feature f0="j-2:lemma"/>
        <feature f0="j-1:lemma"/>
        <feature f0="j+1:lemma"/>
        <feature f0="j+2:lemma"/>

        <feature f0="i-2:part_of_speech_tag"/>
        <feature f0="i-1:part_of_speech_tag"/>
        <feature f0="i+1:part_of_speech_tag"/>
        <feature f0="i+2:part_of_speech_tag"/>
        <feature f0="j-1:part_of_speech_tag"/>
        <feature f0="j+1:part_of_speech_tag"/>

        <!-- 2-gram features -->
        <feature f0="i:part_of_speech_tag" f1="k-1:part_of_speech_tag"/>
        <feature f0="i:part_of_speech_tag" f1="j+1:part_of_speech_tag"/>
        <feature f0="j:part_of_speech_tag" f1="k-1:part_of_speech_tag"/>

        <feature f0="i:lemma" f1="j-1:part_of_speech_tag"/>
        <feature f0="i:lemma" f1="j+1:part_of_speech_tag"/>
        <feature f0="j:lemma" f1="j+1:part_of_speech_tag"/>

        <feature f0="j+1:lemma" f1="i:part_of_speech_tag"/>
        <feature f0="j+1:lemma" f1="j:part_of_speech_tag"/>
        <feature f0="i+1:lemma" f1="i:lemma"/>
        <feature f0="i+1:lemma" f1="j:lemma"/>

        <!-- 3-gram features -->
        <feature f0="i-2:part_of_speech_tag" f1="i-1:part_of_speech_tag" f2="i:part_of_speech_tag"/>
        <feature f0="i-1:part_of_speech_tag" f1="i:part_of_speech_tag"   f2="i+1:part_of_speech_tag"/>
        <feature f0="j-1:part_of_speech_tag" f1="j:part_of_speech_tag"   f2="j+1:part_of_speech_tag"/>
        <feature f0="j:part_of_speech_tag"   f1="j+1:part_of_speech_tag" f2="j+2:part_of_speech_tag"/>

        <feature f0="k-2:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="i-1:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="i+1:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="j-2:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="j-1:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="j+1:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="j+2:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="j+3:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>

        <!-- valency features -->
        <feature f0="i:valency:all" f1="i:lemma"/>
        <feature f0="j:valency:all" f1="j:lemma"/>

        <!-- 2nd-order features -->
        <feature f0="i:dependency_label"/>
        <feature f0="j:dependency_label"/>
        <feature f0="i_lmd:dependency_label"/>

        <feature f0="i_h:lemma"/>
        <feature f0="i_lmd:lemma"/>
        <feature f0="i_rmd:lemma"/>
        <feature f0="j_lmd:lemma"/>

        <feature f0="i_h:part_of_speech_tag"/>
        <feature f0="i_rmd:part_of_speech_tag"/>
        <feature f0="j_lmd:part_of_speech_tag"/>

        <feature f0="i:dependency_label" f1="i:lemma"/>
        <feature f0="i:dependency_label" f1="j:lemma"/>
        <feature f0="i:dependency_label" f1="i:part_of_speech_tag"/>
        <feature f0="i:dependency_label" f1="j:part_of_speech_tag"/>

        <feature f0="i_lmd:dependency_label"   f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="i_rmd:dependency_label"   f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="j_lmd:dependency_label"   f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="i_lns:dependency_label"   f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>

        <feature f0="i_lmd:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="i_rmd:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="j_lmd:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>

        <!-- 3rd-order features -->
        <feature f0="i_h:dependency_label"/>
        <feature f0="j_h:dependency_label"/>

        <feature f0="i_h2:lemma"/>
        <feature f0="j_lmd2:lemma"/>

        <feature f0="i_lmd2:part_of_speech_tag"/>
        <feature f0="i_rmd2:part_of_speech_tag"/>
        <feature f0="j_lmd2:part_of_speech_tag"/>

        <feature f0="i_h:dependency_label" f1="i:lemma"/>
        <feature f0="i_h:dependency_label" f1="j:lemma"/>
        <feature f0="i_h:dependency_label" f1="j:part_of_speech_tag"/>

        <feature f0="i_lns2:dependency_label"   f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="i_lmd2:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="i_rmd2:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>
        <feature f0="j_lmd2:part_of_speech_tag" f1="i:part_of_speech_tag" f2="j:part_of_speech_tag"/>

        <feature f0="i_lmd2:part_of_speech_tag" f1="i_lmd:part_of_speech_tag" f2="i:part_of_speech_tag"/>

        <!-- distributional semantics features -->
        <feature set="true" f0="i:word_clusters"/>
        <feature set="true" f0="j:word_clusters"/>
        <feature set="true" f0="i+1:word_clusters"/>
        <feature set="true" f0="j+1:word_clusters"/>

        <!-- positional features -->
        <feature set="true" f0="i:positional"/>
        <feature set="true" f0="j:positional"/>
    </feature_template>
</configuration>
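For reference, the column indices in the <tsv> block appear to be 0-based offsets into the CoNLL-U columns (that is an assumption about how the reader counts, not something I've verified), which would make the mapping above pick up FORM, LEMMA, UPOS, FEATS, HEAD, and DEPREL from a line like this made-up example:

# 0=ID  1=FORM  2=LEMMA  3=UPOS  4=XPOS  5=FEATS       6=HEAD  7=DEPREL  8=DEPS  9=MISC
2       dogs    dog      NOUN    NNS     Number=Plur   3       nsubj     _       _

Under that reading, index 3 selects the Universal POS column, and index 4 would select the language-specific (PTB-style) tags instead.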
jdchoi77 commented 8 years ago

The score will probably improve with some hyperparameter tuning, but I'm wondering how well other parsers do on this dataset. Do you have any sense? Thanks.

benson-basis commented 8 years ago

What we know are the scores that Stanford publishes for their Universal model. What we don't know is what data it's trained on -- it might not be this data, or it might not be only this data. We're looking into that and will report back here what we learn. We also just noticed the Stanford tool for converting from PTB to UD, so I think our next step will be to round up all the data we can get, add it to the training data, and see whether the results approach Stanford's reported accuracy.
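For anyone following along, the conversion step would look something like the command below. The class name and flag are from memory and may differ across CoreNLP versions, and the file names are just placeholders, so treat this as a sketch rather than the exact invocation:

java -cp stanford-corenlp.jar edu.stanford.nlp.trees.ud.UniversalDependenciesConverter \
    -treeFile wsj-trees.mrg > wsj-ud.conllu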

jdchoi77 commented 8 years ago

A recent paper I just reviewed for EMNLP reports about 79% LAS. Is the 86% we're getting LAS or UAS?

benson-basis commented 8 years ago

I pasted in the last score printed by NLPTrain; I don't know which scoring method it uses. In fact, I don't know whether those numbers are really a score or just some internal artifact of the training process.

benson-basis commented 8 years ago

Oh, D'oh. What you want is:

429911 [main] INFO  edu.emory.mathcs.nlp.common.util.BinUtils  -  0:   13: LAS = 86.12, UAS = 88.22, L = 131, SF =  647514, NZW =  4030674, N/S =  11784

So we're doing pretty well here. The '.12' comes from switching from the Universal POS column to the PTB POS column in the corpus.
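For anyone reading this later: the two numbers measure per-token attachment accuracy. A minimal sketch of the computation (my own illustration, not NLP4J's evaluator), assuming parallel arrays of gold and predicted heads and labels for one sentence:

public final class LasUas {
    // Returns {LAS, UAS} as percentages for one sentence.
    public static double[] score(int[] goldHead, String[] goldLabel,
                                 int[] predHead, String[] predLabel) {
        int las = 0, uas = 0, n = goldHead.length;
        for (int i = 0; i < n; i++) {
            if (goldHead[i] == predHead[i]) {
                uas++;                                        // unlabeled: correct head only
                if (goldLabel[i].equals(predLabel[i])) las++; // labeled: correct head and deprel
            }
        }
        return new double[] { 100.0 * las / n, 100.0 * uas / n };
    }
}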

I'm currently using the 'gold' POS tags from the UD corpus. I set up a version of the data in which I ran the corpus through the NLP4J tagger, but the training procedure stalls after a terrible first iteration. I probably did something pretty dumb; I'm trying to figure out what.
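The retagging setup is roughly: run the NLP4J POS tagger over the corpus, then splice the predicted tag column back into the gold file so the heads and labels stay untouched. Here is a rough sketch of the splice step, purely for illustration (a hypothetical helper, not the exact script used here; it assumes the tagged output is line-aligned with the gold CoNLL-U file, comments and all):

import java.io.*;
import java.nio.file.*;
import java.util.List;

public final class SpliceTags {
    // args: gold.conllu tagged.conllu output.conllu
    public static void main(String[] args) throws IOException {
        List<String> gold = Files.readAllLines(Paths.get(args[0]));
        List<String> pred = Files.readAllLines(Paths.get(args[1]));
        try (PrintWriter out = new PrintWriter(args[2])) {
            for (int i = 0; i < gold.size(); i++) {
                String g = gold.get(i);
                if (g.isEmpty() || g.startsWith("#")) { out.println(g); continue; }
                String[] gc = g.split("\t");
                String[] pc = pred.get(i).split("\t");
                gc[3] = pc[3];                       // replace gold UPOS with the predicted tag
                out.println(String.join("\t", gc));
            }
        }
    }
}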

benson-basis commented 8 years ago

I got the retagging working; here's the result:

245535 [main] INFO  edu.emory.mathcs.nlp.common.util.BinUtils  -  0:    7: LAS = 84.04, UAS = 86.88, L = 131, SF =  567594, NZW =  3440448, N/S =  12364

It's not surprising that this is worse, but it's presumably what we want to use, since in real life we're using the actual output of your tagger.

benson-basis commented 8 years ago

I did a quick test of the Stanford parser on the UD dev set; I didn't have it predict the POS tags in this run. In any case, it suggests that I'm getting perfectly good results with NLP4J, so we could close out this issue.

UAS = 81.6625 LAS = 77.3502

jdchoi77 commented 8 years ago

Great, thanks for checking this out. I'm closing this issue.