Closed jbrry closed 4 years ago
Thanks for your contribution, your assessments are correct. I wanted to do the same thing you have done but never got around to it. If you can, a PR would be greatly appreciated.
Thanks, I made a few other small changes: conllu returns tuple objects for the ids of elided tokens and multiword tokens, e.g. (8, '.', 1) and (105, '-', 106) respectively, so I had to add another check that sets the token id to None when the token is an elided token as well.
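To make the id handling concrete, here is a minimal sketch of that extra check. The token dicts below are hand-built stand-ins for what the conllu library returns (regular tokens get int ids, elided tokens get tuples like (8, '.', 1), multiword tokens get tuples like (105, '-', 106)), so the library itself isn't needed to run it:

```python
# Hand-built stand-ins for the token dicts the conllu library returns;
# the id values mirror the examples above.
annotation = [
    {"id": 7,               "form": "word"},
    {"id": (8, ".", 1),     "form": "elided"},
    {"id": (105, "-", 106), "form": "mwt-span"},
]

def null_out_special_ids(tokens):
    """Set id to None for elided and multiword tokens so the
    dataset reader can skip them when building instances."""
    for token in tokens:
        if isinstance(token["id"], tuple):
            token["id"] = None
    return tokens

null_out_special_ids(annotation)
print([t["id"] for t in annotation])  # [7, None, None]
```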
I have confirmed that the outputs are the same for both en_ewt and fr_gsd (fr_gsd only shown here):
jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_updated/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | wc -l
44020
jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_updated/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | md5sum
c489d1a0890b84e6f30272feca0905f2
jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_original/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | wc -l
44020
jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_original/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | md5sum
c489d1a0890b84e6f30272feca0905f2
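The same comparison can be scripted instead of piping through md5sum; this is just a convenience sketch (the directory names are the ones from the session above and are not created by this snippet):

```python
import hashlib
from pathlib import Path

# Vocabulary files listed in the shell session above.
VOCAB_FILES = ["feats.txt", "head_tags.txt", "lemmas.txt",
               "token_characters.txt", "tokens.txt", "upos.txt", "xpos.txt"]

def vocab_digest(vocab_dir):
    """md5 of the concatenated vocabulary files, mirroring
    `cat <files> | md5sum` from the shell session."""
    h = hashlib.md5()
    for name in VOCAB_FILES:
        h.update(Path(vocab_dir, name).read_bytes())
    return h.hexdigest()

# Hypothetical usage with the directory layout from the session:
# assert vocab_digest("logs/fr_gsd/fr_gsd_updated/vocabulary") == \
#        vocab_digest("logs/fr_gsd/fr_gsd_original/vocabulary")
```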
I will do a full run on en_ewt and fr_gsd and use the predictor to make sure everything is still working as normal there and then submit the PR!
I did a few checks and everything seems to have the same behaviour. I trained a model for en_ewt and fr_gsd using the upstream conllu library and the post-processing of token IDs/MWTs in parser.py, then used predict.py to predict on the dev files. Finally, I reverted the dataset reader to the original v1 version to compare outputs:
# fr_gsd conllu output files
1b5016fa8ce46578c853035dac673b7b pred_fr_gsd.conllu
1b5016fa8ce46578c853035dac673b7b pred_fr_gsd-v1.conllu
# en_ewt conllu output files
9b4560573aa4a5057b83fe9057749128 pred_en_ewt.conllu
9b4560573aa4a5057b83fe9057749128 pred_en_ewt-v1.conllu
I then downloaded the udify-model.tar.gz model from Lindat and predicted with both the original v1 and the updated dataset reader:
# fr_gsd conllu output files
9369a87a7502f64054ff2b7ba125ea25 udify_pred_fr_gsd.conllu
9369a87a7502f64054ff2b7ba125ea25 udify_pred_fr_gsd-v1.conllu
# en_ewt conllu output files
d2189dfdfef3487606de4e4350bac676 udify_pred_en_ewt.conllu
d2189dfdfef3487606de4e4350bac676 udify_pred_en_ewt-v1.conllu
# Accordingly, the JSON results files are also the same:
d9899d7d68a7f63cf4684d757031dea8 udify_pred_fr_gsd.json
d9899d7d68a7f63cf4684d757031dea8 udify_pred_fr_gsd-v1.json
So I think that it should be safe enough to make these changes without breaking the code, though users may need to install the latest conllu version to use parse_incr:

pip install -U conllu
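For anyone unfamiliar with it: parse_incr streams sentences one at a time instead of loading the whole file like parse does. A dependency-free sketch of the idea (this is an illustration of the streaming behaviour, not the library's actual implementation):

```python
import io

def sentence_blocks(lines):
    """Yield one CoNLL-U sentence (as a list of lines) at a time --
    roughly the lazy, streaming behaviour that parse_incr provides
    over a file object, versus parse() reading everything at once."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:          # blank line terminates a sentence
            if block:
                yield block
                block = []
        else:
            block.append(line)
    if block:                 # file may lack a trailing blank line
        yield block

sample = "# sent_id = 1\n1\tBonjour\n\n# sent_id = 2\n1\tMerci\n"
print([len(b) for b in sentence_blocks(io.StringIO(sample))])  # [2, 2]
```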
I will submit the PR now.
Finally got around to reviewing your changes. Looks great to me.
Thanks Dan!
Hi Dan, I see in the code and in #5 that updating the conllu library is on the agenda.
I have made a few modifications in my forked version of UDify. From what I understand, parser.py contains some source code from the conllu library with a few modifications, mainly to handle multiword tokens (see e.g. the MWT lines in fr_gsd-ud-train.conllu for the desired output). In my forked version, I still use the conllu library to return the annotation but do the MWT processing in a subsequent step, in a process_MWTs function. In this version, I have confirmed that the outputs are the same, and I have done a few other checks to make sure the data is identical, where "updated" is the forked version and "original" is the current version.
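Roughly, the subsequent step looks like this (a simplified sketch rather than the exact code in the fork; the token dicts are stand-ins for conllu's output):

```python
def process_MWTs(annotation):
    """Simplified sketch: split the conllu annotation into word-level
    tokens (int ids) and the multiword-token surface spans, which are
    kept aside so the predictor can write them back out later. Elided
    tokens (ids like (8, '.', 1)) are dropped entirely here."""
    words, multiword_ids, multiword_forms = [], [], []
    for token in annotation:
        token_id = token["id"]
        if isinstance(token_id, tuple):
            if token_id[1] == "-":  # surface span like 105-106
                multiword_ids.append("%s-%s" % (token_id[0], token_id[2]))
                multiword_forms.append(token["form"])
            continue
        words.append(token)
    return words, multiword_ids, multiword_forms

annotation = [
    {"id": (1, "-", 2), "form": "du"},
    {"id": 1, "form": "de"},
    {"id": 2, "form": "le"},
]
words, mwt_ids, mwt_forms = process_MWTs(annotation)
print(mwt_ids, mwt_forms)  # ['1-2'] ['du']
```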
There are a few benefits I can see from this: the dataset reader stays in sync with the upstream conllu library, and the duplicated parsing code in parser.py can be removed.
There are probably more elegant ways of going about MWT processing, but I thought I'd post it here in case you find it helpful. If you do, I can run more tests and, once I've confirmed the behaviour is exactly the same, submit a PR.