Hyperparticle / udify

A single model that parses Universal Dependencies across 75 languages. Given a sentence, it jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.
https://arxiv.org/abs/1904.02099
MIT License

Updating conllu library #12

Closed: jbrry closed this issue 4 years ago

jbrry commented 4 years ago

Hi Dan, I see in the code and in #5 that updating the conllu library is on the agenda.

I have made a few modifications in my forked version of UDify. From what I understand, parser.py contains some source code copied from the conllu library with a few modifications, mainly to handle multi-word tokens (MWTs), where the desired output (example from fr_gsd-ud-train.conllu) looks like:

multiword_ids ['3-4', '72-73', '87-88', '105-106', '110-111', '121-122']
multiword_forms ['du', 'des', 'des', 'des', 'du', 'du']

In my forked version, I still use the conllu library to return the annotation, but do the MWT processing in a subsequent step, in a process_MWTs function (sketched below). I confirmed that this version produces the same outputs:

multiword_ids ['3-4', '72-73', '87-88', '105-106', '110-111', '121-122']
multiword_forms ['du', 'des', 'des', 'des', 'du', 'du']
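
For reference, here is a minimal sketch of what such a post-processing step could look like. The name process_MWTs is from my fork, but the body below is illustrative rather than the exact code:

from conllu import parse_incr

def process_MWTs(annotation):
    # Illustrative sketch: recent conllu versions return MWT ids as
    # tuples such as (3, '-', 4) rather than the string '3-4'.
    multiword_ids = []
    multiword_forms = []
    for token in annotation:
        token_id = token["id"]
        if isinstance(token_id, tuple) and token_id[1] == "-":
            multiword_ids.append(f"{token_id[0]}-{token_id[2]}")
            multiword_forms.append(token["form"])
    return multiword_ids, multiword_forms

# Usage: parse a treebank incrementally and collect MWT spans per sentence.
with open("fr_gsd-ud-train.conllu", "r", encoding="utf-8") as f:
    for annotation in parse_incr(f):
        multiword_ids, multiword_forms = process_MWTs(annotation)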

I have done a few more checks to make sure the data is the same, where updated is the forked version and original is the current version, e.g.:

cat fr_gsd_original/vocabulary/tokens.txt | md5sum
e80f1f1e341fc5734c8f3a3d1c779c55 
cat fr_gsd_updated/vocabulary/tokens.txt | md5sum
e80f1f1e341fc5734c8f3a3d1c779c55

There are a few benefits I can see from this:

  1. It supports the most recent conllu library.
  2. It reduces the amount of code needed in parser.py.

There are probably more elegant ways of going about MWT processing, but I thought I'd post it here in case you find it helpful. If you do, I can run more tests, and once I've confirmed the behaviour is exactly the same, I can submit a PR.

Hyperparticle commented 4 years ago

Thanks for your contribution; your assessments are correct. I wanted to do the same thing you have done but never got around to it. If you can, a PR would be greatly appreciated.

jbrry commented 4 years ago

Thanks! I made a few other small changes: conllu returns tuple objects for elided tokens and multiword tokens, e.g. (8, '.', 1) and (105, '-', 106) respectively, so I also had to add another check that sets the token id to None when the token is an elided token.
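
A rough sketch of that extra check, assuming the same tuple-id convention (illustrative, not the exact code from the PR):

def process_token_ids(annotation):
    # conllu returns tuple ids for elided tokens, e.g. (8, '.', 1) for
    # id 8.1, and for MWT ranges, e.g. (105, '-', 106) for id 105-106.
    for token in annotation:
        token_id = token["id"]
        if isinstance(token_id, tuple) and token_id[1] == ".":
            # Elided token: set its id to None so it is excluded from
            # the dependency annotation, like the MWT rows.
            token["id"] = None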

I have confirmed that the outputs are the same for both en_ewt and fr_gsd (only fr_gsd is shown here):

jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_updated/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | wc -l
44020
jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_updated/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | md5sum
c489d1a0890b84e6f30272feca0905f2

jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_original/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | wc -l
44020
jbarry@jbarry-desktop:~/udify/logs/fr_gsd/fr_gsd_original/vocabulary$ cat feats.txt head_tags.txt lemmas.txt token_characters.txt tokens.txt upos.txt xpos.txt | md5sum
c489d1a0890b84e6f30272feca0905f2

I will do a full run on en_ewt and fr_gsd and use the predictor to make sure everything there still works as normal, and then submit the PR!

jbrry commented 4 years ago

I did a few checks and everything seems to have the same behaviour.

I trained a model for en_ewt and fr_gsd using the upstream conllu library with the post-processing of token IDs/MWTs in parser.py, and used predict.py to predict on the dev files. I then reverted the dataset reader to the original version (v1) to compare outputs:

# fr_gsd conllu output files
1b5016fa8ce46578c853035dac673b7b  pred_fr_gsd.conllu
1b5016fa8ce46578c853035dac673b7b  pred_fr_gsd-v1.conllu

# en_ewt conllu output files
9b4560573aa4a5057b83fe9057749128  pred_en_ewt.conllu
9b4560573aa4a5057b83fe9057749128  pred_en_ewt-v1.conllu

I then downloaded the udify-model.tar.gz model from Lindat and predicted with both the original (v1) and the updated dataset reader:

# fr_gsd conllu output files
9369a87a7502f64054ff2b7ba125ea25  udify_pred_fr_gsd.conllu
9369a87a7502f64054ff2b7ba125ea25  udify_pred_fr_gsd-v1.conllu

# en_ewt conllu output files
d2189dfdfef3487606de4e4350bac676  udify_pred_en_ewt.conllu
d2189dfdfef3487606de4e4350bac676  udify_pred_en_ewt-v1.conllu

# Accordingly, the JSON results files are also the same:
d9899d7d68a7f63cf4684d757031dea8  udify_pred_fr_gsd.json
d9899d7d68a7f63cf4684d757031dea8  udify_pred_fr_gsd-v1.json

So I think it should be safe to make these changes without breaking the code, though users may need to install the latest conllu version to use parse_incr:

pip install -U conllu
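
For reference, parse_incr reads a CoNLL-U file lazily, one sentence at a time; a minimal usage example (the file path here is just for illustration):

from conllu import parse_incr

with open("fr_gsd-ud-dev.conllu", "r", encoding="utf-8") as f:
    for token_list in parse_incr(f):
        print(token_list[0]["form"])  # first token form of each sentence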

I will submit the PR now.

Hyperparticle commented 4 years ago

Finally got around to reviewing your changes. Looks great to me.

jbrry commented 4 years ago

Thanks Dan!