Hyperparticle / udify

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.
https://arxiv.org/abs/1904.02099
MIT License

Utilize conllu python library #13

Closed: jbrry closed this pull request 4 years ago

jbrry commented 4 years ago

This PR addresses #12 and uses the upstream conllu library to retrieve CoNLL-U annotations. In a post-processing step, the token ids of multi-word tokens and elided tokens are set to None so that these annotations won't be used for prediction. The multi-word token forms and ids are stored as normal so that behaviour in the predictor is unchanged.
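A rough sketch of that post-processing step (not the actual PR code; the function name `read_and_clean` is made up), assuming the conllu library's behaviour of parsing multi-word and elided token ids as tuples:

```python
import conllu

def read_and_clean(conllu_text):
    """Parse CoNLL-U text and blank out ids of multi-word and elided tokens.

    The conllu library parses ordinary token ids as ints, while multi-word
    token ranges (e.g. "1-2") and elided tokens (e.g. "5.1") come back as
    tuples. Setting those ids to None marks them as not to be used for
    prediction, while their forms stay in the annotation so the predictor
    behaves as before.
    """
    annotations = conllu.parse(conllu_text)
    for sentence in annotations:
        for token in sentence:
            if not isinstance(token["id"], int):
                token["id"] = None
    return annotations
```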

jbrry commented 4 years ago

Hi Dan, just to update you a bit on this. I've made two newer commits which are unrelated to this specific PR. Unfortunately, I didn't have the foresight to create a new branch when focusing on just the conllu library changes, so this PR will also track those newer commits.

What 84cc77b and b49f618 do is basically load the provided config file for en_ewt and update train_data_path, validation_data_path, and test_data_path based on the provided treebank name. They also update the vocabulary path. In addition, I added a small utility which counts the number of sentences in the training CoNLL-U file for that treebank and divides this number by the batch size, which gives you the same number of steps for warmup_steps and start_step as the progress bar, as suggested in the README.
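For illustration, a minimal sketch of that utility and the path overrides, assuming a plain-JSON config with AllenNLP-style keys; the function names and the nesting of the vocabulary key are assumptions, not the code in those commits:

```python
import json

def count_conllu_sentences(path):
    """Count sentences in a CoNLL-U file; sentences are separated by blank lines."""
    count, in_sentence = 0, False
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                in_sentence = True
            elif in_sentence:
                count += 1
                in_sentence = False
    if in_sentence:  # file did not end with a trailing blank line
        count += 1
    return count

def configure_for_treebank(config_path, train_path, dev_path, test_path,
                           vocab_path, batch_size):
    """Load the en_ewt config and point it at another treebank's files."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    config["train_data_path"] = train_path
    config["validation_data_path"] = dev_path
    config["test_data_path"] = test_path
    # The exact nesting of the vocabulary path is an assumption.
    config["vocabulary"] = {"directory_path": vocab_path}
    # One epoch's worth of steps at this batch size, which is the value
    # to reuse for warmup_steps and start_step.
    steps = count_conllu_sentences(train_path) // batch_size
    return config, steps
```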

These changes make it a bit easier to run UDify on individual treebanks (i.e. they avoid having to copy and change these values manually).

In any case, I will add a new branch so that this PR won't pick up unrelated commits. Alternatively, I could restructure this PR, i.e. close it and only submit the conllu updates from a freshly created branch.