Hyperparticle / udify

A single model that parses Universal Dependencies across 75 languages. Given a sentence, it jointly predicts part-of-speech tags, morphological features, lemmas, and dependency trees.
https://arxiv.org/abs/1904.02099
MIT License

predict.py to work with raw text files #8

Closed drunkinlove closed 4 years ago

drunkinlove commented 4 years ago

Hello!

First of all, thank you for the research and shared code, it's immensely helpful.

I wanted to know if there's an easy way to make predict.py work with raw text files, since that seems to be the point of the architecture. Is there a reason my input files have to conform to the CoNLL-U format, other than computing evaluation metrics?

Hyperparticle commented 4 years ago

It's not strictly necessary to use the conllu predictor; it's just there for convenience during evaluation. All the logic for sentence input and prediction output lives in predictor.py. The _json_to_instance() method is probably what you're interested in: it takes a JSON dict containing a sentence, which you can tokenize (AllenNLP provides a spaCy tokenizer that can handle multilingual text) and then pass to the dataset reader. I can get a simple example working soon.
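
For illustration, a minimal sketch of such a predictor override, assuming the AllenNLP 0.8.x API that UDify builds on; the class name `RawTextPredictor`, the spaCy model name, and the `text_to_instance` signature are assumptions, not the repo's actual code:

```python
from allennlp.common.util import JsonDict
from allennlp.data import Instance
from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter
from allennlp.predictors.predictor import Predictor


class RawTextPredictor(Predictor):  # hypothetical name, not from the repo
    def __init__(self, model, dataset_reader):
        super().__init__(model, dataset_reader)
        # spaCy's multilingual pipeline; any installed spaCy model name works.
        self._tokenizer = SpacyWordSplitter(language="xx_ent_wiki_sm")

    def _json_to_instance(self, json_dict: JsonDict) -> Instance:
        sentence = json_dict["sentence"]
        tokens = [t.text for t in self._tokenizer.split_words(sentence)]
        # Assumes the dataset reader's text_to_instance accepts a list of
        # word strings; check the reader's actual signature in the repo.
        return self._dataset_reader.text_to_instance(tokens)
```

With a predictor like this, `predictor.predict_json({"sentence": "..."})` would run the full tokenize-then-parse pipeline on one raw sentence.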

Hyperparticle commented 4 years ago

I added a new option --raw_text to predict.py that can take an input file of one sentence per line and output one json annotation object per line.

Hope this helps. Let me know if you need any additional help.
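
For anyone consuming that output, a hedged sketch of reading the JSON-lines file; the annotation key names below are assumptions, so inspect one real line of your output first:

```python
import json

# Each line of the output file is one JSON annotation object per sentence.
with open("predictions.json") as f:
    for line in f:
        annotation = json.loads(line)
        # Key names ("words", "upos") are assumptions; confirm them against
        # what the model actually emits.
        print(annotation.get("words"), annotation.get("upos"))
```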

drunkinlove commented 4 years ago

great, thanks a lot!

jzhou316 commented 4 years ago

Hi, I also found the option to directly input a raw text file for prediction very useful. Thanks for that! I have another small question, related to tokenization and BPE encoding. If my text file is already tokenized (with my own tokenizer) and split by BPE, does it still work? And how are the BPE sub-words handled in the word-level tag prediction?

Hyperparticle commented 4 years ago

Tokenized text should work, but if you've already split it with BPE I'm not sure what will happen. You could either recombine the subwords so the model can split them itself (easy), or modify the tokenizer code in the repo to bypass the BPE step (harder).

The BPE subword embeddings are all discarded except for the first subword, which is used to represent the whole word. In my experiments I found no discernible difference between using the first, the last, or the average of all subword embeddings. This is also explained in the paper, and there's existing work that reports similar findings.
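
A toy sketch of that first-subword strategy (illustrative only, not the repo's code): after BERT-style tokenization, only the hidden state at each word's first subword position feeds the word-level taggers.

```python
import torch

words = ["unbelievable", "results"]
# Hypothetical wordpiece split; real splits depend on the BERT vocabulary.
subwords = [["un", "##believ", "##able"], ["results"]]

# Flatten the subwords and record the index of each word's first piece.
flat, first_idx = [], []
for pieces in subwords:
    first_idx.append(len(flat))
    flat.extend(pieces)

hidden = torch.randn(len(flat), 768)          # stand-in for BERT hidden states
word_repr = hidden[torch.tensor(first_idx)]   # one vector per word
assert word_repr.shape == (len(words), 768)
```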

jzhou316 commented 4 years ago

Thanks for the clarification! That really helps.