Hyperparticle / udify

A single model that parses Universal Dependencies across 75 languages. Given a sentence, it jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.
https://arxiv.org/abs/1904.02099
MIT License

predict.py to work with .conllu files NOT annotated for dependencies? #26

Closed lmompela closed 2 years ago

lmompela commented 2 years ago

Hi there,

I was wondering whether there is a way for me to use predict.py with my corpus data (.conllu), which is not annotated for dependencies but is annotated for POS. My goal at the moment is not to calculate evaluation metrics, but rather to have my pretrained model give me dependency predictions, to hopefully get a head start on dependency annotation. I am working on an underdocumented language and would like a first pass of dependency predictions that I would then go back to, verify, and update to create the gold standard for my language.

Is there a reason my input file has to conform to the CoNLL-U format other than for evaluation metrics? My issue seems to be that my "head" and "deprel" columns are not integers but simply "_", because they're empty. I would prefer to keep the .conllu format for my input file, as it already contains POS information, which could give me better predictions.

Thank you for the research, it's super helpful, especially for underdocumented languages.

Here is my error message: (screenshot attached)

lmompela commented 2 years ago

Nevermind, found a way around! Thanks

andidyer commented 6 months ago

> Nevermind, found a way around! Thanks

I suppose this was a while ago, but do you remember what your solution was? I am also trying to parse a file without annotated dependencies and am facing this issue.

Maybe the solution is just to fill each head field with the index of the token minus 1? This is what I did and it worked. The format looked a bit like this:

# sent_id = 1
# text = "This is a sentence"
1    This    _    _    _    _    0    _    _    _
2    is    _    _    _    _    1    _    _    _
3    a    _    _    _    _    2    _    _    _
4    sentence    _    _    _    _    3    _    _    _

# sent_id = 2
# text = "This is another sentence"
1    This    _    _    _    _    0    _    _    _
2    is    _    _    _    _    1    _    _    _
3    another    _    _    _    _    2    _    _    _
4    sentence    _    _    _    _    3    _    _    _
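The dummy-head workaround above can be scripted. Below is a minimal sketch that fills empty HEAD/DEPREL columns with placeholder values (HEAD = token index minus 1, DEPREL = "dep") so a parser that insists on integer heads will accept the input; the function name and the "dep" placeholder label are my own choices, not part of udify. Since the parser overwrites these columns with its predictions, the placeholder values should not matter for inference.

```python
def fill_dummy_heads(conllu_text: str) -> str:
    """Fill empty HEAD/DEPREL columns of CoNLL-U text with placeholders."""
    out_lines = []
    for line in conllu_text.splitlines():
        # Pass comments and sentence-separating blank lines through unchanged.
        if not line or line.startswith("#"):
            out_lines.append(line)
            continue
        cols = line.split("\t")
        # Skip multiword tokens ("1-2") and empty nodes ("1.1"), which
        # never carry a HEAD value in CoNLL-U.
        if len(cols) == 10 and cols[0].isdigit():
            if cols[6] == "_":
                cols[6] = str(int(cols[0]) - 1)  # HEAD: previous token (0 = root)
            if cols[7] == "_":
                cols[7] = "dep"                   # DEPREL: placeholder label
        out_lines.append("\t".join(cols))
    return "\n".join(out_lines)
```

Applied to a file like the example above (but with "_" in the HEAD column), this produces exactly the right-branching dummy trees shown: token 1 attaches to the root (0), token 2 to token 1, and so on.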