tokenisation in texts

flammie commented 6 years ago

I was just testing some udpipe-loaded omorfi with some brittle scripts and ran sanity checks on the training data, but since I used the texts from the comments I ran into tokenisation alignment problems, e.g. at the beginning of fi-ud-train.conllu I get

# text = Vähän samanlainen tunne kuin silloin, kun ystävämme vei meidät kerran ylöstuomiokirkon torniin.
12  ylös    ylös    ADV Adv _   14  advmod  _   SpaceAfter=No
13  tuomiokirkon    tuomio#kirkko   NOUN    N   Case=Gen|Number=Sing    14  nmod:poss   _   _

so everything gets out of whack. I guess my question is whether it's plausible to have a tool that tries tokenising harder, or whether I should just start with the pre-existing tokenisation? I guess split heuristics would not be too hard to implement, at least.

jnivre commented 6 years ago

This is maybe not directly related to your question, but is it a typo that "ylös" and "tuomiokirkon" are run together without space? Otherwise, it seems that "ylöstuomiokirkon" should have been represented as a multiword token. That is:

12-13   ylöstuomiokirkon    _   _   _   _   _   _   _   _
12  ylös    ylös    ADV Adv _   14  advmod  _   SpaceAfter=No
13  tuomiokirkon    tuomio#kirkko   NOUN    N   Case=Gen|Number=Sing    14  nmod:poss   _   _
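
For comparison, here is a minimal sketch of how the surface text can be rebuilt from either encoding. It assumes tab-separated, 10-column CoNLL-U token lines; the function name `surface_text` is hypothetical and not part of any UD tool. Both the `SpaceAfter=No` encoding and the multiword-token encoding yield the same string, so the difference only matters for how the words are interpreted, not for the text itself.

```python
def surface_text(token_lines):
    """Rebuild the sentence text from tab-separated CoNLL-U token lines."""
    parts = []
    skip_until = 0  # last word ID covered by the current multiword token
    for line in token_lines:
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "." in tok_id:
            continue  # empty nodes are not part of the surface text
        if "-" in tok_id:
            # Multiword token range such as "12-13": use its form and
            # skip the syntactic words it covers.
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()


def row(*cols):
    """Helper for the example: join columns with tabs."""
    return "\t".join(cols)


# Encoding used in the treebank: two tokens, the first with SpaceAfter=No.
typo_encoding = [
    row("12", "ylös", "ylös", "ADV", "Adv", "_", "14", "advmod", "_", "SpaceAfter=No"),
    row("13", "tuomiokirkon", "tuomio#kirkko", "NOUN", "N",
        "Case=Gen|Number=Sing", "14", "nmod:poss", "_", "_"),
]

# Alternative encoding: a multiword token range covering both words.
mwt_encoding = [row("12-13", "ylöstuomiokirkon", *["_"] * 8)] + typo_encoding

print(surface_text(typo_encoding))  # ylöstuomiokirkon
print(surface_text(mwt_encoding))   # ylöstuomiokirkon
```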

fginter commented 6 years ago

ylöstuomiokirkon is a pure typo, so I guess the way this is encoded in conllu is correct.

jnivre commented 6 years ago

Yep. That's what I suspected.

fginter commented 6 years ago

@flammie - one can try tokenizing harder, but a splitting heuristic will not be totally trivial to develop without having it erroneously split plenty of valid compounds. The tokenization in the data is manually checked, so if you want to avoid these kinds of problems, use the pre-tokenized version. Of course in real life, you will then hit these typos, etc. :)
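
As an illustration of starting from the pre-existing tokenisation, the following sketch maps the gold token forms back onto the whitespace-delimited chunks of the `# text` line, which shows where a naive whitespace tokeniser and the treebank disagree. The function name `align_to_whitespace` is hypothetical, and the sketch assumes that no gold token contains an internal space and that the forms concatenate to the text with whitespace removed.

```python
def align_to_whitespace(text, forms):
    """Group gold token forms by the whitespace chunk they belong to.

    Returns a list of (chunk, [forms]) pairs. Raises ValueError when the
    forms do not line up with the text.
    """
    groups = []
    i = 0  # index of the next unconsumed gold form
    for chunk in text.split():
        covered = []
        rest = chunk
        while rest:
            if i >= len(forms) or not forms[i] or not rest.startswith(forms[i]):
                raise ValueError(f"cannot align {chunk!r} at gold token {i}")
            rest = rest[len(forms[i]):]
            covered.append(forms[i])
            i += 1
        groups.append((chunk, covered))
    if i != len(forms):
        raise ValueError("unused gold tokens remain")
    return groups


# Tail of the sentence from the issue, with the gold forms from the treebank.
text = "meidät kerran ylöstuomiokirkon torniin."
forms = ["meidät", "kerran", "ylös", "tuomiokirkon", "torniin", "."]

for chunk, covered in align_to_whitespace(text, forms):
    if len(covered) > 1:
        print(chunk, "->", covered)
# ylöstuomiokirkon -> ['ylös', 'tuomiokirkon']
# torniin. -> ['torniin', '.']
```

Chunks that map to more than one gold token are the places where a tool working from the raw text would need either the gold segmentation or a splitting heuristic. Trailing punctuation shows up here too, so the typo cases would still need to be separated from ordinary punctuation splits.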

UniversalDependencies / UD_Finnish-TDT

tokenisation in texts #4