EmilStenstrom / conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested Python dictionary.
MIT License

Enhanced dependencies fail #40

Closed mdelhoneux closed 4 years ago

mdelhoneux commented 4 years ago

I found a new case in the Arabic training set where the regular expression for parsing enhanced representations fails; see https://github.com/mdelhoneux/conllu/blob/cfc45fb5e52e4fe714472a2002464db4c6876cec/tests/test_parser.py#L477. I have not yet managed to fix this without breaking other tests.
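
For context, enhanced dependencies live in the DEPS column of a CoNLL-U line as `head:relation` pairs separated by `|`. The relation part can itself contain colons and non-ASCII characters, which is where a too-strict regular expression tends to break. Below is a minimal sketch of one tolerant way to split such a value. It is not conllu's actual implementation, the function name `parse_enhanced_deps` is hypothetical, and the sample values are illustrative rather than the failing case from the linked test.

```python
from typing import List, Optional, Tuple, Union

# A head is an int for ordinary tokens, or a string such as "8.1" for empty nodes.
Head = Union[int, str]


def parse_enhanced_deps(value: str) -> Optional[List[Tuple[Head, str]]]:
    """Split a DEPS value such as "2:nsubj|4:conj:and" into (head, relation) pairs.

    Hypothetical helper for illustration; not part of the conllu library.
    """
    if value == "_":  # "_" marks an empty field in CoNLL-U
        return None

    pairs: List[Tuple[Head, str]] = []
    for item in value.split("|"):
        # Split on the first colon only: the relation may itself contain colons.
        head, sep, relation = item.partition(":")
        if not sep or not head or not relation:
            raise ValueError(f"Malformed DEPS item: {item!r}")
        # Plain integer heads become ints; empty-node ids like "8.1" stay strings.
        pairs.append((int(head) if head.isdigit() else head, relation))
    return pairs


# Illustrative values (assumed, not taken from the Arabic treebank).
print(parse_enhanced_deps("2:nsubj|4:conj:and"))
print(parse_enhanced_deps("3:nmod:مِن:gen"))
```

Splitting on the first colon instead of matching the relation against a character class means the sketch makes no assumption about which scripts or how many subtype segments a relation may contain.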

EmilStenstrom commented 4 years ago

Thank you! I'll have a look when I have time.

More generally: Are you going through all languages and parsing them with conllu? Seems like something I should do myself to ensure compatibility with the format :)

mdelhoneux commented 4 years ago

Yes, for the languages that have enhanced dependencies (as part of the IWPT 2020 shared task: https://universaldependencies.org/iwpt20/). Thanks!

EmilStenstrom commented 4 years ago

@mdelhoneux Thanks again! I managed to fix this in https://github.com/EmilStenstrom/conllu/commit/5d4df35cd1900e42e093f38e5f5d9cb813ad1993

I haven't had time to parse all the IWPT 2020 datasets, so feel free to add more failing tests if you find any!

jbrry commented 4 years ago

My script was failing for the treebanks ar_padt and ta_ttb. Since installing this latest version, my dataset reader can parse both of these treebanks now. Thanks both for your help!

EmilStenstrom commented 4 years ago

@Jbar-ry Great to hear! Let me know if you run into any more issues!