UppsalaNLP / uuparser

A transition-based parser for Universal Dependencies with BiLSTM word and character representations.
Apache License 2.0

scan for treebanks rather than reading json file #1

Closed jowagner closed 6 years ago

jowagner commented 6 years ago

Add support for reading the list of supported iso_ids and the long folder names directly from the treebank datadir rather than requiring the user to maintain a json file. Make this the default and provide an option to request the old behaviour.
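A minimal sketch of what such a scan could look like, assuming the usual UD release layout where each treebank folder is named UD_<Language>-<Name> and contains files named <iso_id>-ud-<split>.conllu. The function name scan_treebanks is hypothetical and is not taken from the PR.

```python
from pathlib import Path


def scan_treebanks(datadir):
    """Map each iso_id (e.g. 'en_ewt') to its treebank folder name.

    Assumes the layout <datadir>/UD_*/<iso_id>-ud-<split>.conllu.
    """
    treebanks = {}
    for folder in sorted(Path(datadir).glob("UD_*")):
        if not folder.is_dir():
            continue
        # Any split (train, dev or test) is enough to identify the iso_id.
        for conllu in sorted(folder.glob("*-ud-*.conllu")):
            iso_id = conllu.name.split("-ud-")[0]
            treebanks[iso_id] = folder.name
    return treebanks
```

Matching on any split rather than only the training file means test-only treebanks are picked up as well.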

jowagner commented 6 years ago

I understand. The rationale for the PR is that the treebank description should not live inside the parser's source tree. A parser should work regardless of the treebank version and with new treebanks, and users shouldn't need to modify code. As you can see, the PR includes lines that output the list of supported treebank IDs if the user specifies an ID that is not in the json file. That was part of the debugging to understand why the parser wouldn't accept my treebank IDs.
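The ID check described above, which lists the supported treebank IDs when an unknown one is given, could look roughly like this. This is only a sketch: the function names are hypothetical, and it assumes the json file is a flat object mapping iso_id to folder name, which may differ from the file the parser actually uses.

```python
import json


def load_treebank_map(json_path):
    """Read a JSON object mapping iso_id -> folder name (assumed format)."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def resolve_treebank(iso_id, treebank_map):
    """Return the folder name for iso_id, or fail listing the known IDs."""
    try:
        return treebank_map[iso_id]
    except KeyError:
        supported = ", ".join(sorted(treebank_map))
        raise ValueError(
            f"Unknown treebank ID {iso_id!r}. Supported IDs: {supported}"
        ) from None
```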

I like the alternative you are hinting at: including a tool that generates the json file from a treebank folder. This should be combined with a parser option for the json file location.

However, since the set of UD treebanks will keep changing, I think the default should be either to require the user to specify the json file (the documentation could say that a sample json file for UD treebank version x.y is included in folder z) or to scan the folder as in this PR.

mdelhoneux commented 6 years ago

Yep, I understand your point of view! In general, we need to improve the documentation of this parser but definitely thanks for pointing this out!

mdelhoneux commented 6 years ago

Sorry for the delay in updating you on this. We have now added an option --json-isos to specify the json file (see commit 530e5e29f5637fce10185d0cff2d08ae77eb5171) and have also added the script used to create that json file for a new UD release (see commit 163c886683f6e1774094c38fe6030f9266fee709). Usage:

python scripts/create_json_file.py UD_data_directory output_json_file