Hyperparticle / LemmaTag

A neural network that jointly part-of-speech tags and lemmatizes sentences, boosting accuracy for morphologically-rich languages (Czech, Arabic, etc.)
https://arxiv.org/abs/1808.03703
MIT License
34 stars 3 forks

Create script to automatically download Universal Dependencies datasets #1

Closed Hyperparticle closed 6 years ago

Hyperparticle commented 6 years ago

A nice convenience would be a script that automatically downloads any UD dataset and preprocesses it for training. Ideally, the user would select a language and dataset_name, and the script would locate the corresponding GitHub repo and download the data with urllib. This may require some web scraping to find the GitHub page, and a regex to match the train, dev, and test files.
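A minimal sketch of what such a script could look like, not an implementation from this repository. It makes several assumptions: treebank repos live under the UniversalDependencies GitHub organization and are named `UD_<Language>-<Treebank>`, data files are named like `en_ewt-ud-train.conllu`, and the `master` branch holds the data you want. The function names (`repo_name`, `match_splits`, `list_repo_files`, `download_dataset`) are hypothetical. Instead of scraping HTML, it lists files through the GitHub contents API, which is rate limited for unauthenticated requests.

```python
import json
import re
import urllib.request
from pathlib import Path

# Assumption: UD treebank repos live under this GitHub organization and are
# named UD_<Language>-<Treebank>, e.g. UD_English-EWT.
GITHUB_ORG = "UniversalDependencies"

# Assumption: each split is a single file named like en_ewt-ud-train.conllu.
SPLIT_PATTERN = re.compile(r"^[a-z0-9_]+-ud-(train|dev|test)\.conllu$")


def repo_name(language, dataset_name):
    """Build the assumed repo name from a language and a treebank name."""
    return f"UD_{language.replace(' ', '_')}-{dataset_name}"


def match_splits(filenames):
    """Map each split ("train", "dev", "test") to its file name, if present."""
    splits = {}
    for name in filenames:
        match = SPLIT_PATTERN.match(name)
        if match:
            splits[match.group(1)] = name
    return splits


def list_repo_files(repo, branch="master"):
    """List top-level entry names of a repo via the GitHub contents API."""
    url = f"https://api.github.com/repos/{GITHUB_ORG}/{repo}/contents?ref={branch}"
    with urllib.request.urlopen(url) as response:
        return [entry["name"] for entry in json.load(response)]


def download_dataset(language, dataset_name, out_dir="data", branch="master"):
    """Download every matched split into out_dir/<repo>/ and return the paths."""
    repo = repo_name(language, dataset_name)
    splits = match_splits(list_repo_files(repo, branch))
    target = Path(out_dir) / repo
    target.mkdir(parents=True, exist_ok=True)
    paths = {}
    for split, name in splits.items():
        url = f"https://raw.githubusercontent.com/{GITHUB_ORG}/{repo}/{branch}/{name}"
        paths[split] = target / name
        urllib.request.urlretrieve(url, paths[split])
    return paths
```

Calling `download_dataset("English", "EWT")` would then fetch whichever of the three splits the repo contains. Treebanks that spread one split over several files would need a looser pattern, and preprocessing for training is left out entirely.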

foxik commented 6 years ago

If I recall correctly, the only official source of UD data is the LINDAT release; the GitHub repos are usually used only for development (i.e., they are not required to contain a branch or tag with the latest release).

@dan-zeman Am I right, or is it possible to get the stable releases from GitHub?

Hyperparticle commented 6 years ago

@foxik I was not aware that UD is hosted on LINDAT. Going through the UD website, I could not find any links to LINDAT datasets; I only saw the GitHub repos.

dan-zeman commented 6 years ago

Hmm, maybe we should think of making this more explicit and visible on the UD website. I can see how you can overlook it if you are looking just for one language and never go further once you click on the language... But in fact, the information is quite explicit on the title page below the flags. If you scroll far enough, or if you hit CTRL+F and type "download", you will end up at the Download section and see the link to Lindat. Note that you get all languages in one big package; you cannot download just one selected language.

Otherwise, it is actually possible to get stable releases from GitHub, although it is not the preferred way (because we want download statistics in one place, i.e., Lindat). Once we learned that some people just take their data from GitHub and write papers about it, we reversed the branch logic, and now we try to make sure that the contents of the master branch of each repo always correspond to the most recent official release, while all fixes in the meantime happen in the dev branch. You still don't have 100% certainty that you get the right data: a treebank may have been released in the past, then become invalid due to stricter validation rules, not been fixed, and so not been included in the last release.

Hyperparticle commented 6 years ago

@dan-zeman Ah, I never noticed the download section at the bottom of the page, thanks! This should simplify things. And I agree, the download section should be more prominent (perhaps mentioned at or moved to the top of the page).

dan-zeman commented 6 years ago

I have added a link to the Download section from each treebank's section. I hope this makes it easier to find in the future.