dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Add support for loading/caching corpora/datasets #911

Closed reckart closed 7 years ago

reckart commented 8 years ago

Add support for loading/caching corpora/datasets. A new module could be added that provides some class with methods to download (and cache) publicly accessible datasets e.g. for use in unit tests, but potentially also for experiments.

Done

See also

reckart commented 8 years ago

@Horsmann I just noticed your mail to the corpora list. A few days ago, I started adding a DatasetLoader class to DKPro Core for some local experiments. Prompted by your mail, I have now committed it. It would be great if we could extend this together to cover additional datasets.

Horsmann commented 8 years ago

Yes, sure :). Can any corpus that is directly reachable via a link be added? E.g. Tiger: http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/download/start.html or http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/download/tigercorpus-2.2.xml.tar.gz

The second question would be: in which data format? Do we add the same corpus multiple times if it is available in several formats?

reckart commented 8 years ago

Right now, the class only handles downloading the files, verifying their checksums, and caching them. If a file is e.g. an archive with multiple corpus formats, it makes no difference. The loader only returns the path where it has downloaded the dataset.
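Just to illustrate the basic pattern (a rough sketch with invented names, not the actual DatasetLoader code), the core of it is something like:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;

// Rough sketch of the download/verify/cache idea - class and method names
// are invented for illustration and do not match the actual DatasetLoader.
public class SimpleDatasetLoader {
    private final Path cacheRoot =
            Paths.get(System.getProperty("user.home"), ".dkpro", "datasets");

    public Path fetch(String datasetName, URL source, String expectedSha1)
            throws Exception {
        String fileName = source.getPath()
                .substring(source.getPath().lastIndexOf('/') + 1);
        Path target = cacheRoot.resolve(datasetName).resolve(fileName);

        // Only download if the file is missing or its checksum does not match
        if (!Files.exists(target) || !sha1(target).equalsIgnoreCase(expectedSha1)) {
            Files.createDirectories(target.getParent());
            try (InputStream in = source.openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            if (!sha1(target).equalsIgnoreCase(expectedSha1)) {
                throw new IOException("Checksum mismatch for " + source);
            }
        }

        // The loader only returns the location; interpreting the content
        // (e.g. unpacking an archive) is up to the caller.
        return target;
    }

    private static String sha1(Path file) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        digest.update(Files.readAllBytes(file));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}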

I am considering various ways of improving this very simple approach, e.g. by wrapping the dataset in a class with convenience methods.

I also considered adding a method that returns a CollectionReaderDescription, but that would basically mean that the dataset module may end up depending on all the IO modules, which I wouldn't want.

So for the moment, I would suggest just adding new datasets that can be downloaded and then see if there are any commonalities that can be abstracted further.

About Tiger: I don't know. The website is clearly built in such a way that the user is expected to see the license agreement before downloading the corpus, so I am somewhat hesitant to download it via a deep link. However, if we also downloaded the license text (and maybe eventually exposed it via a getter)...

Horsmann commented 8 years ago

I added the Brown Corpus in TEI format; I hope that isn't a problem? I am retrieving it from the NLTK website. Do you want to add only well-known datasets? One could add countless smaller corpora for whatever purpose, but I wonder who would actually use them. Maybe this is useful as a getting-started aid for beginners, so they can quickly get their hands on some full-size corpora.

reckart commented 8 years ago

@Horsmann I have no reservations against adding most of the publicly accessible resources.

I only harbour doubts about resources that are obviously meant to be downloaded only after agreeing to a license, i.e. that we could fetch only via deep linking. Tiger is such a case. I think there are some others where you have to fill in a form and are then taken to a page which is in principle publicly accessible (i.e. there is no authentication and you can access it directly if you know the URL), but which you would normally never reach without first filling in the form.

NLTK resources are IMHO absolutely no problem.

reckart commented 8 years ago

@Horsmann as for downloading from GitHub: there is no need to clone the repo. You can download any state of a repository via a ZIP URL, e.g.

https://github.com/amir-zeldes/gum/archive/747b4d51b843fa09e3c3f4af58b48820c34fb0ca.zip
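As far as I can tell, such URLs follow GitHub's archive pattern https://github.com/<owner>/<repo>/archive/<commit-or-branch>.zip, so constructing one for a pinned commit is trivial:

// Build a GitHub archive URL for a specific commit, branch, or tag
// (owner/repo/commit taken from the example above).
String owner = "amir-zeldes";
String repo = "gum";
String ref = "747b4d51b843fa09e3c3f4af58b48820c34fb0ca";
String zipUrl = String.format("https://github.com/%s/%s/archive/%s.zip", owner, repo, ref);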

reckart commented 8 years ago

@Horsmann The Brown Corpus TEI XML dataset seems to include a big file (Corpus.xml) and various small files. Which of these do you normally use?

Horsmann commented 8 years ago

Oh, right, the archive contains the corpus twice. The big file Corpus.xml (2x MB) is the full corpus as a single file, and the smaller files are the individual chapters; together, the small files are equivalent to Corpus.xml. I guess we could just delete Corpus.xml, which allows more flexibility in choosing which parts one wants to use.

reckart commented 8 years ago

@Horsmann ok. I have started converting the loading of the Brown corpus to the improved Dataset API - but this needs a bit more work.

Based on your feedback, I would suggest that we add the small files to the trainingSet and leave testSet and developmentSet empty.

I would also suggest that the loader method then returns only a single Dataset which covers the small files. In the case of the UD, I actually return a list of Datasets, one for each language contained in the UD.
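Roughly, I imagine the Dataset wrapper along these lines (just a sketch, the method names are not final and may differ from what ends up in the datasets module):

import java.io.File;
import java.util.List;

// Sketch only - the point is just the split-oriented accessors discussed above.
public interface Dataset {
    String getName();

    // For the Brown TEI data, all the small chapter files would go here.
    List<File> getTrainingFiles();

    // Empty for corpora (like Brown) that do not ship a predefined split.
    List<File> getTestFiles();
    List<File> getDevelopmentFiles();

    // License text downloaded along with the data (cf. the Tiger discussion).
    File getLicenseFile();
}

A loader for the UD would then simply return a List<Dataset> with one entry per language.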

Horsmann commented 8 years ago

@reckart I added a new corpus, but the fetch code does not seem to wait until the download is finished. The file being downloaded is quite large (45 MB), but only a small part of it gets downloaded, and the code then fails because the archive is not in the expected file format. I am not sure what the problem is :/

reckart commented 8 years ago

Network problems? I'll have a look. Meanwhile, please check your Eclipse settings for XML formatting. They should correspond to the settings shown in the screenshot below to avoid messing up the license header in the XML files. [screenshot: Eclipse XML formatter settings, 2016-08-06]

reckart commented 8 years ago

Well, the handle URL is not the final download location - it redirects (HTTP 303) to an https URL, and Java's URL class does not follow that redirect. So we need to use the final URL.

$ curl -I http://hdl.handle.net/11022/0000-0000-91AE-8
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: https://corpora.uni-hamburg.de:8443/fedora/objects/file:hdt_hdt-conll/datastreams/hdt-conll-tar-xz/content?asOfDateTime=2016-02-17T15:38:47.643Z&download=true
Expires: Sun, 07 Aug 2016 16:44:29 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 417
Date: Sat, 06 Aug 2016 16:44:29 GMT
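If we wanted to keep using the handle URL, we could also resolve the redirect chain ourselves before downloading - just a sketch of the idea, not necessarily what we should do:

import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: follow Location headers manually to find the final download URL.
// HttpURLConnection does not automatically follow redirects that switch
// protocols (here http -> https), hence the manual loop.
public static URL resolveRedirects(URL url) throws Exception {
    URL current = url;
    for (int i = 0; i < 10; i++) { // guard against redirect loops
        HttpURLConnection conn = (HttpURLConnection) current.openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setRequestMethod("HEAD");
        int status = conn.getResponseCode();
        String location = conn.getHeaderField("Location");
        conn.disconnect();
        if (status >= 300 && status < 400 && location != null) {
            current = new URL(current, location); // handles relative Location values
        } else {
            return current;
        }
    }
    throw new IllegalStateException("Too many redirects: " + url);
}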