Closed reckart closed 7 years ago
@Horsmann I just noticed your mail to the corpora list. A few days ago, I started adding a DatasetLoader class to DKPro Core for some local experiments. Prompted by your mail, I have now committed it. It would be great if we could extend this together with additional datasets.
Yes, sure :). Can any corpus that is directly reachable via a link be added? E.g. Tiger: http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/download/start.html or http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/download/tigercorpus-2.2.xml.tar.gz
The second question would be: in which data format? Do we add the same corpus multiple times if it is available in various formats?
Right now, the class only handles downloading the files, verifying their checksums, and caching them. If a file is e.g. an archive with multiple corpus formats, it makes no difference. The loader only returns the path where it has downloaded the dataset.
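For illustration, the checksum step could look roughly like the minimal sketch below. The class and method names are hypothetical, and the actual loader may use a different digest algorithm; this only shows the idea of validating a cached download against a known digest.

```java
import java.security.MessageDigest;

// Hypothetical sketch of download verification: compare the digest of the
// cached bytes against the expected checksum before trusting the cache.
public class ChecksumCheck {
    static String sha1Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            // Mask to 0..255 so negative bytes render as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    static boolean verify(byte[] downloaded, String expectedSha1) throws Exception {
        // If the digest matches, the cached file can be reused without
        // re-downloading.
        return sha1Hex(downloaded).equalsIgnoreCase(expectedSha1);
    }
}
```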
I am considering various ways of improving this very simple approach, e.g. by wrapping the dataset in a class with methods such as:
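For illustration, a minimal sketch of what such a wrapper could look like. The method names here are assumptions chosen for the example, not the actual DKPro Core API:

```java
import java.io.File;

// Hypothetical dataset wrapper; method names are illustrative assumptions.
interface Dataset {
    String getName();
    File[] getDataFiles();      // all files belonging to the dataset
    File[] getLicenseFiles();   // license text downloaded along with the data
}
```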
I also considered adding a method that returns a CollectionReaderDescription, but that would basically mean that the dataset module may end up depending on all the IO modules, which I wouldn't want.
So for the moment, I would suggest just adding new datasets that can be downloaded and then see if there are any commonalities that can be abstracted further.
About Tiger: I don't know. The website is clearly built in such a way that the user is expected to see the license agreement before downloading the corpus. I am somewhat hesitant to download it via a deep link. However, if we also downloaded the license text (and maybe eventually exposed it via a getter)...
I added the Brown Corpus in TEI format. I hope this isn't a problem? I am retrieving it from the NLTK website. Do you want to add only well-known data sets? I think you could add countless smaller corpora for whatever purpose, but I am wondering who would want to use them.
Maybe as a getting-started aid for beginners this might be useful, to get their hands on some full-size corpora quickly.
@Horsmann I have no reservations against adding most of the publicly accessible resources.
I harbour some doubts only about resources that are obviously not meant to be downloaded before agreeing to a license (i.e. only reachable via deep linking). Tiger is such a case. I think there are some others where you have to fill in a form and are then taken to a page which is in principle publicly accessible (i.e. there is no authentication and you can access it directly if you know the URL), but you would normally never get there without first filling in the form.
NLTK resources are IMHO absolutely no problem.
@Horsmann as for downloading from GitHub: there is no need to clone the repo. You can download any state of a repository via a ZIP URL, e.g.
https://github.com/amir-zeldes/gum/archive/747b4d51b843fa09e3c3f4af58b48820c34fb0ca.zip
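Such an archive URL can be assembled from the repository coordinates and a commit hash, which pins the download to an exact repository state. The helper below is a hypothetical illustration, not part of the actual loader:

```java
// Hypothetical helper: build a GitHub archive URL for a repo at a fixed
// commit, so the downloaded dataset is reproducible.
public class GitHubArchive {
    static String zipUrl(String user, String repo, String commit) {
        return "https://github.com/" + user + "/" + repo
                + "/archive/" + commit + ".zip";
    }
}
```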
@Horsmann The Brown Corpus TEI XML dataset seems to include a big file (Corpus.xml) and various small files. Which of these do you normally use?
Oh right, the file contains the corpus twice. The big file Corpus.xml (2x MB) is the full corpus as a single file; the smaller ones are the individual chapters. Together, the small files are identical to Corpus.xml. I guess we could just delete Corpus.xml; that would allow more flexibility in choosing which parts one wants to use.
@Horsmann ok. I have started converting the loading of the Brown corpus to the improved Dataset API - but this needs a bit more work.
Based on your feedback, I would suggest we add the small files to the trainingSet and leave testSet and developmentSet empty.
I would also suggest that we then return only a single Dataset from the loader method, covering the small files. In the case of UD, I actually return a list of Datasets, one for each language contained in the UD release.
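To illustrate the contrast between the two loader shapes, here is a hypothetical sketch; the types and method names are assumptions for the example, not the real API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical loader shapes: Brown yields a single Dataset, while the UD
// loader yields one Dataset per language contained in the release.
public class DatasetLoaderSketch {
    interface Dataset {
        String getLanguage();
    }

    // Brown-style: one dataset covering all the small files.
    static Dataset loadBrownCorpus() {
        return () -> "en";
    }

    // UD-style: one Dataset per language (two dummy languages here).
    static List<Dataset> loadUniversalDependencies() {
        return Arrays.asList(() -> "en", () -> "de");
    }
}
```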
@reckart I added a new corpus, but the fetch code does not seem to wait until the download is finished. The file being downloaded is quite large (45 MB), but only a small part of it is downloaded, and then it fails because the archive is not in the expected file format. I am not sure what the problem is :/
Network problems? I'll have a look. Meanwhile, please check your Eclipse settings for XML formatting. They should correspond to these (below) to avoid messing up the license header in the XML files.
Well, the handle URL is not the final download location - it redirects. Java's URLConnection does not follow this redirect (it crosses protocols, from http to https). So we need to resolve and use the final URL.
```
$ curl -I http://hdl.handle.net/11022/0000-0000-91AE-8
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: https://corpora.uni-hamburg.de:8443/fedora/objects/file:hdt_hdt-conll/datastreams/hdt-conll-tar-xz/content?asOfDateTime=2016-02-17T15:38:47.643Z&download=true
Expires: Sun, 07 Aug 2016 16:44:29 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 417
Date: Sat, 06 Aug 2016 16:44:29 GMT
```
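One way to deal with this is to resolve the redirect manually before downloading. The sketch below is an illustration of that idea; only the status codes in `isRedirect` are fixed HTTP semantics, while the class and method names are assumptions:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: manually follow the handle server's redirect, since
// HttpURLConnection does not follow redirects across protocols
// (here http:// to https://).
public class RedirectResolver {
    static boolean isRedirect(int status) {
        // 301, 302, 303, 307, 308 all carry a Location header to follow.
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    static String resolve(String url) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setInstanceFollowRedirects(false);
        con.setRequestMethod("HEAD");
        if (isRedirect(con.getResponseCode())) {
            // The Location header holds the final download URL.
            return con.getHeaderField("Location");
        }
        return url;
    }
}
```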
Add support for loading/caching corpora/datasets. A new module could be added that provides a class with methods to download (and cache) publicly accessible datasets, e.g. for use in unit tests, but potentially also for experiments.
Done
- `data` role to cover all data files. `getAllData()` should use that info then instead of aggregating over train/test/dev.
- `split` action should remain inside artifact / check if it can be improved to take into account the `explode` action such that the relative paths are the same for both. Maybe `explode` could update some kind of base-dir information for the artifact that `split` can then pick up -> no longer an action but a default method on the Dataset interface.
- `META-INF/org.dkpro.core/datasets.txt`
- `encoding` field is there, but most datasets still lack encoding information

See also