dowobeha / ldc_downloader

Script to download corpora from the Linguistic Data Consortium (LDC)
GNU General Public License v3.0
31 stars 10 forks source link

LDC package downloading tool

Bash script by Lane Schwartz (dowobeha@gmail.com)

based on Python code (https://github.com/jonmay/ldcdl) by Jonathan May (jonmay@isi.edu)

04 March 2016

Requirements:

Usage:

$ ./download-ldc-corpora LDC99T42
2016-26-04 16:26:49 UTC-06  Enter LDC login credentials                                       ...

Enter email address: you@somewhere.edu
Enter Password: 

2016-26-04 16:26:54 UTC-06  Accessing LDC login page                                          ...
2016-26-04 16:26:54 UTC-06  Logging in to your LDC account                                    ...
2016-26-04 16:26:55 UTC-06  Accessing list of your LDC corpora                                ...

A list of the LDC corpora associated with your account has been saved to corpora.tsv

2016-26-04 16:26:56 UTC-06  Downloading LDC99T42 from https://catalog.ldc.upenn.edu/download/...

Subsequent runs will use the credentials cached in cookies.txt, if that file exists.

So, to kick off a job in the background to download all of your LDC corpora:

$ ./download-ldc-corpora
2016-26-04 16:26:49 UTC-06  Enter LDC login credentials                                       ...

Enter email address: you@somewhere.edu
Enter Password: 

2016-26-04 16:26:54 UTC-06  Accessing LDC login page                                          ...
2016-26-04 16:26:54 UTC-06  Logging in to your LDC account                                    ...
2016-26-04 16:26:55 UTC-06  Accessing list of your LDC corpora                                ...

A list of the LDC corpora associated with your account has been saved to corpora.tsv

$ nohup ./download-ldc-corpora $(cut -f 1 corpora.tsv) &> download-ldc-corpora.log &

Known Issues