dowobeha / ldc_downloader

Script to download corpora from the Linguistic Data Consortium (LDC)
GNU General Public License v3.0
31 stars 10 forks

file name munging #1

Open jonmay opened 8 years ago

jonmay commented 8 years ago

Filenames created by this script are somewhat abnormal.

For example, LDC2016E75 is listed in the 'file name' column of the LDC downloads page (an imperfect guess at the true filename that the web interface would download) as 'LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered', but this script downloads it as 'LDC2016E75__LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame__LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered.tgz'

Note a) the doubling of the entry, and b) the extra underscore in first usage.

Additionally, the script seems to be hard-coded to produce .tgz files, but not all files come from LDC as .tgz. This is mostly a bug in LDC's presentation, since I haven't found a way to predict ahead of time what the file name will be; a simple kludge in the Python version of this script is to allow the user to determine the filename.

dowobeha commented 8 years ago

Thanks for the issue. What would your suggested resolution look like?

dowobeha commented 8 years ago

Here's the relevant section:

    TSV_LINE=$(grep "${LDC_CORPUS}" "${DOWNLOAD_FILE}")
    CORPUS_URL=$(cut -f 5 <<< "${TSV_LINE}")
    CORPUS_NAME=$(cut -f 2 <<< "${TSV_LINE}" | tr ' ' '_')
    CORPUS_FILE=$(cut -f 6 <<< "${TSV_LINE}" | sed 's,.tgz$,,')
    LDC_CORPUS_FILENAME="${LDC_CORPUS}__${CORPUS_NAME}__${CORPUS_FILE}.tgz"
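
One possible direction, as a rough sketch only: build the name from the 'file name' column alone, prepend the corpus ID only when it is not already there, do not force a .tgz suffix, and let the user override the result. The function `ldc_filename` and its argument order are hypothetical and not part of the existing script; the sketch also assumes that column 6 of the TSV holds LDC's guess at the filename, as in the section above.

```bash
# Hypothetical helper, not part of ldc_downloader.
# $1: corpus ID, e.g. LDC2016E75
# $2: value of the TSV 'file name' column (LDC's guess at the filename)
# $3: optional filename chosen by the user; if given, it wins
ldc_filename() {
    local corpus_id="$1"
    local listed_file="$2"
    local override="${3:-}"

    if [ -n "${override}" ]; then
        # The user knows the real name (and extension) better than we do.
        printf '%s\n' "${override}"
        return
    fi

    case "${listed_file}" in
        "${corpus_id}"*)
            # Listed name already starts with the corpus ID: avoid doubling it.
            printf '%s\n' "${listed_file}"
            ;;
        *)
            # Otherwise prefix the ID with a single underscore.
            printf '%s_%s\n' "${corpus_id}" "${listed_file}"
            ;;
    esac
}

# Usage with the corpus from this issue:
ldc_filename "LDC2016E75" \
    "LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered"
# prints LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered
```

Because no extension is appended, a listed name without a suffix yields a file without one; passing a third argument is the escape hatch for corpora whose real extension is known to the user.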