file name munging

jonmay commented 8 years ago

Filenames created by this script are somewhat abnormal.

E.g., LDC2016E75, which the 'file name' column of the LDC downloads page lists as 'LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered' (an imperfect guess at the true filename that the web interface would download), is downloaded by this script as 'LDC2016E75__LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame__LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered.tgz'

Note (a) the doubling of the entry and (b) the extra underscore in the first usage.

Additionally, the script seems to be hard-coded to produce .tgz files, but not all files come from LDC as .tgz. This is mostly a bug in LDC's presentation, since I haven't found a way to predict ahead of time what the file name will be; a simple kludge in the Python version of this script is to allow the user to determine the filename.
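A minimal sketch of that kludge in shell, assuming a hypothetical helper `choose_output_name` (not part of ldc_downloader) that takes the corpus ID and an optional user-supplied filename. If the user gives a name, it is used verbatim, extension included; otherwise the sketch falls back to the bare corpus ID rather than guessing at `.tgz`. The example filename below is made up.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
#   $1 = corpus ID (e.g. LDC2016E75)
#   $2 = optional filename chosen by the user
choose_output_name() {
    corpus_id="$1"
    user_name="${2:-}"
    if [ -n "${user_name}" ]; then
        # The user decides the name and the extension.
        printf '%s\n' "${user_name}"
    else
        # No reliable way to predict the real name: use the ID, no extension.
        printf '%s\n' "${corpus_id}"
    fi
}

choose_output_name "LDC2016E75" "my_corpus.zip"
choose_output_name "LDC2016E75"
```

The fallback is only one option; a script could just as well refuse to run without an explicit filename.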

dowobeha commented 8 years ago

Thanks for the issue. What would your suggested resolution look like?

dowobeha commented 8 years ago

Here's the relevant section:

TSV_LINE=$(grep "${LDC_CORPUS}" "${DOWNLOAD_FILE}")
CORPUS_URL=$(cut -f 5 <<< "${TSV_LINE}")
CORPUS_NAME=$(cut -f 2 <<< "${TSV_LINE}" | tr ' ' '_')
CORPUS_FILE=$(cut -f 6 <<< "${TSV_LINE}" | sed 's,.tgz$,,')
LDC_CORPUS_FILENAME="${LDC_CORPUS}__${CORPUS_NAME}__${CORPUS_FILE}.tgz"
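To see how this produces the name reported above, here is a rough reproduction. The TSV row is an assumption: only the corpus name (field 2) and file name (field 6) are taken from the report, the other fields are placeholders, and the real downloads file may differ. The function `build_name` is a hypothetical wrapper around the same `cut`/`tr`/`sed` steps, using pipes instead of here-strings so that it also runs under plain `sh`.

```shell
#!/bin/sh
# Hypothetical wrapper around the naming logic quoted above.
#   $1 = corpus ID, $2 = one tab-separated line from the downloads file
build_name() {
    name=$(printf '%s\n' "$2" | cut -f 2 | tr ' ' '_')
    file=$(printf '%s\n' "$2" | cut -f 6 | sed 's,.tgz$,,')
    printf '%s__%s__%s.tgz\n' "$1" "${name}" "${file}"
}

# Assumed row layout: fields 1, 3, 4 and 5 are placeholders.
TSV_LINE=$(printf '%s\t%s\t%s\t%s\t%s\t%s' \
    "placeholder" \
    "LORELEI IL3 Dual Annotation for Simple Named Entity and Situation Frame" \
    "placeholder" \
    "placeholder" \
    "https://example.invalid/download" \
    "LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered")

build_name "LDC2016E75" "${TSV_LINE}"
```

Because field 6 already starts with the corpus ID and the corpus name, prepending `${LDC_CORPUS}` and `${CORPUS_NAME}` repeats both, which accounts for the doubling in the report.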

dowobeha / ldc_downloader

file name munging #1