Open jonmay opened 8 years ago
Thanks for the issue. What would your suggested resolution look like?
Here's the relevant section:
TSV_LINE=$(grep "${LDC_CORPUS}" "${DOWNLOAD_FILE}")
CORPUS_URL=$(cut -f 5 <<< "${TSV_LINE}")
CORPUS_NAME=$(cut -f 2 <<< "${TSV_LINE}" | tr ' ' '_')
CORPUS_FILE=$(cut -f 6 <<< "${TSV_LINE}" | sed 's,.tgz$,,')
LDC_CORPUS_FILENAME="${LDC_CORPUS}__${CORPUS_NAME}__${CORPUS_FILE}.tgz"
Filenames created by this script are somewhat abnormal.
For example, LDC2016E75 is listed in the 'file name' column of the LDC downloads page (an imperfect guess at the true filename that the web interface would download) as 'LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered', but this script downloads it as 'LDC2016E75LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_FrameLDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered.tgz'.
Note (a) the doubling of the entry and (b) the extra underscore in the first usage.
Additionally, the script seems to be hard-coded to produce .tgz files, but not all files come from LDC as .tgz. This is mostly a bug in LDC's presentation, since I haven't found a way to predict the filename ahead of time; a simple kludge in the Python version of this script is to let the user specify the filename.
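One possible shape for a fix, as a rough sketch rather than a tested patch: use the 'file name' column as the output name, prefix the corpus ID only when it is missing, and let an optional user-supplied name override everything. The function name `corpus_filename` and its third argument are hypothetical and not part of the existing script. The sketch also assumes the 'file name' column can be used as-is, so no extension is appended.

```bash
#!/usr/bin/env bash
# Sketch only: build an output filename without repeating the corpus ID
# and without assuming a .tgz extension.
# corpus_filename and its optional override argument are hypothetical.

# Usage: corpus_filename LDC_CORPUS CORPUS_FILE [USER_FILENAME]
corpus_filename() {
  local ldc_corpus="$1" corpus_file="$2" user_filename="${3:-}"

  # An explicit name supplied by the user always wins.
  if [[ -n "${user_filename}" ]]; then
    printf '%s\n' "${user_filename}"
    return 0
  fi

  # Prefix the corpus ID only if the 'file name' value does not
  # already start with it; this avoids the doubled entry.
  if [[ "${corpus_file}" == "${ldc_corpus}"* ]]; then
    printf '%s\n' "${corpus_file}"
  else
    printf '%s_%s\n' "${ldc_corpus}" "${corpus_file}"
  fi
}
```

With this, `corpus_filename "${LDC_CORPUS}" "$(cut -f 6 <<< "${TSV_LINE}")"` would yield the 'file name' value unchanged for LDC2016E75, and a caller who knows the real name could pass it as the third argument. The extension is still only as accurate as what LDC lists.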