CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

maybe_download_dataset_data() fails with example code in Read_conllu_Files.ipynb #227

Closed: frreiss closed this issue 2 years ago

frreiss commented 2 years ago

Cell 3 of Read_conllu_Files.ipynb contains this line of code:

conll_09_path = tp.io.conll.maybe_download_dataset_data(BASE_DIR, conll_09_test_data_url)

When I attempt to run this notebook on a freshly checked-out copy of the repository, I get the following error:

-------------------------------------------------------------------------
FileNotFoundError                       Traceback (most recent call last)
<ipython-input-4-2ca899c89315> in <module>
      1 # download the files if they have not already been downloaded
----> 2 conll_09_path = tp.io.conll.maybe_download_dataset_data(BASE_DIR, conll_09_test_data_url)
      3 conllu_ewt_path = tp.io.conll.maybe_download_dataset_data(BASE_DIR, ewt_dev_url)
      4 
      5 # if you already have access to the full conll 2009 dataset, name the file accordingly and uncomment this line

~/pd/tep-0_1/notebooks/../text_extensions_for_pandas/io/conll.py in maybe_download_dataset_data(target_dir, document_url, alternate_name)
   1327         alternate_name is None or not os.path.exists(full_path)
   1328     ):
-> 1329         with ZipFile(full_path, "r") as zipf:
   1330             fnames = zipf.namelist()
   1331             if alternate_name is not None and alternate_name in fnames:

~/opt/miniconda3/envs/pd/lib/python3.8/zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
   1249             while True:
   1250                 try:
-> 1251                     self.fp = io.open(file, filemode)
   1252                 except OSError:
   1253                     if filemode in modeDict:

FileNotFoundError: [Errno 2] No such file or directory: 'CoNLL_u_test_inputs/CoNLL2009-ST-English-trial.zip'

The logic that raises this exception appears to be incorrect: the code assumes that the zip file already exists at full_path and tries to open it with ZipFile without first downloading it.
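The failure mode can be reproduced in isolation, without the library. The following is a minimal sketch, not the code from conll.py: the helper name `open_missing_archive` and the file name `never_downloaded.zip` are made up for illustration. It shows only that opening a zip path that was never written raises the same exception type as in the traceback.

```python
import os
import tempfile
from zipfile import ZipFile


def open_missing_archive():
    """Open a zip path that was never downloaded; return the error type name.

    Hypothetical helper for illustration only; it is not part of
    text_extensions_for_pandas.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Nothing has been written here, which mirrors a fresh checkout.
        missing_path = os.path.join(tmp_dir, "never_downloaded.zip")
        try:
            with ZipFile(missing_path, "r") as zipf:
                zipf.namelist()
        except FileNotFoundError as err:
            return type(err).__name__
    return None


print(open_missing_archive())
```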

The fix for this problem should include a regression test to ensure the problem does not reappear in the future.
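One possible shape for the fix and its regression test is sketched below. This is an assumption about how it could look, not the actual patch: `download_if_missing`, `list_archive_members`, the injectable `fetch` parameter, `fake_fetch`, and the `example.invalid` URL are all hypothetical names introduced here. The idea is to check for the file and download it before opening it, and to let the test replace the network call with a stub so it can run offline against an empty directory.

```python
import os
import tempfile
import urllib.request
from zipfile import ZipFile


def download_if_missing(target_dir, url, fetch=urllib.request.urlretrieve):
    """Return the local path for `url`, downloading the file first if absent.

    Hypothetical helper; `fetch` is injectable so tests can avoid the network.
    """
    os.makedirs(target_dir, exist_ok=True)
    full_path = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    if not os.path.exists(full_path):
        fetch(url, full_path)
    return full_path


def list_archive_members(target_dir, url, fetch=urllib.request.urlretrieve):
    """Make sure the archive is present locally, then list its members."""
    full_path = download_if_missing(target_dir, url, fetch)
    with ZipFile(full_path, "r") as zipf:
        return zipf.namelist()


def fake_fetch(url, dest):
    """Test stub: write a tiny zip file to `dest` instead of downloading."""
    with ZipFile(dest, "w") as zipf:
        zipf.writestr("trial.txt", "placeholder")


def test_downloads_before_unzipping():
    """Regression test: a fresh, empty target directory must not raise."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        target_dir = os.path.join(tmp_dir, "fresh_checkout")
        url = "https://example.invalid/data.zip"  # never contacted by the stub
        names = list_archive_members(target_dir, url, fetch=fake_fetch)
        assert names == ["trial.txt"]
        assert os.path.exists(os.path.join(target_dir, "data.zip"))


if __name__ == "__main__":
    test_downloads_before_unzipping()
```

Because the test starts from a directory that does not yet exist, it exercises the same condition as a freshly checked-out copy of the repository.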