allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0
314 stars, 42 forks

Improved experience when linking non-downloadable content #93

Open seanmacavaney opened 3 years ago

seanmacavaney commented 3 years ago

Describe the solution you'd like

Have a separate file structure for non-downloadable files. Improve the linking experience by providing a command-line utility to link, or by giving the user the link command directly.

Will require a migration of existing files and (potentially) a plan for backward compatibility.

Additional context

As suggested here: https://github.com/allenai/ir_datasets/issues/89#issuecomment-879869134

seanmacavaney commented 3 years ago

Partially addressing this in #103. Will give a message like this one:

[INFO] If you have a local copy of https://ai2-s2-research-public.s3-us-west-2.amazonaws.com/ir-datasets/c4/en.noclean.checkpoints.tar.gz, you can symlink it to avoid downloading it again, e.g.:
ln -s /path/to/en.noclean.checkpoints.tar.gz /home/sean/.ir_datasets/downloads/eab00c3b5202564da998466198a01298

yuenherny commented 2 years ago

Hi @seanmacavaney, may I know how this works on Windows? There is no downloads folder in \.ir_datasets.

How do I symlink the dataset I downloaded myself? I keep running into PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\USER\\.ir_datasets\\msmarco-passage\\dev\\ms.run.tmp0'. The downloaded file gets deleted, and I need to spend another hour downloading it again 😅

seanmacavaney commented 2 years ago

I'm not very experienced with Windows, but I think you can create the missing directory with:

mkdir C:\Users\USER\.ir_datasets\downloads

And for the download, I think you can use curl:

curl.exe --output C:\Users\USER\.ir_datasets\downloads\XXX --url URL

(where XXX is the hash provided in the message and URL is the target URL)
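
For someone who already has the file locally, the same placement step can be sketched in Python so that it behaves the same on Linux and Windows. This is only an illustration, not part of ir_datasets: `place_local_copy` is a hypothetical helper, the default `~/.ir_datasets/downloads` location is taken from the paths quoted in this thread, and whether ir_datasets then accepts the file is not confirmed here.

```python
import os
import shutil
from pathlib import Path


def place_local_copy(local_file, download_hash, downloads_dir=None):
    """Make `local_file` available at <downloads_dir>/<download_hash>.

    Hypothetical helper: tries a symlink first, then a hard link, then a
    plain copy. Returns "symlink", "hardlink", or "copy".
    """
    local_file = Path(local_file).resolve()
    if not local_file.is_file():
        raise FileNotFoundError(local_file)

    if downloads_dir is None:
        # Assumed default, based on the paths quoted in this thread.
        downloads_dir = Path.home() / ".ir_datasets" / "downloads"
    downloads_dir = Path(downloads_dir)
    downloads_dir.mkdir(parents=True, exist_ok=True)

    target = downloads_dir / download_hash
    if target.exists() or target.is_symlink():
        raise FileExistsError(target)

    try:
        # On Windows this may require admin rights or Developer Mode.
        os.symlink(local_file, target)
        return "symlink"
    except (OSError, NotImplementedError):
        pass
    try:
        # Hard links only work when both paths are on the same volume.
        os.link(local_file, target)
        return "hardlink"
    except OSError:
        pass
    # Last resort: duplicate the file (costs disk space, but no download).
    shutil.copy2(local_file, target)
    return "copy"
```

Here `download_hash` is the XXX value from the [INFO] message, passed as a string.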

Hope this helps!

yuenherny commented 2 years ago

Hi @seanmacavaney, thanks for the prompt response.

By using curl, it seems like I need to download the file again, which is what I am trying to avoid, since I already have a local copy of the file.

I tried creating the symbolic link by following the instructions here.

  1. In CMD (opened with admin rights):
    mklink C:\Users\<username>\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831 C:\Users\<username>\.ir_datasets\msmarco-passage\top1000.dev.tar.gz
  2. I get this as response:
    symbolic link created for C:\Users\USER\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831 <<===>> C:\Users\USER\.ir_datasets\msmarco-passage\top1000.dev.tar.gz

Then I reran the .scoreddocs_iter() cell, but it seems to be downloading the file again (this time without the symlink instructions, though).

Right now I am letting the process finish to see what errors I encounter after trying this symlink method.
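
One way to rule out a broken link before rerunning is to check that the link path exists and resolves to a file of the expected size. The sketch below uses only the Python standard library; `describe_link` is a hypothetical name, and it says nothing about how ir_datasets itself decides whether to download.

```python
import os
from pathlib import Path


def describe_link(link_path):
    """Report whether `link_path` exists, what it resolves to, and its size."""
    link_path = Path(link_path)
    info = {
        "is_symlink": link_path.is_symlink(),
        "exists": link_path.exists(),  # False for a dangling symlink
        "resolved": None,
        "size_bytes": None,
    }
    if info["exists"]:
        info["resolved"] = os.path.realpath(link_path)
        info["size_bytes"] = link_path.stat().st_size  # stat() follows symlinks
    return info
```

Calling it on the path under `.ir_datasets\downloads` would show whether the link is dangling or points at an unexpectedly small (possibly partial) file.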

yuenherny commented 2 years ago

Apparently the software downloads the dataset again, and this time it kind of shoots itself in the foot:

Full error message:

[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz
[INFO] download error: HTTPSConnectionPool(host='msmarco.blob.core.windows.net', port=443): Read timed out.. Retrying range "121044992-" [2 attempts left]
[INFO] download error: HTTPSConnectionPool(host='msmarco.blob.core.windows.net', port=443): Read timed out.. Retrying range "245940224-" [2 attempts left]
[INFO] [finished] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz: [44:01] [687MB] [260kB/s]
[WARNING] Download failed: [WinError 5] Access is denied: 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmppzhcp6nu.tmp' -> 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmppzhcp6nu'
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:69, in Cache.verify(self)
     68 with self._streamer.stream() as stream:
---> 69     shutil.copyfileobj(stream, f)
     70 f.close() # close file before move... Needed because of Windows

File ~\AppData\Local\Programs\Python\Python310\lib\shutil.py:195, in copyfileobj(fsrc, fdst, length)
    194 while True:
--> 195     buf = fsrc_read(length)
    196     if not buf:

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:35, in IterStream.readinto(self, b)
     34 l = len(b) - pos  # We're supposed to return at most this much
---> 35 chunk = self.leftover or next(self.it)
     36 output, self.leftover = chunk[:l], chunk[l:]

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\datasets\msmarco_passage.py:52, in ExtractQidPid.__iter__(self)
     51 def __iter__(self):
---> 52     with self._streamer.stream() as stream:
     53         for line in _logger.pbar(stream, desc='extracting QID/PID pairs', unit='pair'):

File ~\AppData\Local\Programs\Python\Python310\lib\contextlib.py:135, in _GeneratorContextManager.__enter__(self)
    134 try:
--> 135     return next(self.gen)
...
-> 1206     self._accessor.unlink(self)
   1207 except FileNotFoundError:
   1208     if not missing_ok:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\USER\\.ir_datasets\\msmarco-passage\\dev\\ms.run.tmp5'

I am guessing that this is a legitimate bug - which is what you mentioned here
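
For context, the comment in the traceback ("close file before move... Needed because of Windows") refers to a general pattern: on Windows, renaming or deleting a file that still has an open handle typically raises PermissionError. The following is a generic, standard-library sketch of that pattern, not the actual ir_datasets code; `write_then_move` is a hypothetical name.

```python
import os
import tempfile


def write_then_move(chunks, final_path):
    """Write an iterable of bytes chunks to a temp file, then move it into place."""
    dir_name = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the destination so the move stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        # The `with` block has closed the handle by this point. Moving the file
        # while a handle is still open is what typically fails on Windows.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial temp file if anything went wrong, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If any reader of the temp file (for example, a stream that is still being iterated) keeps it open, the final move or cleanup can fail in the way shown above.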

Right now I am using a soft link on Windows. I will try a hard link to see whether that works better.


yuenherny commented 2 years ago

Tried a hard link on Windows via

mklink /H C:\Users\USER\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831 C:\Users\USER\.ir_datasets\msmarco-passage\top1000.dev.tar.gz

and got

Hardlink created for C:\Users\USER\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831 <<===>> C:\Users\USER\.ir_datasets\msmarco-passage\top1000.dev.tar.gz

Restarted the kernel and reran the notebook from the top, but it still tries to download from the URL again 😅:

[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz
[INFO] [error] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz: [00:26] [7.32MB] [280kB/s]
