Permissions error on /tmp/ir_dataset directory due to multiple users on the same server

mitgosp commented 1 year ago

Describe the bug When more than one user on the same server or device uses ir_datasets to fetch documents, a permission denied error may occur if one of the users does not have write access to the already-created directory.

Affected dataset(s) This issue does not affect datasets

To Reproduce Steps to reproduce the behavior:

1. User A runs a script that imports some documents using ir_datasets.
2. User B, who is on the same system, performs the same actions.
3. User B is part of the Others group on the system and hence does not have write permission to the already existing /tmp/ir_datasets directory.
4. User B sees the following error: PermissionError: [Errno 13] Permission denied: '/tmp/ir_datasets/tmp3sn3tbic'

Expected behavior When multiple users use the package on the same device, additional checks should be in place to avoid permission errors. For example, the name of the ir_datasets directory that is created for tmp files could be prefixed with the username to avoid such conflicts.

Additional context This issue can be bypassed by utilizing the IR_DATASETS_TMP environment variable.
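
As a minimal sketch of this workaround, the variable can be pointed at a directory that is private to the current user before ir_datasets is used. Only the IR_DATASETS_TMP variable comes from this report; the helper names `current_user` and `per_user_tmp_dir` and the `ir_datasets_<username>` naming are illustrative assumptions, not part of the package.

```python
import getpass
import os
import tempfile
from pathlib import Path


def current_user():
    """Best-effort user name; getpass.getuser() can fail if none can be determined."""
    try:
        return getpass.getuser()
    except Exception:
        return "unknown"


def per_user_tmp_dir(base=None):
    """Create (if needed) and return a temp directory named after the current user.

    Hypothetical helper: the directory name is an assumption for illustration.
    """
    base = Path(base) if base is not None else Path(tempfile.gettempdir())
    path = base / f"ir_datasets_{current_user()}"
    # mode=0o700 keeps the directory private on POSIX; it has little effect on Windows.
    path.mkdir(mode=0o700, parents=True, exist_ok=True)
    return path


# Set the variable before importing ir_datasets, to be safe.
os.environ["IR_DATASETS_TMP"] = str(per_user_tmp_dir())

# import ir_datasets
# dataset = ir_datasets.load("msmarco-passage/dev/small")
```

The same effect can be had by exporting IR_DATASETS_TMP in the shell before starting Python; the point is only that each user ends up with a temp directory they can write to.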

yuenherny commented 1 year ago

I ran into what looks like the same issue, but I'm not sure whether it is the same bug as yours, @mitgosp.

Tried running:

import ir_datasets
train = ir_datasets.load('msmarco-passage/dev/small')
for scoreddoc in train.scoreddocs_iter():
    scoreddoc

After the download finished (45 minutes), I got this error:

[WARNING] Download failed: [WinError 5] Access is denied: 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmpewbixzbf.tmp' -> 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmpewbixzbf'
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:69, in Cache.verify(self)
     68 with self._streamer.stream() as stream:
---> 69     shutil.copyfileobj(stream, f)
     70 f.close() # close file before move... Needed because of Windows

File ~\AppData\Local\Programs\Python\Python310\lib\shutil.py:195, in copyfileobj(fsrc, fdst, length)
    194 while True:
--> 195     buf = fsrc_read(length)
    196     if not buf:

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:35, in IterStream.readinto(self, b)
     34 l = len(b) - pos  # We're supposed to return at most this much
---> 35 chunk = self.leftover or next(self.it)
     36 output, self.leftover = chunk[:l], chunk[l:]

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\datasets\msmarco_passage.py:52, in ExtractQidPid.__iter__(self)
     51 def __iter__(self):
---> 52     with self._streamer.stream() as stream:
     53         for line in _logger.pbar(stream, desc='extracting QID/PID pairs', unit='pair'):

File ~\AppData\Local\Programs\Python\Python310\lib\contextlib.py:135, in _GeneratorContextManager.__enter__(self)
    134 try:
--> 135     return next(self.gen)
...
-> 1206     self._accessor.unlink(self)
   1207 except FileNotFoundError:
   1208     if not missing_ok:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\USER\\.ir_datasets\\msmarco-passage\\dev\\ms.run.tmp2'

P/s: The full cell output starts with the following lines, followed by the same warning and traceback as above:

[INFO] Please confirm you agree to the MSMARCO data usage agreement found at <http://www.msmarco.org/dataset.aspx>
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz, you can symlink it here to avoid downloading it again: C:\Users\USER\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831
[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz
[INFO] [finished] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz: [45:28] [687MB] [252kB/s]

yuenherny commented 1 year ago

@mitgosp Sorry if this is a stupid question, but how do I utilize the IR_DATASETS_TMP environment variable to bypass this issue?

seanmacavaney commented 1 year ago

Hi @yuenherny -- it looks like this is a different issue.

Do you have multiple processes open using ir_datasets (e.g., multiple notebook instances)? While files are downloading, only a single process can access them on Windows.

yuenherny commented 1 year ago

Hi @seanmacavaney, thanks for the prompt response.

Nope. I guess the process is still open because I tried to download multiple parts of the dataset (queries, scoreddocs, docs, qrels) in sequence in my notebook, and when one hits an error, the process isn't closed automatically.

Now that I managed to download (after restarting my laptop), I get another error:

[INFO] Please confirm you agree to the MSMARCO data usage agreement found at <http://www.msmarco.org/dataset.aspx>
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz, you can symlink it here to avoid downloading it again: C:\Users\USER\.ir_datasets\downloads\31644046b18952c1386cd4564ba2ae69
[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
[INFO] [finished] https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz: [59:56] [954MB] [265kB/s]
[WARNING] Download failed: Expected md5 hash to be 31644046b18952c1386cd4564ba2ae69 but got 9a1336b80866927a64cd43a5d820f277

Possibly due to incomplete download?

seanmacavaney commented 1 year ago

> and when one hits an error, the process isn't closed automatically

Gotcha -- thanks! This is a bug, as it should close the file in this case so others can use it. I'll look into fixing this.
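
For illustration only, here is a rough sketch of the general pattern of writing to a temp file that is guaranteed to be closed before it is moved or deleted, even when the copy fails. This is not the actual fix in ir_datasets; `copy_stream_atomically` is a hypothetical name and the code uses only the standard library.

```python
import os
import shutil
import tempfile


def copy_stream_atomically(stream, dest_path):
    """Copy a binary stream to dest_path via a temp file in the same directory.

    Hypothetical helper: the temp file handle is closed on success and on
    error, so the later move or cleanup is not blocked by an open handle
    (which matters on Windows).
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        # The with-block closes the file whether or not copyfileobj raises.
        with os.fdopen(fd, "wb") as f:
            shutil.copyfileobj(stream, f)
        # The handle is closed at this point, so the move is safe.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Remove the partial file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this shape, a failed download leaves neither an open handle nor a stray temp file behind, so a retry in the same notebook session does not trip over the earlier attempt.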

> Possibly due to incomplete download?

Yep, something went wrong with the download. It's not safe to use this version because the contents could be different, or you may be missing some records.

allenai / ir_datasets

Permissions error on /tmp/ir_dataset directory due to multiple users on the same server #206