Open mitgosp opened 1 year ago
I ran into the same issue, but I'm not quite sure whether it is the same bug as yours, @mitgosp.
Tried running:
import ir_datasets
train = ir_datasets.load('msmarco-passage/dev/small')
for scoreddoc in train.scoreddocs_iter():
    scoreddoc
After the download finished (45 mins), I got this error:
[WARNING] Download failed: [WinError 5] Access is denied: 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmpewbixzbf.tmp' -> 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmpewbixzbf'
---------------------------------------------------------------------------
PermissionError Traceback (most recent call last)
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:69, in Cache.verify(self)
68 with self._streamer.stream() as stream:
---> 69 shutil.copyfileobj(stream, f)
70 f.close() # close file before move... Needed because of Windows
File ~\AppData\Local\Programs\Python\Python310\lib\shutil.py:195, in copyfileobj(fsrc, fdst, length)
194 while True:
--> 195 buf = fsrc_read(length)
196 if not buf:
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:35, in IterStream.readinto(self, b)
34 l = len(b) - pos # We're supposed to return at most this much
---> 35 chunk = self.leftover or next(self.it)
36 output, self.leftover = chunk[:l], chunk[l:]
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\datasets\msmarco_passage.py:52, in ExtractQidPid.__iter__(self)
51 def __iter__(self):
---> 52 with self._streamer.stream() as stream:
53 for line in _logger.pbar(stream, desc='extracting QID/PID pairs', unit='pair'):
File ~\AppData\Local\Programs\Python\Python310\lib\contextlib.py:135, in _GeneratorContextManager.__enter__(self)
134 try:
--> 135 return next(self.gen)
...
-> 1206 self._accessor.unlink(self)
1207 except FileNotFoundError:
1208 if not missing_ok:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\USER\\.ir_datasets\\msmarco-passage\\dev\\ms.run.tmp2'
P.S. Full cell output:
[INFO] Please confirm you agree to the MSMARCO data usage agreement found at <http://www.msmarco.org/dataset.aspx>
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz, you can symlink it here to avoid downloading it again: C:\Users\USER\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831
[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz
[INFO] [finished] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz: [45:28] [687MB] [252kB/s]
[WARNING] Download failed: [WinError 5] Access is denied: 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmpewbixzbf.tmp' -> 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmpewbixzbf'
(The same PermissionError traceback as above follows.)
@mitgosp Sorry if this is a stupid question, but how do I use the IR_DATASETS_TMP environment variable to bypass this issue?
Hi @yuenherny -- it looks like this is a different issue.
Do you have multiple processes open using ir_datasets (e.g., multiple notebook instances)? While files are downloading, only a single process can access them on Windows.
Hi @seanmacavaney, thanks for the prompt response.
Nope. I guess the process is still open because I tried to download multiple parts of the dataset (queries, scoreddocs, docs, qrels) in sequence in my notebook, and when one hits an error, the process isn't closed automatically.
Now that I managed to download (after restarting my laptop), I get another error:
[INFO] Please confirm you agree to the MSMARCO data usage agreement found at <http://www.msmarco.org/dataset.aspx>
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz, you can symlink it here to avoid downloading it again: C:\Users\USER\.ir_datasets\downloads\31644046b18952c1386cd4564ba2ae69
[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
[INFO] [finished] https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz: [59:56] [954MB] [265kB/s]
[WARNING] Download failed: Expected md5 hash to be 31644046b18952c1386cd4564ba2ae69 but got 9a1336b80866927a64cd43a5d820f277
Possibly due to incomplete download?
> and when one hits an error, the process isn't closed automatically
Gotcha -- thanks! This is a bug, as it should close the file in this case so others can use it. I'll look into fixing this.
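For illustration, the "close the file even when the copy fails" pattern can be sketched as below. This is a generic sketch and not the actual ir_datasets code or its eventual fix; the function name `copy_stream_atomically` is hypothetical. It relies on the `with` block to release the file handle before the temp file is renamed or removed, which is what Windows requires.

```python
import contextlib
import os
import shutil
import tempfile


def copy_stream_atomically(stream, dest_path):
    """Copy a binary stream to dest_path via a temp file in the same directory.

    Hypothetical helper: the handle is always closed before the temp file is
    renamed or deleted, so Windows does not report the file as still in use.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        # The with-block closes the handle on success and on failure.
        with os.fdopen(fd, "wb") as f:
            shutil.copyfileobj(stream, f)
        # Handle is closed here, so the rename is allowed on Windows too.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Handle is already closed; remove the partial temp file and re-raise.
        with contextlib.suppress(FileNotFoundError):
            os.remove(tmp_path)
        raise
```

With this shape, a failed copy leaves neither an open handle nor a stale temp file behind, so a later retry from the same notebook process is not blocked by its own leftover lock.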
> Possibly due to incomplete download?
Yep, something went wrong with the download. It's not safe to use this version because the contents could be different, or you may be missing some records.
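To check a downloaded file by hand, its MD5 can be computed and compared with the hash from the warning. A minimal sketch using only the standard library; `md5_of_file` and `matches_expected` are hypothetical helper names, not part of ir_datasets.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """True if the file's MD5 equals the expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()
```

In the log above, the expected value is 31644046b18952c1386cd4564ba2ae69 and the download hashed to 9a1336b80866927a64cd43a5d820f277, so `matches_expected` would return False for that file and it should be downloaded again.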
Describe the bug
When more than one user on the same server or device uses ir_datasets to fetch documents, a permission-denied error can occur if one of the users does not have write access to the already-created directory.
Affected dataset(s)
This issue is not specific to any dataset.
To Reproduce
Steps to reproduce the behavior:
PermissionError: [Errno 13] Permission denied: '/tmp/ir_datasets/tmp3sn3tbic'
Expected behavior
When multiple users are using the package on the same device, some additional checks would need to be in place to avoid permission errors. For example, the ir_datasets directory that is created for tmp files could be prefixed with a username to avoid such conflicts.
Additional context
This issue can be bypassed by setting the IR_DATASETS_TMP environment variable.
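As a rough sketch of that workaround, each user can point IR_DATASETS_TMP at a directory of their own. The helper name `use_private_tmp` and the `ir_datasets_<username>` directory naming are assumptions made for illustration; only the environment variable name comes from this issue.

```python
import getpass
import os
import tempfile
from pathlib import Path


def use_private_tmp(base=None, username=None):
    """Point IR_DATASETS_TMP at a per-user directory and return its path.

    Hypothetical helper: the "ir_datasets_<username>" naming is an assumption,
    chosen so that users on a shared machine do not collide.
    """
    base = Path(base) if base is not None else Path(tempfile.gettempdir())
    username = username or getpass.getuser()
    tmp_dir = base / f"ir_datasets_{username}"
    # Create the directory up front so it is owned by the current user.
    tmp_dir.mkdir(parents=True, exist_ok=True)
    os.environ["IR_DATASETS_TMP"] = str(tmp_dir)
    return tmp_dir
```

Calling `use_private_tmp()` before `import ir_datasets` is the cautious order, since it is not confirmed here when the library reads the variable. Setting the variable in the shell or the notebook kernel's environment before starting Python should have the same effect.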