aiidateam / disk-objectstore

An implementation of an efficient "object store" (actually, a key-value store) writing files on disk and not requiring a running server
https://disk-objectstore.readthedocs.io
MIT License
15 stars 8 forks source link

Concurrency issue on Windows #37

Closed giovannipizzi closed 4 years ago

giovannipizzi commented 4 years ago

I still get randomly this error in some of the runners, sometimes, on Windows:

** STDERR OF WORKER 3
Traceback (most recent call last):
  File "D:\a\disk-objectstore\disk-objectstore\tests\concurrent_tests\periodic_worker.py", line 192, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "c:\hostedtoolcache\windows\python\3.5.4\x64\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\hostedtoolcache\windows\python\3.5.4\x64\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\hostedtoolcache\windows\python\3.5.4\x64\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\hostedtoolcache\windows\python\3.5.4\x64\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "D:\a\disk-objectstore\disk-objectstore\tests\concurrent_tests\periodic_worker.py", line 137, in main
    retrieved_content = container.get_objects_content(all_hashkeys, skip_if_missing=False)
  File "d:\a\disk-objectstore\disk-objectstore\disk_objectstore\container.py", line 552, in get_objects_content
    for obj_hashkey, stream, _ in triplets:
  File "d:\a\disk-objectstore\disk-objectstore\disk_objectstore\container.py", line 465, in get_object_stream_generator
    last_open_file = open(obj_path, mode='rb')
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmpb542gz99\\container\\loose\\2c\\4aa1cf7438e6fddb5c6c6996a4fb89f01702f6ae9b6ce8314dc14c50545eab'

This is in the worker, while trying to read loose files. The object should have been already created and synced. So this is probably happening while the file is being deleted, maybe at the same time? It is strange that this gives a PermissionError however, maybe it happens for concurrent deletion and opening?

giovannipizzi commented 4 years ago

Check the branch check_concurrency where I repeat the test 20 times (10 with packing and 10 without) to show the problem with higher probability.

giovannipizzi commented 4 years ago

I got also an error in the packer (!?)

  File "D:\a\disk-objectstore\disk-objectstore\tests\concurrent_tests\periodic_packer.py", line 41, in main
411

412
    container.pack_all_loose(compress=compress_packs)
413

414
  File "d:\a\disk-objectstore\disk-objectstore\disk_objectstore\container.py", line 930, in pack_all_loose
415

416
    with open(self._get_loose_path_from_hashkey(loose_hashkey), 'rb') as loose_handle:
417

418
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmpaqcjj8fl\\container\\loose\\c7\\8b94468bae2a28a2133497a9b0d2137de283d635647bc5edbc893538ca5eaa'
giovannipizzi commented 4 years ago

I also receive sometimes

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmpzrla06is\\container\\loose\\c1\\9a797fa1fd590cd2e5b42d1cf5f246e29b91684e2f87404b81dc345c7a56a0'

Note also that when I increase the cases in which the same object content is written multiple times (by only allowing for 257 possible values for the objects in the tests for concurrency) the Windows problems occur much more often).

I think these are mostly related to the 'hand-written' code to manage overwrites of existing files, that are not fully atomic. I plan to use now os.replace instead that should be more atomic, see this patch and the corresponding issue on the python tracker [EDIT: see 339e11bb685a1149ba660fab8c5d584b66b17d77 and possibly following commits]

giovannipizzi commented 4 years ago

Note also that this seems quite related to the issue on Mac OS #43, and it happens (as expected) while packing:

WARNING/ERROR: I got a permission error, message: [Errno 13] Permission denied: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmp5piaee02\\container\\loose\\5f\\2a8e79262dd16aa7241355c49643d54c4f215f8e9d2a2eb2697fe71d78c425'
236

237
E           |-> AFTER RE-READING: checksum=5f2a8e79262dd16aa7241355c49643d54c4f215f8e9d2a2eb2697fe71d78c425, meta={'pack_length': 715, 'type': 'packed', 'size': 715, 'pack_compressed': False, 'pack_id': 0, 'pack_offset': 151904}

(the checksum is correct in the pack, after re-reading, as in #43)

giovannipizzi commented 4 years ago

Notes on the new implementation with os.replace. I now get PermissionErrors, and if I try to read again, within the exception it cannot even read.

I can reproduce this with two processes running at the same time on Windows in the same folder:

import os
import psutil

dest_fname = 'DEST.txt'
for i in range(1000):
    this_fname = str(psutil.Process().pid) + '.txt'
    with open(this_fname, 'w') as fhandle:
        fhandle.write('CONTENT')
    try:
        os.replace(this_fname, dest_fname)
    except PermissionError:
        with open(dest_fname, 'rb') as f2:
            print(f2.read())
        raise
    print(this_fname, i)

One of the two crashes with:

Traceback (most recent call last):
  File "Z:\test-conc-replace.py", line 10, in <module>
    os.replace(this_fname, dest_fname)
PermissionError: [WinError 5] Access is denied: '3560.txt' -> 'DEST.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Z:\test-conc-replace.py", line 12, in <module>
    with open(dest_fname, 'rb') as f2:
PermissionError: [Errno 13] Permission denied: 'DEST.txt'

Therefore, os.replace creates the file on windows with modes that are exclusive and do not allow even to open it in read mode.

IMPORTANT: This might be a problem even while creating packed objects! (when running with trust_existing=False).

In addition, of course, this even prevents delete. I can run a concurrent deleted like this:

import os
import psutil

dest_fname = 'DEST.txt'
for i in range(1000):
    this_fname = str(psutil.Process().pid) + '.txt'
    with open(this_fname, 'w') as fhandle:
        fhandle.write('CONTENT')
    try:
        os.replace(this_fname, dest_fname)
    except PermissionError:
        with open(dest_fname, 'rb') as f2:
            print(f2.read())
        raise
    print(this_fname, i)

In some cases I get the same error. In others I get

Traceback (most recent call last):
  File "Z:\test-conc-delete.py", line 7, in <module>
    os.remove(dest_fname)
PermissionError: [WinError 32] The process cannot access the file because it is
being used by another process: 'DEST.txt'
giovannipizzi commented 4 years ago

See this comment on how we could try to solve this issue: https://github.com/giovannipizzi/disk-objectstore/issues/43#issuecomment-657249788

Note however that there might still be at least one corner cas, discussed in the previous comment of this issue: even when not packing, we might have an issue (with trust_existing=false) if two processes are replacing files at the same exact time. This needs to be tested and we need to see if we find a workaround. An ugly one would be to wait and retry, but with some timeout to avoid infinite loops. We would like to avoid to send back an exception to the lib user in this technically valid case.

There are more complex things one could do in reality (e.g. create a special loose object whose name is the checksum plus a uuid, assume everything went OK, and let a later maintenance operation check these special cases, by checking the checksum and deciding which version to keep).