aiidateam / disk-objectstore

An implementation of an efficient "object store" (actually, a key-value store) writing files on disk and not requiring a running server
https://disk-objectstore.readthedocs.io
MIT License
15 stars 8 forks source link

Collection of a number of important efficiency improvements #96

Closed giovannipizzi closed 3 years ago

giovannipizzi commented 4 years ago

This PR collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together.

In particular:

As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~ 50 minutes:

6714808it [00:24, 274813.02it/s]
All hashkeys listed in 24.444787740707397s.
Listing objects: 100%|████████| 6714808/6714808 [00:06<00:00, 978896.65it/s]
Copy objects: 100%|███████████| 6714808/6714808 [48:15<00:00, 2319.08it/s]
Final flush: 100%|████████████| 63236/63236 [00:07<00:00, 8582.35it/s]
Everything re-exported in 2960.980943918228s.

I think this is a excellent result. This can be compared to:

giovannipizzi commented 4 years ago

Anybody brave enough to review this PR? :-) I'm happy to explain the changes in a call, if it helps.

codecov[bot] commented 4 years ago

Codecov Report

Merging #96 into develop will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           develop       #96    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            7         7            
  Lines         1199      1502   +303     
==========================================
+ Hits          1199      1502   +303     
Impacted Files Coverage Δ
disk_objectstore/__init__.py 100.00% <100.00%> (ø)
disk_objectstore/container.py 100.00% <100.00%> (ø)
disk_objectstore/utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e7e3627...34d6671. Read the comment docs.

giovannipizzi commented 4 years ago

@chrisjsewell If possible I prefer merging (rather than squashing or rebasing) since I've been referencing the individual commits in multiple issues

chrisjsewell commented 3 years ago

Heya, gradually getting through all the code, I'm just going to update notes here as I go

tests/test_benchmark.py::test_pack_write
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.4833  0.4833  disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
1       0.0484  0.0484  0.4670  0.4670  disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
10001   0.0180  0.0000  0.2115  0.0000  disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
20003   0.1005  0.0000  0.1005  0.0000  ~:0(<built-in method posix.stat>)
10002   0.0145  0.0000  0.0799  0.0000  disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)

Documention notes:

Current experience (with AiiDA) shows that it's actually not so good to use two levels of nesting.

out of interest, why?

the move operation should be hopefully a fast atomic operation on most filesystems

are there any you know of where this is not the case?

add tox.ini:

[tox]
envlist = py37

[testenv]
usedevelop=True

[testenv:py{35,36,37,38}]
description = run pytest
extras = dev
commands = pytest {posargs:--cov=disk_objectstore}

[testenv:py{36,37,38}-pre-commit]
description = run pre-commit
extras = dev
commands = pre-commit {posargs:run}

[testenv:py{35,36,37,38}-ipython]
description = start an ipython console
deps =
    ipython
commands = ipython

Profiling snapshot (for later discussion):

$ tox -e py37 -- --benchmark-only --benchmark-cprofile=cumtime

tests/test_benchmark.py::test_has_objects
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0048  0.0048  0.4209  0.4209  disk-objectstore/disk_objectstore/container.py:778(has_objects)
10001   0.0322  0.0000  0.4123  0.0000  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
6       0.0000  0.0000  0.0896  0.0149  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3503(__iter__)
6       0.0000  0.0000  0.0886  0.0148  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3528(_execute_and_instances)
8       0.0000  0.0000  0.0874  0.0109  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/base.py:943(execute)
7       0.0000  0.0000  0.0872  0.0125  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/sql/elements.py:296(_execute_on_connection)
7       0.0001  0.0000  0.0872  0.0125  disk-objectstore/.tox/py37/lib/python3.7/site-packages

tests/test_benchmark.py::test_list_all_loose
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.0000  0.0000  ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_list_all_packed
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.0000  0.0000  ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_loose_read
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0019  0.0019  0.1003  0.1003  disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
1001    0.0046  0.0000  0.0927  0.0001  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
1000    0.0321  0.0000  0.0321  0.0000  ~:0(<built-in method io.open>)

tests/test_benchmark.py::test_pack_read
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0089  0.0089  0.1700  0.1700  disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
10001   0.0300  0.0000  0.1407  0.0000  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
10001   0.0146  0.0000  0.0677  0.0000  disk-objectstore/disk_objectstore/utils.py:922(detect_where_sorted)
20004   0.0051  0.0000  0.0498  0.0000  ~:0(<built-in method builtins.next>)
10001   0.0030  0.0000  0.0447  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1006(__iter__)
10001   0.0071  0.0000  0.0417  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1320(fetchone)
20000   0.0099  0.0000  0.0253  0.0000  disk-objectstore/disk_objectstore/utils.py:385(_update_pos)
10000   0.0093  0.0000  0.0245  0.0000  disk-objectstore/disk_objectstore/utils.py:365(__init__)
10001   0.0028  0.0000  0.0237  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1213(_fetchone_impl)
10001   0.0209  0.0000  0.0209  0.0000  ~:0(<method 'fetchone' of 'sqlite3.Cursor' objects>)
10000   0.0059  0.0000  0.0204  0.0000  disk-objectstore/disk_objectstore/utils.py:400(read)
20000   0.0154  0.0000  0.0154  0.0000  ~:0(<method 'tell' of '_io.BufferedReader' objects>)
10000   0.0080  0.0000  0.0109  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages

tests/test_benchmark.py::test_loose_write
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0018  0.0018  0.5730  0.5730  disk-objectstore/tests/test_benchmark.py:28(write_loose)
1000    0.0039  0.0000  0.5710  0.0006  disk-objectstore/disk_objectstore/container.py:842(add_object)
1000    0.0051  0.0000  0.5671  0.0006  disk-objectstore/disk_objectstore/container.py:851(add_streamed_object)
1000    0.0150  0.0000  0.4302  0.0004  disk-objectstore/disk_objectstore/utils.py:158(__exit__)
1000    0.0079  0.0000  0.2387  0.0002  disk-objectstore/disk_objectstore/utils.py:852(safe_flush_to_disk)
2000    0.0052  0.0000  0.1639  0.0001  disk-objectstore/disk_objectstore/utils.py:870(<lambda>)
2000    0.1587  0.0001  0.1587  0.0001  ~:0(<built-in method posix.fsync>)
1000    0.0051  0.0000  0.1083  0.0001  disk-objectstore/disk_objectstore/utils.py:141(__enter__)
2000    0.1044  0.0001  0.1044  0.0001  ~:0(<built-in method io.open>)

tests/test_benchmark.py::test_pack_write
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.4961  0.4961  disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
1       0.0522  0.0522  0.4945  0.4945  disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
10001   0.0193  0.0000  0.2252  0.0000  disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
20003   0.1072  0.0000  0.1072  0.0000  ~:0(<built-in method posix.stat>)
10002   0.0150  0.0000  0.0848  0.0000  disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)