Closed giovannipizzi closed 3 years ago
Anybody brave enough to review this PR? :-) I'm happy to explain the changes in a call, if it helps.
Merging #96 into develop will not change coverage. The diff coverage is 100.00%.
```diff
@@           Coverage Diff            @@
##           develop       #96   +/-  ##
==========================================
  Coverage   100.00%   100.00%
==========================================
  Files            7         7
  Lines         1199      1502   +303
==========================================
+ Hits         1199      1502   +303
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| `disk_objectstore/__init__.py` | 100.00% <100.00%> (ø) | |
| `disk_objectstore/container.py` | 100.00% <100.00%> (ø) | |
| `disk_objectstore/utils.py` | 100.00% <100.00%> (ø) | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update e7e3627...34d6671.
@chrisjsewell If possible I prefer merging (rather than squashing or rebasing), since I've been referencing the individual commits in multiple issues.

Heya, I'm gradually getting through all the code; I'll update notes here as I go:
- mention `container.clean_storage()` in basic usage
- add simple CLI
- delete empty folders after `container.clean_storage()`
- as a standalone library, integrated metadata storage would be ideal (e.g. file format, file path)
- drop Python 3.5
- use type annotations (and add mypy CI + a `py.typed` file)
- use `*` before keyword arguments
- use `pathlib` over `os.path`
- specify `"utf8"` for config JSON read/write?
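The last three notes could look like the following in practice. This is a minimal sketch with illustrative names (`write_config`/`read_config` are not the actual API): keyword-only arguments via `*`, `pathlib` instead of `os.path`, and an explicit encoding for the config JSON.

```python
import json
from pathlib import Path


def write_config(folder: Path, *, config: dict) -> None:
    """Write the container config as JSON.

    The ``*`` forces ``config`` to be passed by keyword, and the explicit
    encoding avoids depending on the platform default.
    """
    path = folder / "config.json"  # pathlib instead of os.path.join
    path.write_text(json.dumps(config), encoding="utf8")


def read_config(folder: Path) -> dict:
    """Read the container config back, with the same explicit encoding."""
    return json.loads((folder / "config.json").read_text(encoding="utf8"))
```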
checking the size of the pack every time is maybe not performant: can we memoize full packs? (I guess you'd have to be careful to remove a pack from the set if you removed anything from it):

```
tests/test_benchmark.py::test_pack_write
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1  0.0000   0.0000   0.4833   0.4833   disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
     1  0.0484   0.0484   0.4670   0.4670   disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
 10001  0.0180   0.0000   0.2115   0.0000   disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
 20003  0.1005   0.0000   0.1005   0.0000   ~:0(<built-in method posix.stat>)
 10002  0.0145   0.0000   0.0799   0.0000   disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)
```
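One possible shape for the memoization suggested above — a hypothetical sketch, not the actual container API (`get_pack_path` and `pack_size_target` stand in for `_get_pack_path_from_pack_id` and the real size limit): keep a set of pack IDs already known to be full, `stat` only packs outside the set, and drop a pack from the set whenever objects are removed from it.

```python
import os


class PackSelector:
    """Pick the pack to write to, skipping a stat() for packs known to be full."""

    def __init__(self, get_pack_path, pack_size_target):
        self._get_pack_path = get_pack_path  # maps pack_id -> file path
        self._pack_size_target = pack_size_target
        self._full_packs = set()  # pack IDs already known to be full

    def get_pack_id_to_write_to(self):
        pack_id = 0
        while True:
            if pack_id not in self._full_packs:
                path = self._get_pack_path(pack_id)
                if not os.path.exists(path) or os.path.getsize(path) < self._pack_size_target:
                    return pack_id
                # Remember the pack is full, so future calls skip the stat()
                self._full_packs.add(pack_id)
            pack_id += 1

    def forget(self, pack_id):
        """Must be called whenever objects are removed from a pack."""
        self._full_packs.discard(pack_id)
```

This trades the `posix.stat` call per write (the 20003 calls in the profile above) for a set lookup, at the cost of having to invalidate the cache on repacking or deletion.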
`_get_objects_stream_meta_generator` should yield a named tuple or dataclass (see https://pypi.org/project/dataclasses/ and https://github.com/python-attrs/attrs); the `meta` key in particular should be a dataclass rather than a dict. In fact, anywhere you are returning dicts it would be better to return a dataclass (I'm a big fan of TypeScript!), e.g. `count_objects` and `get_total_size`.
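For instance, a `count_objects`-style return value could become something like the following (the field names here are illustrative, not the actual ones):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectCount:
    """Typed replacement for a dict returned by a count_objects-style method."""
    loose: int
    packed: int
    pack_files: int


counts = ObjectCount(loose=1199, packed=303, pack_files=7)
# Attribute access is checkable by mypy, unlike counts["loose"],
# and a typo like counts.lose fails loudly instead of returning None.
total = counts.loose + counts.packed
```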
It feels like `_get_objects_stream_meta_generator` could be refactored somewhat to remove the duplication (pytest is right, it is too complex lol). Perhaps the number of retries for `loose_not_found` should be configurable.
https://github.com/aiidateam/disk-objectstore/blob/e7e3627a67e8de2030b231927a127a9fb06ae474/disk_objectstore/container.py#L991-L992 just use a set 🤷: `{hashkey}`
change `get_hash` -> `get_hash_cls`
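The rename makes sense because the function returns a hash *class* (a constructor), not a hash value. A minimal version, assuming it simply dispatches on the algorithm name (this sketch is not the actual implementation):

```python
import hashlib

# Algorithms the container is assumed to support
_HASH_CLASSES = {
    "sha256": hashlib.sha256,
    "sha1": hashlib.sha1,
}


def get_hash_cls(hash_type: str):
    """Return the hashlib constructor for ``hash_type``.

    Named ``get_hash_cls`` (not ``get_hash``) because the caller still has
    to instantiate it and feed it data.
    """
    try:
        return _HASH_CLASSES[hash_type]
    except KeyError:
        raise ValueError(f"Unknown hash type '{hash_type}'") from None


hasher = get_hash_cls("sha256")()
hasher.update(b"content")
digest = hasher.hexdigest()
```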
Documentation notes:
> Current experience (with AiiDA) shows that it's actually not so good to use two levels of nesting.

out of interest, why?

> the move operation should be hopefully a fast atomic operation on most filesystems

are there any you know of where this is not the case?
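For reference, the move being discussed is essentially `os.replace` (write to a temporary file in the destination directory, then rename into place), which is atomic on POSIX within a single filesystem; across filesystems it fails with `EXDEV`, and fallbacks like `shutil.move` degrade to a non-atomic copy+delete. A minimal sketch of the pattern:

```python
import os
import tempfile

# Write to a temporary file in the destination directory, then atomically
# move it into place: readers see either no file or the complete new one.
dest_dir = tempfile.mkdtemp()
final_path = os.path.join(dest_dir, "object")

fd, tmp_path = tempfile.mkstemp(dir=dest_dir)  # same filesystem as final_path
with os.fdopen(fd, "wb") as handle:
    handle.write(b"payload")
    handle.flush()
    os.fsync(handle.fileno())  # ensure the data is on disk before the rename

os.replace(tmp_path, final_path)  # atomic within one filesystem
```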
add a `tox.ini`:

```ini
[tox]
envlist = py37

[testenv]
usedevelop = True

[testenv:py{35,36,37,38}]
description = run pytest
extras = dev
commands = pytest {posargs:--cov=disk_objectstore}

[testenv:py{36,37,38}-pre-commit]
description = run pre-commit
extras = dev
commands = pre-commit {posargs:run}

[testenv:py{35,36,37,38}-ipython]
description = start an ipython console
deps =
    ipython
commands = ipython
```
Profiling snapshot (for later discussion):
```
$ tox -e py37 -- --benchmark-only --benchmark-cprofile=cumtime

tests/test_benchmark.py::test_has_objects
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1  0.0048   0.0048   0.4209   0.4209   disk-objectstore/disk_objectstore/container.py:778(has_objects)
 10001  0.0322   0.0000   0.4123   0.0000   disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
     6  0.0000   0.0000   0.0896   0.0149   disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3503(__iter__)
     6  0.0000   0.0000   0.0886   0.0148   disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3528(_execute_and_instances)
     8  0.0000   0.0000   0.0874   0.0109   disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/base.py:943(execute)
     7  0.0000   0.0000   0.0872   0.0125   disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/sql/elements.py:296(_execute_on_connection)
     7  0.0001   0.0000   0.0872   0.0125   disk-objectstore/.tox/py37/lib/python3.7/site-packages

tests/test_benchmark.py::test_list_all_loose
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1  0.0000   0.0000   0.0000   0.0000   ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_list_all_packed
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1  0.0000   0.0000   0.0000   0.0000   ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_loose_read
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1  0.0019   0.0019   0.1003   0.1003   disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
  1001  0.0046   0.0000   0.0927   0.0001   disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
  1000  0.0321   0.0000   0.0321   0.0000   ~:0(<built-in method io.open>)

tests/test_benchmark.py::test_pack_read
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1  0.0089   0.0089   0.1700   0.1700   disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
 10001  0.0300   0.0000   0.1407   0.0000   disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
 10001  0.0146   0.0000   0.0677   0.0000   disk-objectstore/disk_objectstore/utils.py:922(detect_where_sorted)
 20004  0.0051   0.0000   0.0498   0.0000   ~:0(<built-in method builtins.next>)
 10001  0.0030   0.0000   0.0447   0.0000   disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1006(__iter__)
 10001  0.0071   0.0000   0.0417   0.0000   disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1320(fetchone)
 20000  0.0099   0.0000   0.0253   0.0000   disk-objectstore/disk_objectstore/utils.py:385(_update_pos)
 10000  0.0093   0.0000   0.0245   0.0000   disk-objectstore/disk_objectstore/utils.py:365(__init__)
 10001  0.0028   0.0000   0.0237   0.0000   disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1213(_fetchone_impl)
 10001  0.0209   0.0000   0.0209   0.0000   ~:0(<method 'fetchone' of 'sqlite3.Cursor' objects>)
 10000  0.0059   0.0000   0.0204   0.0000   disk-objectstore/disk_objectstore/utils.py:400(read)
 20000  0.0154   0.0000   0.0154   0.0000   ~:0(<method 'tell' of '_io.BufferedReader' objects>)
 10000  0.0080   0.0000   0.0109   0.0000   disk-objectstore/.tox/py37/lib/python3.7/site-packages

tests/test_benchmark.py::test_loose_write
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1  0.0018   0.0018   0.5730   0.5730   disk-objectstore/tests/test_benchmark.py:28(write_loose)
  1000  0.0039   0.0000   0.5710   0.0006   disk-objectstore/disk_objectstore/container.py:842(add_object)
  1000  0.0051   0.0000   0.5671   0.0006   disk-objectstore/disk_objectstore/container.py:851(add_streamed_object)
  1000  0.0150   0.0000   0.4302   0.0004   disk-objectstore/disk_objectstore/utils.py:158(__exit__)
  1000  0.0079   0.0000   0.2387   0.0002   disk-objectstore/disk_objectstore/utils.py:852(safe_flush_to_disk)
  2000  0.0052   0.0000   0.1639   0.0001   disk-objectstore/disk_objectstore/utils.py:870(<lambda>)
  2000  0.1587   0.0001   0.1587   0.0001   ~:0(<built-in method posix.fsync>)
  1000  0.0051   0.0000   0.1083   0.0001   disk-objectstore/disk_objectstore/utils.py:141(__enter__)
  2000  0.1044   0.0001   0.1044   0.0001   ~:0(<built-in method io.open>)

tests/test_benchmark.py::test_pack_write
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1  0.0000   0.0000   0.4961   0.4961   disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
     1  0.0522   0.0522   0.4945   0.4945   disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
 10001  0.0193   0.0000   0.2252   0.0000   disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
 20003  0.1072   0.0000   0.1072   0.0000   ~:0(<built-in method posix.stat>)
 10002  0.0150   0.0000   0.0848   0.0000   disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)
```
This PR collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together.

In particular:

- `clean_storage` now provides an option to VACUUM the DB. VACUUM is also called after repacking. Fixes #94
- The `export` function (still existing, but deprecated) has been changed to an `import_objects` function.
- `pack_all_loose` and `import_objects` provide an option to perform a fsync to disk or not (see also #54 - there are still however calls that always use - or don't use - fsync and full_fsync on Mac). Also, `add_objects_to_pack` now allows to choose whether to commit the changes to the SQLite DB or not (delegating the responsibility to the caller; this is important e.g. in `import_objects`: calling `commit` only once at the very end gives a factor of 2 speedup for very big repos).
- `import_objects` provides a callback to e.g. show a progress bar. A `CallbackStreamWrapper` has been implemented, allowing to provide a callback (e.g. for a progress bar) when streaming a big file.
- Support for a different hash algorithm (`sha1`) in addition to the default `sha256` (fixes #82). This is faster, even if a bit less robust. This was also needed to fully test some features in `import_objects`, where the logic is optimised if both containers use the same algorithm. By default it is still better to use `sha256` everywhere, also because all generated export files will then use this algorithm and importing will be more efficient.

As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~50 minutes. I think this is an excellent result. This can be compared to: