aiidateam / disk-objectstore

An implementation of an efficient "object store" (actually, a key-value store) writing files on disk and not requiring a running server
https://disk-objectstore.readthedocs.io
MIT License

General, non-essential improvements #101

Closed chrisjsewell closed 1 year ago

chrisjsewell commented 3 years ago

This is a list of desired, but non-essential improvements noted whilst reviewing https://github.com/aiidateam/disk-objectstore/pull/96

tests/test_benchmark.py::test_pack_write
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.4833  0.4833  disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
1       0.0484  0.0484  0.4670  0.4670  disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
10001   0.0180  0.0000  0.2115  0.0000  disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
20003   0.1005  0.0000  0.1005  0.0000  ~:0(<built-in method posix.stat>)
10002   0.0145  0.0000  0.0799  0.0000  disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)
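
The profile above shows that `_get_pack_id_to_write_to` spends most of its time in `posix.stat`, called once per object to check the current pack's size. One possible improvement (a sketch only; the class name and approach are hypothetical, not part of the library) is to stat the pack file once and then track its size in memory as bytes are appended:

```python
import os

class PackSizeTracker:
    """Hypothetical sketch: cache the current pack's size in memory
    instead of calling os.stat() once per written object."""

    def __init__(self, pack_path: str):
        self._pack_path = pack_path
        # One stat at startup instead of one per object.
        self._size = (
            os.stat(pack_path).st_size if os.path.exists(pack_path) else 0
        )

    @property
    def size(self) -> int:
        return self._size

    def record_write(self, num_bytes: int) -> None:
        # Keep the cached size in sync with what was appended,
        # so deciding when to roll over to a new pack needs no stat.
        self._size += num_bytes
```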

Documentation notes:

Current experience (with AiiDA) shows that it's actually not so good to use two levels of nesting.

out of interest, why?

the move operation should be hopefully a fast atomic operation on most filesystems

are there any you know of where this is not the case?

add tox.ini:

[tox]
envlist = py37

[testenv]
usedevelop=True

[testenv:py{35,36,37,38}]
description = run pytest
extras = dev
commands = pytest {posargs:--cov=disk_objectstore}

[testenv:py{36,37,38}-pre-commit]
description = run pre-commit
extras = dev
commands = pre-commit {posargs:run}

[testenv:py{35,36,37,38}-ipython]
description = start an ipython console
deps =
    ipython
commands = ipython

Profiling snapshot (for later discussion):

$ tox -e py37 -- --benchmark-only --benchmark-cprofile=cumtime

tests/test_benchmark.py::test_has_objects
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0048  0.0048  0.4209  0.4209  disk-objectstore/disk_objectstore/container.py:778(has_objects)
10001   0.0322  0.0000  0.4123  0.0000  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
6       0.0000  0.0000  0.0896  0.0149  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3503(__iter__)
6       0.0000  0.0000  0.0886  0.0148  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3528(_execute_and_instances)
8       0.0000  0.0000  0.0874  0.0109  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/base.py:943(execute)
7       0.0000  0.0000  0.0872  0.0125  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/sql/elements.py:296(_execute_on_connection)
7       0.0001  0.0000  0.0872  0.0125  disk-objectstore/.tox/py37/lib/python3.7/site-packages

tests/test_benchmark.py::test_list_all_loose
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.0000  0.0000  ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_list_all_packed
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.0000  0.0000  ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_loose_read
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0019  0.0019  0.1003  0.1003  disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
1001    0.0046  0.0000  0.0927  0.0001  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
1000    0.0321  0.0000  0.0321  0.0000  ~:0(<built-in method io.open>)

tests/test_benchmark.py::test_pack_read
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0089  0.0089  0.1700  0.1700  disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
10001   0.0300  0.0000  0.1407  0.0000  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
10001   0.0146  0.0000  0.0677  0.0000  disk-objectstore/disk_objectstore/utils.py:922(detect_where_sorted)
20004   0.0051  0.0000  0.0498  0.0000  ~:0(<built-in method builtins.next>)
10001   0.0030  0.0000  0.0447  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1006(__iter__)
10001   0.0071  0.0000  0.0417  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1320(fetchone)
20000   0.0099  0.0000  0.0253  0.0000  disk-objectstore/disk_objectstore/utils.py:385(_update_pos)
10000   0.0093  0.0000  0.0245  0.0000  disk-objectstore/disk_objectstore/utils.py:365(__init__)
10001   0.0028  0.0000  0.0237  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1213(_fetchone_impl)
10001   0.0209  0.0000  0.0209  0.0000  ~:0(<method 'fetchone' of 'sqlite3.Cursor' objects>)
10000   0.0059  0.0000  0.0204  0.0000  disk-objectstore/disk_objectstore/utils.py:400(read)
20000   0.0154  0.0000  0.0154  0.0000  ~:0(<method 'tell' of '_io.BufferedReader' objects>)
10000   0.0080  0.0000  0.0109  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages

tests/test_benchmark.py::test_loose_write
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0018  0.0018  0.5730  0.5730  disk-objectstore/tests/test_benchmark.py:28(write_loose)
1000    0.0039  0.0000  0.5710  0.0006  disk-objectstore/disk_objectstore/container.py:842(add_object)
1000    0.0051  0.0000  0.5671  0.0006  disk-objectstore/disk_objectstore/container.py:851(add_streamed_object)
1000    0.0150  0.0000  0.4302  0.0004  disk-objectstore/disk_objectstore/utils.py:158(__exit__)
1000    0.0079  0.0000  0.2387  0.0002  disk-objectstore/disk_objectstore/utils.py:852(safe_flush_to_disk)
2000    0.0052  0.0000  0.1639  0.0001  disk-objectstore/disk_objectstore/utils.py:870(<lambda>)
2000    0.1587  0.0001  0.1587  0.0001  ~:0(<built-in method posix.fsync>)
1000    0.0051  0.0000  0.1083  0.0001  disk-objectstore/disk_objectstore/utils.py:141(__enter__)
2000    0.1044  0.0001  0.1044  0.0001  ~:0(<built-in method io.open>)
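
The `test_loose_write` profile above is dominated by `posix.fsync` (two calls per object, via `safe_flush_to_disk`). This is the standard durable-write pattern: fsync the file's contents, then fsync the containing directory so the new directory entry itself survives a crash. A POSIX-only sketch (hypothetical helper name, not the library's actual implementation):

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Hypothetical sketch of the fsync-file-then-fsync-directory
    pattern; this is why each loose write pays two fsync calls."""
    with open(path, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # first fsync: the file contents
    # Second fsync: the parent directory, so the directory entry
    # pointing at the new file is durable too (POSIX-only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```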

tests/test_benchmark.py::test_pack_write
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.4961  0.4961  disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
1       0.0522  0.0522  0.4945  0.4945  disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
10001   0.0193  0.0000  0.2252  0.0000  disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
20003   0.1072  0.0000  0.1072  0.0000  ~:0(<built-in method posix.stat>)
10002   0.0150  0.0000  0.0848  0.0000  disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)

Originally posted by @chrisjsewell in https://github.com/aiidateam/disk-objectstore/pull/96#issuecomment-700491481

giovannipizzi commented 1 year ago

It feels like _get_objects_stream_meta_generator could be refactored somewhat to remove the duplication (pytest is right, it is too complex, lol). Perhaps the number of retries for loose_not_found should be configurable.

It is. But the number of retries is not arbitrary: its current value is chosen to guarantee correct behaviour even when a concurrent pack_all_loose and clean_storage are running.
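
The retry logic being discussed could be sketched roughly as follows (the function name, ordering, and signature here are hypothetical, not the library's actual API): if an object is not found loose, it may have just been packed and its loose copy deleted by clean_storage, so the lookup is re-attempted a bounded number of times.

```python
def get_object_with_retries(lookup_loose, lookup_packed, max_retries=2):
    """Hypothetical sketch of the loose-not-found retry pattern.

    lookup_loose / lookup_packed are callables returning the object
    or None. A concurrent pack_all_loose + clean_storage may move an
    object between the two checks, so we retry a bounded number of
    times before concluding it really does not exist.
    """
    for attempt in range(max_retries + 1):
        result = lookup_packed()
        if result is not None:
            return result
        result = lookup_loose()
        if result is not None:
            return result
        # Object may have been packed between the two checks: retry.
    raise FileNotFoundError("object not found, neither loose nor packed")
```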

Current experience (with AiiDA) shows that it's actually not so good to use two levels of nesting.

out of interest, why?

The issue is the number of inodes. Every folder is an inode, and on "typical" filesystems (e.g. ext3, ext4) each costs about 4 kB. Consider two levels of nesting: you have 256 folders per level, so a total of 256 * 256 = 65536 folders.

You do indeed get fewer files per sub-sub-folder on average (e.g. with 1 million objects, you get ~15 files per folder; with a single nesting level, you get ~4000). But in the end, having 4000 files in a folder is not a big problem! There is no need to keep only a handful of files per subfolder. And with disk-objectstore, the idea is that if you have 1,000,000 objects you should probably have already packed them rather than keeping them loose. So there is no need for two levels of nesting. (Reading the comment now, I realise I should not have said it's "not so good"; it's probably just not needed.)
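
To make the arithmetic concrete (illustration only, not part of the library): with 256-way prefix sharding, two levels give 256 * 256 = 65536 leaf folders, which at ~4 kB per folder inode is roughly 256 MB of overhead before storing a single object.

```python
def avg_files_per_folder(num_objects: int, levels: int, fanout: int = 256) -> float:
    """Average loose files per leaf folder, with `levels` of
    `fanout`-way prefix sharding (e.g. two hex characters per level)."""
    return num_objects / (fanout ** levels)

# One million loose objects:
one_level = avg_files_per_folder(1_000_000, levels=1)   # ~3906 files/folder
two_levels = avg_files_per_folder(1_000_000, levels=2)  # ~15 files/folder

# Inode cost of the two-level layout, at ~4 kB per folder:
folder_overhead_mb = 256 * 256 * 4 / 1024  # 256.0 MB
```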

the move operation should be hopefully a fast atomic operation on most filesystems

are there any you know of where this is not the case?

I didn't check, but I'm sure this is not the case for non-standard filesystems: especially network shares, or, even worse, mounts of things that are not really filesystems, such as an S3 bucket mounted as one. Good network filesystems probably give reasonable guarantees here (though I'm not sure they also provide instantaneous consistency between different machines), but I'm quite sure atomicity is not even possible on a mounted S3 bucket, since IIRC there is no concept of "moving" an object in S3: you have to create a new one and delete the old one.
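
The write-then-rename pattern under discussion can be sketched as follows (hypothetical helper; POSIX semantics assumed). On local POSIX filesystems `os.replace()` is an atomic rename, so readers see either the old state or the complete new file; on S3-backed mounts, as noted above, the rename degrades to copy+delete and the guarantee is lost.

```python
import os
import tempfile

def atomic_publish(data: bytes, dest_path: str) -> None:
    """Hypothetical sketch: write to a temp file, then atomically
    rename it into place. Atomic on POSIX local filesystems only."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temp file in the SAME directory as the destination,
    # so the rename cannot cross a filesystem boundary (which would
    # silently degrade to a non-atomic copy).
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, dest_path)  # the atomic step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```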

These are corner cases, but we already know of people who would like to use this on top of e.g. S3 (we even had an issue, #17, to discuss this), so in my opinion it's important to document.