*Closed by @chrisjsewell 1 year ago*
> It feels like `_get_objects_stream_meta_generator` could be refactored somewhat to remove the duplication (pytest is right, it is too complex lol). Perhaps the number of retries for `loose_not_found` should be configurable.
It is. But the number of retries is not random: the current number is designed to guarantee that it works correctly when there is a concurrent `pack_all_loose` and `clean_storage`.
> Current experience (with AiiDA) shows that it's actually not so good to use two levels of nesting.

Out of interest, why?
The issue is the number of inodes. Every folder is an inode, and on "typical" filesystems (e.g. ext3, ext4) it costs 4 kB or so. Consider two levels of nesting: with 256 folders per level, you get a total of 256 * 256 = 65536 folders.

You do get fewer files per sub-sub-folder on average (e.g. with 1 million objects, you have on average ~15 files per folder; with a single nesting level, ~4000). But in the end, having 4000 files in a folder is not that big of a problem! There is no need to keep only a handful of files per subfolder. And with disk-objectstore, the idea is that if you have 1,000,000 objects you should probably have packed them already, and not keep them loose. So there is no need for two levels of nesting. (Reading the comment now, I realise I should not have said it's "not so good"; probably it's just not needed.)
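The arithmetic above can be checked with a quick back-of-the-envelope script (a purely illustrative helper, not part of the disk-objectstore API):

```python
# Back-of-the-envelope: folder count vs. average files per folder for
# hex-prefix sharding (256 subfolders per nesting level, as with a
# two-hex-character prefix). Illustrative only.

def sharding_stats(num_objects: int, levels: int, fanout: int = 256):
    """Return (total_folders, avg_files_per_leaf_folder) for a nesting depth."""
    leaf_folders = fanout ** levels
    # Total folders includes the intermediate levels: fanout + fanout^2 + ...
    total_folders = sum(fanout ** lvl for lvl in range(1, levels + 1))
    return total_folders, num_objects / leaf_folders

print(sharding_stats(1_000_000, levels=1))  # 256 folders, ~3906 files/folder
print(sharding_stats(1_000_000, levels=2))  # 65792 folders, ~15 files/folder
```

With 1 million objects, two levels cost ~65k inodes just for the directory tree, while a single level leaves a very manageable ~4000 files per folder.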
> the move operation should be hopefully a fast atomic operation on most filesystems

Are there any you know of where this is not the case?
I didn't check, but I'm sure this is not the case for non-standard filesystems (especially network shares or similar, or even worse, mounts of things that are not supposed to be a filesystem, such as an S3 share). Good network filesystems probably give decent guarantees on this (though I'm not sure they also provide instantaneous consistency between different network machines), but I'm quite sure it's not even possible if you mount an S3 share, since (IIRC) there is no concept of "moving" an object in S3: you have to create a new one and delete the old one.
These are corner cases, but we already know of people who would like to use this on top of e.g. S3 (we even had an issue to discuss this, #17), so in my opinion it's important to specify.
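For reference, the guarantee being discussed is the one `os.replace` (i.e. `rename(2)`) gives on local POSIX filesystems. A minimal sketch of the write-then-rename pattern (hypothetical helper, not the library's actual code):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write `data` to `path` via a temporary file plus an atomic rename.

    On local POSIX filesystems, os.replace() maps to rename(2), which is
    atomic: readers see either the old file or the new one, never a partial
    write. As discussed above, this does NOT generally hold on network
    shares or on object stores mounted as filesystems (e.g. S3 via FUSE),
    where a "move" may be implemented as copy-then-delete.
    """
    dirname = os.path.dirname(path) or "."
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename degrades to a copy and is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, path)  # the atomic step
    except BaseException:
        os.unlink(tmp_path)
        raise
```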
This is a list of desired, but non-essential improvements noted whilst reviewing https://github.com/aiidateam/disk-objectstore/pull/96
- ~~`container.clean_storage()` in basic usage~~ [GP's comment: I don't think it's useful, it's only very few folders that will be soon recreated]
- `*` before key-word arguments
- `_get_objects_stream_meta_generator`: ~~yield a named tuple or dataclass (see https://pypi.org/project/dataclasses/ and https://github.com/python-attrs/attrs), also for the `meta` key; this should be a dataclass rather than a dict~~ [GP: moved to #150] ~~and actually anywhere you are returning dicts, it would be better to return a dataclass (I'm a big fan of TypeScript!), like `count_objects` and `get_total_size`~~ [GP: moved to #150]
- It feels like `_get_objects_stream_meta_generator` could be refactored somewhat to remove the duplication (pytest is right, it is too complex lol). Perhaps the number of retries for `loose_not_found` should be configurable.
- https://github.com/aiidateam/disk-objectstore/blob/e7e3627a67e8de2030b231927a127a9fb06ae474/disk_objectstore/container.py#L991-L992 just use a set 🤷 (`{hashkey}`)
- change `get_hash` -> `get_hash_cls`
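To illustrate the dataclass suggestion a few items up (the field names here are hypothetical, chosen for the example; they are not the library's actual `meta` keys):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ObjectMeta:
    """Typed replacement for a `meta` dict (illustrative fields only)."""
    hashkey: str
    size: int
    packed: bool
    pack_id: Optional[int] = None  # only set when the object is packed

meta = ObjectMeta(hashkey="abc123", size=42, packed=False)
print(meta.size)  # attribute access (and mypy-checkable) instead of meta["size"]
```

Compared to a plain dict, typos in field names fail loudly, defaults are explicit, and type checkers can verify every access site.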
Documentation notes:

- out of interest, why?
- are there any you know of where this is not the case?
- add `tox.ini`:
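A minimal sketch of what such a `tox.ini` could look like (the env list and the `dev` extra are assumptions for illustration, not taken from the repository):

```ini
# Hypothetical sketch of a tox.ini; adjust envlist/extras to the project.
[tox]
envlist = py38

[testenv]
# Install the package with its test dependencies, then run the test suite.
extras = dev
commands = pytest {posargs}
```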
Profiling snapshot (for later discussion):
Originally posted by @chrisjsewell in https://github.com/aiidateam/disk-objectstore/pull/96#issuecomment-700491481