Open · giovannipizzi opened this issue 2 years ago
Just one thing I noticed while running some tests. After the storage migration, the resulting repository has all files stored in packs, but in a non-optimal way, meaning that they occupy unnecessary space that can be freed with maintenance operations.
For example, if I import the "Pyrene-based metal organic frameworks" (link) database in AiiDA v1.6.5, then switch to the current develop branch and migrate the storage, I get the following stats:
```
(aiida) ~/workdir# verdi storage info --statistics
(...)
repository:
  SHA-hash algorithm: sha256
  Compression algorithm: zlib+1
  Packs: 1
  Objects:
    unpacked: 0
    packed: 26101
  Size (MB):
    unpacked: 0.0
    packed: 1721.5623865127563
    other: 4.25
```
If I then run `verdi storage maintain --full` and get the statistics again, I get:
```
repository:
  SHA-hash algorithm: sha256
  Compression algorithm: zlib+1
  Packs: 1
  Objects:
    unpacked: 0
    packed: 26101
  Size (MB):
    unpacked: 0.0
    packed: 1223.617220878601
    other: 4.03125
```
Note that the number of packed objects is the same, but the packed size is reduced by almost a third (from ~1722 MB to ~1224 MB).
I personally think it would be nice if writing directly into the packs resulted in optimally packed storage, even if that takes a bit of extra time at the end. Or maybe we should give the option to choose. For now I just mention it as a detail we may want to consider when dealing with this.
@ramirezfranciscof if your issue is about the migration, we can just run a maintenance operation right after it. But I think it's better to just inform the user and let them decide when to do it (I think we can actually also quickly estimate how much space would be freed by looking at the statistics from disk-objectstore, i.e. by comparing the size of the packs on disk with the total size of the packed objects; see the sketch at the end of this comment).
If you are thinking more specifically about imports, I think it's tricky: to import efficiently directly into packs, we need to stream directly into the pack file, unless we want to write each object somewhere first and then copy it into the pack. So we only discover at the end whether there are duplicates.
One reason I decided, in that situation, not to go back and overwrite a duplicate with the next object to be written is the following: if you are importing a lot of small objects and you already have all of them (very possible, e.g. if you re-import an archive or import something with similar files), you would potentially end up writing many times to the very same hard-disk sectors. (With caching and some cleverness of the OS and the filesystem, these writes might be delayed rather than hitting the disk immediately, but one cannot rely on that.) I think this can really risk physically wearing out some disks, especially SSDs. So it is better to keep writing, and then "fix" the problem in one go at the end, with a maintenance operation.
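A minimal sketch of that estimate, assuming direct access to the profile's disk-objectstore container; the container path is hypothetical, and the key names returned by `Container.get_total_size()` follow recent disk-objectstore versions and should be double-checked against the installed one:

```python
from disk_objectstore import Container

# Hypothetical path: adapt to where the profile's container actually lives.
container = Container('/path/to/.aiida/repository/profile-name/container')

sizes = container.get_total_size()
# Key names are an assumption; newer versions may return a dataclass instead of a dict.
size_of_packed_objects = sizes['total_size_packed']            # total (compressed) size of the packed objects
size_of_packs_on_disk = sizes['total_size_packfiles_on_disk']  # actual size of the pack files on disk

reclaimable = size_of_packs_on_disk - size_of_packed_objects
print(f'Space a full repack could free: ~{reclaimable / 1e6:.1f} MB')
```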
Well, I'm actually talking about both, but it is true that it is more noticeable for the migration: since you are re-generating the whole database, as a user I would be a bit surprised if the process stopped at a point where the storage immediately "requires" extra maintenance. A similar thing would happen if I imported into a brand-new profile, but it is true that when importing into a profile that is already populated there should be no expectation that the resulting storage is in optimal condition, so in that sense perhaps you are right to focus only on the migration.
In any case, this is from the point of view of the user, so I have no problem with your suggested solutions of either running the maintenance automatically afterwards or informing the user that they should run it manually. My point was just that the aforementioned expectation is very likely and worth addressing.
> @chrisjsewell how is the import now working (loose or directly to packs)?

loose

However, for efficiency there should be the option to import directly into the packs.
@giovannipizzi only if you can work out how to do this in a StorageBackend-agnostic manner: `aiida/tools/archive/imports.py::import_archive` is completely backend agnostic and, in fact, soon the code won't even be specific to the archive; it will simply be a function that transfers data from one profile to another.
This is exactly the code that streams repository files from one backend to another, i.e. with absolutely no reference to the disk-objectstore.
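For reference, the loop in question (the same one quoted again in the suggestion below) is essentially:

```python
for key, handle in repository_from.iter_object_streams(new_keys):
    backend_key = repository_to.put_object_from_filelike(handle)
    if backend_key != key:
        raise ImportValidationError(
            f'Archive repository key is different to backend key: {key!r} != {backend_key!r}'
        )
```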
Here is my suggestion on how to add this feature in an agnostic way.

- In `import_archive()`, add a parameter `exclusive_mode: bool = False`, with a docstring similar to: "if True, require exclusive access for importing. This, for some backends, might allow much higher performance - e.g. importing directly into packs for the disk-objectstore backend". Pass it down to `_add_files_to_repo`.
- In `AbstractRepositoryBackend`, beside the existing `put_object_from_filelike`, add a `put_objects_from_filelikes_exclusive` method that gets a list of streams, mirroring more or less the API of `add_streamed_objects_to_pack` in disk-objectstore (with some small changes, see below).
- Document, for `put_objects_from_filelikes_exclusive`, that it REQUIRES exclusive access, but that this must be guaranteed by the calling routine. Provide a default implementation that simply loops over `self.put_object_from_filelike`, for those backends that don't provide a fast option (I don't think this should be an error - it will just be done with exclusive access and not be fast, but that's not a problem by itself). Fast backends will re-implement it.
- In `_add_files_to_repo`, if the newly passed option `exclusive` is True (see above), instead of the current loop, which looks like this:

  ```python
  for key, handle in repository_from.iter_object_streams(new_keys):
      backend_key = repository_to.put_object_from_filelike(handle)
      if backend_key != key:
          raise ImportValidationError(
              f'Archive repository key is different to backend key: {key!r} != {backend_key!r}'
          )
  ```

  have something like this:

  ```python
  if exclusive:
      # Possibly the guard (lock) needs to be put here? Or inside the method below? See discussion above
      new_backend_keys = repository_to.put_objects_from_filelikes_exclusive(
          repository_from.iter_object_streams(new_keys), assert_keys=True
      )
      # See comments below on the actual implementation of put_objects_from_filelikes_exclusive
  else:
      # code above
  ```

- About the implementation of `put_objects_from_filelikes_exclusive`, and where it would differ from disk-objectstore's `add_streamed_objects_to_pack` API: since `repository_from.iter_object_streams(new_keys)` does not yield only the object streams but pairs `(key, object_stream)`, I think `put_objects_from_filelikes_exclusive` should also accept a generator of pairs, so that the call is very straightforward, as shown above. If `assert_keys` is False, the first element of each pair is just ignored (so you can, for instance, pass a generator yielding `(None, stream)` if you don't care). If it is True, the method will internally store the list of keys it encounters while iterating; at the end, it will also get the list of keys actually returned by the implementation (e.g. by `add_streamed_objects_to_pack` for the disk-objectstore) and assert the equality of the two lists.

Note: this is to be implemented only if we find use cases with a huge (10x or more) speed improvement - to be tested on real archives with 100,000 or more objects, otherwise one does not see any difference. Or we can close this issue if we report here some speed results showing that it's only a marginal improvement.
Adding @eimrek to the conversation, since we've been discussing this with him. Actually, for a quick performance test, I think you can temporarily/locally hack the code here: https://github.com/aiidateam/aiida-core/blob/96a4a6b632306bed3d96bb4596536ce9b06c3364/aiida/tools/archive/imports.py#L1106-L1112 and call directly the disk-objectstore method that puts objects straight into packs (similar to the change above), to check the performance on a few big repositories.
An update after discussing with @mbercx on what we should do:

- Add a `verdi archive import` option that enables a "fast but blocking" mode (when available in the specific backend; otherwise just warn and fall back to the default). For the disk-objectstore this means importing directly into packs, but it should use the locking mechanism not only to check that no one else is using the profile, but also to prevent others from starting to use it while the process is running.
- Add an option to `verdi quicksetup` and similar commands, e.g. `--populate-from=` or `--import-from=`, that starts populating the profile from an archive file as soon as the profile is ready. Internally this would call back to the fast-but-blocking version of the import (anyway, you are just creating the profile, so no one else is using it) and would allow getting started with AiiDA in just one line (profile creation + data import).
One thing we should test, with the new code in develop (the future v2.0), is what happens when two archives that share some of the nodes are imported concurrently (a simple way to test this: import the very same file twice, at the very same time, in two different processes).
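A minimal sketch of such a test; the archive name is hypothetical and the `verdi archive import` command is the one discussed above:

```python
import subprocess

ARCHIVE = 'same_archive.aiida'  # hypothetical: any archive exported from AiiDA

# Start two imports of the very same archive at (almost) the same time,
# each in its own process, against the same profile.
processes = [subprocess.Popen(['verdi', 'archive', 'import', ARCHIVE]) for _ in range(2)]

for proc in processes:
    proc.wait()
    print('import finished with exit code', proc.returncode)
```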
I see a few things that can happen:
@chrisjsewell how is the import now working (loose or directly to packs, or both - and then which default)?
This issue is mostly a reminder to check this explicitly, documenting what can or cannot be done - and possibly fixing things that can easily be fixed (e.g. already adding the locking mechanism also for imports directly into the packs). I'm not adding this to the 2.0 milestone to avoid adding too much stuff (we can fix it later in 2.1), but if you want to check it for 2.0 instead, I'd be totally fine with that.