Open · giovannipizzi opened this issue 2 years ago
Just one thing I noticed while running some tests. After the storage migration, the resulting repository has all files stored in packs, but in a non-optimal way, meaning that they occupy unnecessary space that can be freed with maintenance operations.
For example, if I import the "Pyrene-based metal organic frameworks" (link) database in AiiDA v1.6.5, then switch to the current develop branch and migrate the storage, I get the following stats:
```
(aiida) ~/workdir# verdi storage info --statistics
(...)
repository:
  SHA-hash algorithm: sha256
  Compression algorithm: zlib+1
  Packs: 1
  Objects:
    unpacked: 0
    packed: 26101
  Size (MB):
    unpacked: 0.0
    packed: 1721.5623865127563
    other: 4.25
```
If I then run `verdi storage maintain --full` and get the statistics again, I get:
```
repository:
  SHA-hash algorithm: sha256
  Compression algorithm: zlib+1
  Packs: 1
  Objects:
    unpacked: 0
    packed: 26101
  Size (MB):
    unpacked: 0.0
    packed: 1223.617220878601
    other: 4.03125
```
Note that the number of packed objects is the same, but the packed size is reduced by almost a third (from ~1722 MB to ~1224 MB).
I personally think it would be nice if writing directly into the packs resulted in optimally packed storage, even if that takes a bit of extra time at the end. Or maybe we should give the option to choose. For now I just mention it as a detail we may want to consider when dealing with this.
@ramirezfranciscof if your issue is about the migration, we can just run a maintenance operation right after it. But I think it's better to just inform the user and let them decide when to do it (I think we can actually also quickly estimate how much space would be freed by looking at the statistics from disk-objectstore, i.e. by comparing the size of the packs on disk with the total size of the packed objects; see the sketch at the end of this comment).
If you are thinking more specifically about imports, I think it's tricky: to import efficiently directly into packs, we need to stream directly into the pack file, unless we want to write each object somewhere first and then copy it into the pack. So we only discover at the end whether there are duplicates.
One reason I decided, in that situation, not to go back and overwrite a duplicate with the next object to be written is the following: if you are importing a lot of small objects and you already have all of them (very possible, e.g. if you re-import an archive or import something with similar files), you would potentially end up writing many times to the very same hard-disk sectors. (With caching and some cleverness of the OS and the filesystem, these writes might be delayed rather than hitting the disk immediately, but one cannot rely on that.) I think this can really risk physically wearing out some disks, especially SSDs. So it is better to keep writing, and then "fix" the problem in one go at the end, with a maintenance operation.
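A minimal sketch of that estimate, assuming direct access to the profile's disk-objectstore container; the container path is hypothetical, and the key names returned by `Container.get_total_size()` follow recent disk-objectstore versions and should be double-checked against the installed one:

```python
from disk_objectstore import Container

# Hypothetical path: adapt to where the profile's container actually lives.
container = Container('/path/to/.aiida/repository/profile-name/container')

sizes = container.get_total_size()
# Key names are an assumption; newer versions may return a dataclass instead of a dict.
size_of_packed_objects = sizes['total_size_packed']            # total (compressed) size of the packed objects
size_of_packs_on_disk = sizes['total_size_packfiles_on_disk']  # actual size of the pack files on disk

reclaimable = size_of_packs_on_disk - size_of_packed_objects
print(f'Space a full repack could free: ~{reclaimable / 1e6:.1f} MB')
```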
Well, I'm actually talking about both, but it is true that it is more noticeable for the migration: since you are re-generating the whole database, as a user I would be a bit surprised if the process stopped at a point where the storage immediately "requires" extra maintenance. A similar thing would happen if I imported into a brand-new profile, but it is true that when importing into a profile that is already populated there should be no expectation that the resulting storage is in optimal condition, so in that sense perhaps you are right to focus only on the migration.
In any case, this is from the point of view of the user, so I have no problem with your suggested solutions of either running the maintenance automatically afterwards or informing the user that they should run it manually. My point was just that the aforementioned expectation is very likely and worth addressing.
> @chrisjsewell how is the import now working (loose or directly to packs)?

loose

However, for efficiency there should be the option to import directly into the packs.
@giovannipizzi only if you can work out how to do this in a StorageBackend-agnostic manner: `aiida/tools/archive/imports.py::import_archive` is completely backend agnostic and, in fact, soon the code won't even be specific to the archive; it will simply be a function that transfers data from one profile to another.
This is exactly the code that streams repository files from one backend to another, i.e. with absolutely no reference to the disk-objectstore.
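For reference, the loop in question (the same one quoted again in the suggestion below) is essentially:

```python
for key, handle in repository_from.iter_object_streams(new_keys):
    backend_key = repository_to.put_object_from_filelike(handle)
    if backend_key != key:
        raise ImportValidationError(
            f'Archive repository key is different to backend key: {key!r} != {backend_key!r}'
        )
```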
Here is my suggestion on how to add this feature in an agnostic way.

- In `import_archive()`, add a parameter `exclusive_mode: bool = False`, with a docstring similar to: "if True, require exclusive access for importing. This, for some backends, might allow much higher performance - e.g. importing directly into packs for the disk-objectstore backend". Pass it down to `_add_files_to_repo`.
- In `AbstractRepositoryBackend`, beside the existing `put_object_from_filelike`, add a `put_objects_from_filelikes_exclusive` method that gets a list of streams, mirroring more or less the API of `add_streamed_objects_to_pack` in disk-objectstore (with some small changes, see below).
- Document, for `put_objects_from_filelikes_exclusive`, that it REQUIRES exclusive access, but that this must be guaranteed by the calling routine. Provide a default implementation that simply loops over `self.put_object_from_filelike`, for those backends that don't provide a fast option (I don't think this should be an error - it will just be done with exclusive access and not be fast, but that's not a problem by itself). Fast backends will re-implement it.
- In `_add_files_to_repo`, if the newly passed option `exclusive` is True (see above), instead of the current loop, which looks like this:

  ```python
  for key, handle in repository_from.iter_object_streams(new_keys):
      backend_key = repository_to.put_object_from_filelike(handle)
      if backend_key != key:
          raise ImportValidationError(
              f'Archive repository key is different to backend key: {key!r} != {backend_key!r}'
          )
  ```

  have something like this:

  ```python
  if exclusive:
      # Possibly the guard (lock) needs to be put here? Or inside the method below? See discussion above
      new_backend_keys = repository_to.put_objects_from_filelikes_exclusive(
          repository_from.iter_object_streams(new_keys), assert_keys=True
      )
      # See comments below on the actual implementation of put_objects_from_filelikes_exclusive
  else:
      # code above
  ```

- About the implementation of `put_objects_from_filelikes_exclusive`, and where it would differ from disk-objectstore's `add_streamed_objects_to_pack` API: since `repository_from.iter_object_streams(new_keys)` does not yield only the object streams but pairs `(key, object_stream)`, I think `put_objects_from_filelikes_exclusive` should also accept a generator of pairs, so that the call is very straightforward, as shown above. If `assert_keys` is False, the first element of each pair is just ignored (so you can, for instance, pass a generator yielding `(None, stream)` if you don't care). If it is True, the method will internally store the list of keys it encounters while iterating; at the end, it will also get the list of keys actually returned by the implementation (e.g. by `add_streamed_objects_to_pack` for the disk-objectstore) and assert the equality of the two lists.

Note: this is to be implemented only if we find use cases with a huge (10x or more) speed improvement - to be tested on real archives with 100,000 or more objects, otherwise one does not see any difference. Or we can close this issue if we report here some speed results showing that it's only a marginal improvement.
Adding @eimrek to the conversation, since we've been discussing this with him. Actually, for a quick performance test, I think you can temporarily/locally hack the code here: https://github.com/aiidateam/aiida-core/blob/96a4a6b632306bed3d96bb4596536ce9b06c3364/aiida/tools/archive/imports.py#L1106-L1112 and call directly the disk-objectstore method that puts objects straight into packs (similar to the change above), to check the performance on a few big repositories.
An update after discussing with @mbercx on what we should do:

- Add a `verdi archive import` option that enables a "fast but blocking" mode (when available in the specific backend; otherwise just warn and fall back to the default). For the disk-objectstore this means importing directly into packs, but it should use the locking mechanism not only to check that no one else is using the profile, but also to prevent others from starting to use it while the process is running.
- Add an option to `verdi quicksetup` and similar commands, e.g. `--populate-from=` or `--import-from=`, that starts populating the profile from an archive file as soon as the profile is ready. Internally this would call back to the fast-but-blocking version of the import (anyway, you are just creating the profile, so no one else is using it) and would allow getting started with AiiDA in just one line (profile creation + data import).
One thing we should test, with the new code in develop (the future v2.0), is what happens when two archives that share some of the nodes are imported concurrently (a simple way to test this: import the very same file twice, at the very same time, in two different processes).
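A minimal sketch of such a test; the archive name is hypothetical and the `verdi archive import` command is the one discussed above:

```python
import subprocess

ARCHIVE = 'same_archive.aiida'  # hypothetical: any archive exported from AiiDA

# Start two imports of the very same archive at (almost) the same time,
# each in its own process, against the same profile.
processes = [subprocess.Popen(['verdi', 'archive', 'import', ARCHIVE]) for _ in range(2)]

for proc in processes:
    proc.wait()
    print('import finished with exit code', proc.returncode)
```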
I see a few things that can happen:
@chrisjsewell how is the import now working (loose or directly to packs, or both - and then which default)?
This issue is mostly a reminder to check this explicitly, documenting what can or cannot be done - and possibly fixing things that can easily be fixed (e.g. already adding the locking mechanism also for imports directly into the packs). I'm not adding this to the 2.0 milestone to avoid adding too much stuff (we can fix it later in 2.1), but if you want to check it for 2.0 instead, I'd be totally fine with that.