
Testing (and documenting, and possibly improving) what happens with concurrent archive imports #5389

Open giovannipizzi opened 2 years ago

giovannipizzi commented 2 years ago

One thing we should test with the new code in develop (the future v2.0) is what happens when two archives that share some of the nodes are imported concurrently (a simple way to test this: import the same file twice, at the very same time, in two different processes).
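
A minimal sketch of such a concurrency test, assuming a configured profile named test-profile and an archive file archive.aiida (both names are hypothetical) and using the v2.0 command verdi archive import:

# concurrency_test.py - launch two imports of the same archive at the same time
import subprocess

cmd = ["verdi", "-p", "test-profile", "archive", "import", "archive.aiida"]

# Launch both imports in separate processes so they run concurrently.
procs = [subprocess.Popen(cmd) for _ in range(2)]
for proc in procs:
    proc.wait()
    print("import finished with exit code", proc.returncode)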

I see a few things that can happen:

@chrisjsewell how is the import now working (loose or directly to packs, or both - and then which default)?

This issue is mostly a reminder to check this explicitly and document what can or cannot be done - and possibly to fix things that can be fixed easily (e.g. already applying the locking mechanism also to imports that go directly into the packs). I'm not adding this to the 2.0 milestone to avoid piling up too much stuff (we can fix it later in 2.1), but if you prefer to check it for 2.0, I'd be totally fine with that.

ramirezfranciscof commented 2 years ago

Just one thing that I noticed while running some tests. After the migration of the storage, the resulting repository has all files stored in packs, but in a "non-optimal way": they occupy unnecessary space that can be freed with maintenance operations.

For example, if I import the "Pyrene-based metal organic frameworks" (link) database in AiiDA v1.6.5, then switch to the current develop and migrate the storage, I get the following stats:

(aiida) ~/workdir# verdi storage info --statistics
(...)
repository:
  SHA-hash algorithm: sha256
  Compression algorithm: zlib+1
  Packs: 1
  Objects:
    unpacked: 0
    packed: 26101
  Size (MB):
    unpacked: 0.0
    packed: 1721.5623865127563
    other: 4.25

If I then run verdi storage maintain --full and get the statistics again, I get:

repository:
  SHA-hash algorithm: sha256
  Compression algorithm: zlib+1
  Packs: 1
  Objects:
    unpacked: 0
    packed: 26101
  Size (MB):
    unpacked: 0.0
    packed: 1223.617220878601
    other: 4.03125

Note that the number of packed objects is the same, but the packed size goes from about 1721.6 MB to about 1223.6 MB, i.e. it is reduced by roughly 500 MB (close to a third).

I personally think it would be nice if writing directly into packs resulted in optimal storage in the packs, even if it takes a bit of extra time at the end. Or maybe we should give the user the option to choose. For now I just mention it as a detail we may want to consider when dealing with this.

giovannipizzi commented 2 years ago

@ramirezfranciscof if your issue is about the migration, we could just run a maintenance operation right after it. But I think it's better to just inform the user and let them decide when to do it (I think we can actually also quickly estimate how much space could be reclaimed by looking at the statistics from disk-objectstore, i.e. by comparing the size of the packs on disk with the total size of the packed objects).
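
A hedged sketch of that estimate using the disk-objectstore Python API directly; the container path is hypothetical, and the exact fields returned by get_total_size() depend on the disk-objectstore version, so the key names below are only an assumption to be checked:

# estimate_reclaimable.py - rough estimate of the space a repack could free
from disk_objectstore import Container

container = Container("/path/to/profile/repository/container")  # hypothetical path
sizes = container.get_total_size()
print(sizes)  # inspect what the container actually reports

# Assuming dict-style access with these (unverified) key names: the difference between
# the pack files on disk and the packed objects is roughly what a repack can free.
reclaimable = sizes["total_size_packfiles_on_disk"] - sizes["total_size_packed_on_disk"]
print(f"Roughly {reclaimable / 1e6:.1f} MB could be reclaimed by repacking")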

If you are thinking more specifically about imports, I think it's tricky: to efficiently import directly into packs, we need to stream directly into the pack file if we don't want to write each object first and then copy it into the pack. So we only discover at the end whether there are duplicates.

One reason I decided not to go back and overwrite a duplicate with the next object to be written is the following: if you are importing a lot of small objects and you already have them all (very possible, e.g. you re-import the same archive or import something with similar files), you would potentially write to the very same hard-disk sectors a lot of times (with caching and some cleverness of the OS and the filesystem these writes might be delayed rather than hitting the disk immediately, but one does not know). I think this can really risk physically wearing out some disks, especially SSDs. So it's better to keep appending, and then "fix" the problem in one go at the end, with a maintenance operation.
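
A purely conceptual sketch of why duplicates are only discovered at the end of the stream (this is not the disk-objectstore implementation; all names are made up for illustration): the content hash is computed while the bytes are appended to the pack, so it is only known once the object has already been written.

# conceptual illustration only - hypothetical helper, not disk-objectstore code
import hashlib

def append_stream_to_pack(pack_file, index, stream, chunk_size=65536):
    """Append a stream to an open pack file; register it in the index unless it is a duplicate."""
    start = pack_file.tell()
    hasher = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)
        pack_file.write(chunk)  # the bytes are already in the pack at this point
    hashkey = hasher.hexdigest()
    if hashkey in index:
        # Duplicate: do not seek back and overwrite (to avoid rewriting the same disk
        # sectors over and over); the dead bytes stay until a repack/maintenance.
        return hashkey
    index[hashkey] = (start, pack_file.tell() - start)  # offset and length in the pack
    return hashkey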

ramirezfranciscof commented 2 years ago

> @ramirezfranciscof if your issue is about the migration, we could just run a maintenance operation right after it. But I think it's better to just inform the user and let them decide when to do it (I think we can actually also quickly estimate how much space could be reclaimed by looking at the statistics from disk-objectstore, i.e. by comparing the size of the packs on disk with the total size of the packed objects).
>
> If you are thinking more specifically about imports, I think it's tricky: to efficiently import directly into packs, we need to stream directly into the pack file if we don't want to write each object first and then copy it into the pack. So we only discover at the end whether there are duplicates.

Well, I'm actually talking about both, but it is true that it is more "noticeable" for the migration: since you are re-generating the whole database, as a user I would be a bit surprised if it stopped at a point where the storage immediately "requires" extra maintenance. A similar thing would happen if I import into a brand new profile, but it is true that when importing into a profile that is already in use there should be no expectation that the resulting storage is in optimal condition, so in that sense perhaps you are right to focus just on the migration.

In any case, this is from the point of view of the user, so I have no problem with your suggested solutions of just running the maintenance afterwards automatically or informing the user they should manually do it. My point here was just that the aforementioned expectation is very likely and worth addressing.

chrisjsewell commented 2 years ago

> @chrisjsewell how is the import now working (loose or directly to packs

loose

> However for efficiency there should be the option to import directly in the packs

@giovannipizzi only if you can work out how to do this in a StorageBackend-agnostic manner: aiida/tools/archive/imports.py::import_archive is completely backend-agnostic and, in fact, soon the code won't even be specific to the archive; it will simply be a function that transfers data from one profile to another.

This is the exact code that streams repository files from one backend to another:

https://github.com/aiidateam/aiida-core/blob/09765ecbd8280b629da5a804bdaa6353fc0a3179/aiida/tools/archive/imports.py#L1098-L1112

i.e. absolutely no reference to the disk-objectstore
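
A hedged sketch of what such backend-agnostic streaming looks like; the method names used here (iter_object_streams, put_object_from_filelike) are my recollection of the aiida-core 2.x repository-backend interface and should be checked against the linked code:

# backend-agnostic transfer sketch (method names are assumptions, see above)
def transfer_repository_objects(repository_from, repository_to, keys):
    """Stream the objects identified by `keys` from one repository backend to another."""
    key_mapping = {}
    for key, stream in repository_from.iter_object_streams(keys):
        # The target backend decides how it stores the stream (loose objects, packs, ...):
        # nothing here refers to disk-objectstore.
        key_mapping[key] = repository_to.put_object_from_filelike(stream)
    return key_mapping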

giovannipizzi commented 2 years ago

Here is my suggestion on how to add this feature in an agnostic way.

Note: this should be implemented only if we have use cases with a huge (10x or more) speed improvement - to be tested on real archives, but with 100,000 or more objects, otherwise one does not see any difference. Alternatively, we can close this issue if we report here some speed results showing that it's only a marginal improvement.

Adding @eimrek to the conversation since we've been discussing this with him. Actually, for a quick performance test, I think you can temporarily/locally hack the code here: https://github.com/aiidateam/aiida-core/blob/96a4a6b632306bed3d96bb4596536ce9b06c3364/aiida/tools/archive/imports.py#L1106-L1112 and directly call the disk-objectstore method that puts objects directly into packs (similar to the change above), to check the performance on a few big repositories.
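
A small timing harness for that comparison could look like the sketch below (profile and archive names are hypothetical): run the import of a large archive once with the current loose-object code path and once with the locally patched direct-to-packs version, on fresh profiles, and compare wall-clock times.

# timing sketch - compare import wall-clock time before/after the local hack
import subprocess
import time

def time_import(profile, archive):
    """Return the wall-clock time (in seconds) of importing `archive` into `profile`."""
    start = time.monotonic()
    subprocess.run(["verdi", "-p", profile, "archive", "import", archive], check=True)
    return time.monotonic() - start

# Run once on a fresh profile with the unmodified code (loose objects), and once on
# another fresh profile with the locally patched direct-to-packs version.
print("import took", time_import("fresh-profile", "big_archive.aiida"), "seconds")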

giovannipizzi commented 11 months ago

An update after discussing with @mbercx about what we should do