aiidateam / disk-objectstore

An implementation of an efficient "object store" (actually, a key-value store) writing files on disk and not requiring a running server
https://disk-objectstore.readthedocs.io
MIT License

Efficiency issue when looping over many objects #92

Closed: giovannipizzi closed this issue 3 years ago

giovannipizzi commented 4 years ago

The function get_objects_stream_and_meta was supposed to iterate in order of offset, but after some refactoring it no longer does.

Here is a stub of a diff to show what needs to be changed. This is very important for efficiency! (The tqdm progress bar can of course be ignored.)

@@ -464,15 +465,19 @@ class Container:  # pylint: disable=too-many-public-methods

         # Operate in chunks, due to the SQLite limits
         # (see comment above the definition of self._IN_SQL_MAX_LENGTH)
-        for chunk in chunk_iterator(hashkeys_set, size=self._IN_SQL_MAX_LENGTH):
+        for chunk in tqdm.tqdm(chunk_iterator(hashkeys_set, size=self._IN_SQL_MAX_LENGTH), desc='chunks'):
             query = session.query(Obj).filter(
                 Obj.hashkey.in_(chunk)
             ).with_entities(Obj.pack_id, Obj.hashkey, Obj.offset, Obj.length, Obj.compressed,
-                            Obj.size).order_by(Obj.offset)
+                            Obj.size)
             for res in query:
                 packs[res[0]].append(ObjQueryResults(res[1], res[2], res[3], res[4], res[5]))

-        for pack_int_id, pack_metadata in packs.items():
+        for pack_int_id in sorted(packs):
+            pack_metadata = packs[pack_int_id]
+            # Sort in place
+            pack_metadata.sort(key=lambda meta: meta.offset)
+

In practice, since we query in chunks, it does not make sense to order within the query; instead, we can sort in memory afterwards.
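Putting the pieces together, the intended access pattern looks roughly like the sketch below. This is a minimal illustration only: the names Obj, ObjQueryResults and chunk_iterator are taken from the diff above, while the function signature itself is hypothetical.

from collections import defaultdict

def iter_packs_sorted_by_offset(session, hashkeys_set, in_sql_max_length):
    """Query objects in chunks, group them per pack, then sort each pack by offset."""
    packs = defaultdict(list)
    for chunk in chunk_iterator(hashkeys_set, size=in_sql_max_length):
        query = session.query(Obj).filter(Obj.hashkey.in_(chunk)).with_entities(
            Obj.pack_id, Obj.hashkey, Obj.offset, Obj.length, Obj.compressed, Obj.size)
        for res in query:
            packs[res[0]].append(ObjQueryResults(res[1], res[2], res[3], res[4], res[5]))

    # Iterate packs in order; within each pack, read objects by increasing offset
    # so every pack file is scanned sequentially instead of with random seeks.
    for pack_int_id in sorted(packs):
        pack_metadata = packs[pack_int_id]
        pack_metadata.sort(key=lambda meta: meta.offset)
        yield pack_int_id, pack_metadata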

giovannipizzi commented 4 years ago

The second aspect that required attention, and that I also addressed in 59ba6e2, is that committing to the DB is expensive, so we need to commit in bulk. I report some results here.

I used the following script to test the main logic of exporting: test-bulk-sql.py.txt

This includes the following aspects:

Here are some results.

do_db_flush_with_data_flush = False

STARTING N=None ONLY_PACK_ZERO=True copy_also_data = True
DELETE-ALL 0.9332137107849121
[(0,)]
COUNT DESTINATION EMPTY 0.0012023448944091797
205454it [00:02, 91045.07it/s]
LISTALL 2.2659332752227783 205454
IN-MEM-SORT 0.032373905181884766 205454
 48%|████████████████████████████████████▏                                       | 97747/205454 [00:05<00:06, 16013.14it/s]
DB FLUSH AT CNT=102418, LEN_MAPPINGS=99999
 96%|████████████████████████████████████████████████████████████████████████▏  | 197656/205454 [00:14<00:00, 16222.83it/s]
DB FLUSH AT CNT=203023, LEN_MAPPINGS=99999
100%|███████████████████████████████████████████████████████████████████████████| 205454/205454 [00:16<00:00, 12223.93it/s]
LAST-ROUND FINAL DB FLUSH AT CNT=205454
BULK INSERT (205454) [with data copy] 19.18240475654602
READ AND WRITE FILES CLOSED.
[(205454,)]
COUNT NEW 0.21904659271240234
[(4178255084,)]
TOTAL LENGTH OLD PACK 0 ONLY 1.093876600265503
[(4178255084,)]
TOTAL LENGTH NEW 0.044422149658203125

do_db_flush_with_data_flush = True

This is slower because there are many commits to the DB. Here the buffer size is 100MB.

STARTING N=None ONLY_PACK_ZERO=True copy_also_data=True
DELETE-ALL 0.918830156326294
[(0,)]
COUNT DESTINATION EMPTY 0.0011935234069824219
205454it [00:02, 91399.03it/s]
LISTALL 2.2576727867126465 205454
IN-MEM-SORT 0.032414913177490234 205454
  3%|██▍                                                                          | 6634/205454 [00:00<00:02, 66322.67it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=7250, LEN_MAPPINGS=7249
  4%|███▎                                                                          | 8624/205454 [00:01<00:43, 4533.11it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=12794, LEN_MAPPINGS=5544
  6%|████▊                                                                        | 12794/205454 [00:03<00:54, 3514.10it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=16544, LEN_MAPPINGS=3750
  8%|██████▏                                                                      | 16544/205454 [00:05<01:04, 2920.10it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=20397, LEN_MAPPINGS=3853
 13%|█████████▉                                                                   | 26670/205454 [00:07<00:52, 3403.10it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=26756, LEN_MAPPINGS=6359
 14%|██████████▊                                                                  | 28988/205454 [00:09<01:28, 1998.94it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=31933, LEN_MAPPINGS=5177
 16%|███████████▉                                                                 | 31933/205454 [00:12<01:48, 1599.02it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=37259, LEN_MAPPINGS=5326
 18%|█████████████▉                                                               | 37259/205454 [00:14<01:35, 1753.48it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=42244, LEN_MAPPINGS=4985
 21%|███████████████▊                                                             | 42244/205454 [00:17<01:28, 1839.15it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=46025, LEN_MAPPINGS=3781
 22%|█████████████████▏                                                           | 46025/205454 [00:19<01:32, 1727.00it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=49406, LEN_MAPPINGS=3381
 24%|██████████████████▌                                                          | 49406/205454 [00:21<01:34, 1654.58it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=55446, LEN_MAPPINGS=6040
 27%|████████████████████▊                                                        | 55446/205454 [00:24<01:23, 1805.23it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=61525, LEN_MAPPINGS=6079
 30%|███████████████████████                                                      | 61525/205454 [00:26<01:12, 1973.14it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=65668, LEN_MAPPINGS=4143
 32%|████████████████████████▌                                                    | 65668/205454 [00:29<01:16, 1836.91it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=70847, LEN_MAPPINGS=5179
 34%|██████████████████████████▌                                                  | 70847/205454 [00:32<01:11, 1878.41it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=76858, LEN_MAPPINGS=6011
 37%|████████████████████████████▊                                                | 76858/205454 [00:38<01:28, 1456.62it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=82353, LEN_MAPPINGS=5495
 40%|██████████████████████████████▊                                              | 82353/205454 [00:46<01:55, 1064.99it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=88462, LEN_MAPPINGS=6109
 43%|█████████████████████████████████▏                                           | 88462/205454 [00:49<01:31, 1281.64it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=93698, LEN_MAPPINGS=5236
 46%|███████████████████████████████████                                          | 93698/205454 [00:52<01:18, 1418.52it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=97747, LEN_MAPPINGS=4049
 48%|████████████████████████████████████▋                                        | 97747/205454 [00:54<01:12, 1484.16it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=102418, LEN_MAPPINGS=4671
 50%|█████████████████████████████████████▉                                      | 102418/205454 [00:57<01:07, 1535.14it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=105393, LEN_MAPPINGS=2975
 51%|██████████████████████████████████████▉                                     | 105393/205454 [00:59<01:10, 1416.55it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=109491, LEN_MAPPINGS=4098
 53%|████████████████████████████████████████▌                                   | 109491/205454 [01:02<01:05, 1463.87it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=115726, LEN_MAPPINGS=6235
 56%|██████████████████████████████████████████▊                                 | 115726/205454 [01:04<00:53, 1684.04it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=119802, LEN_MAPPINGS=4076
 58%|████████████████████████████████████████████▎                               | 119802/205454 [01:07<00:51, 1662.35it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=124561, LEN_MAPPINGS=4759
 61%|██████████████████████████████████████████████                              | 124561/205454 [01:09<00:47, 1706.30it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=130085, LEN_MAPPINGS=5524
 63%|████████████████████████████████████████████████                            | 130085/205454 [01:12<00:42, 1757.44it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=135006, LEN_MAPPINGS=4921
 66%|█████████████████████████████████████████████████▉                          | 135006/205454 [01:15<00:38, 1807.19it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=140431, LEN_MAPPINGS=5425
 68%|███████████████████████████████████████████████████▉                        | 140431/205454 [01:17<00:34, 1907.63it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=143989, LEN_MAPPINGS=3558
 70%|█████████████████████████████████████████████████████▉                       | 143989/205454 [01:26<01:08, 897.17it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=149423, LEN_MAPPINGS=5434
 76%|█████████████████████████████████████████████████████████▋                  | 155841/205454 [01:31<00:35, 1379.92it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=156092, LEN_MAPPINGS=6669
 77%|██████████████████████████████████████████████████████████▍                 | 157865/205454 [01:33<00:41, 1160.58it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=162135, LEN_MAPPINGS=6043
 79%|███████████████████████████████████████████████████████████▉                | 162135/205454 [01:36<00:33, 1306.15it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=168161, LEN_MAPPINGS=6026
 82%|██████████████████████████████████████████████████████████████▏             | 168161/205454 [01:38<00:24, 1504.06it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=172383, LEN_MAPPINGS=4222
 84%|███████████████████████████████████████████████████████████████▊            | 172383/205454 [01:41<00:20, 1575.12it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=177671, LEN_MAPPINGS=5288
 86%|█████████████████████████████████████████████████████████████████▋          | 177671/205454 [01:43<00:16, 1725.79it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=181183, LEN_MAPPINGS=3512
 88%|███████████████████████████████████████████████████████████████████         | 181183/205454 [01:45<00:14, 1656.71it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=186742, LEN_MAPPINGS=5559
 91%|█████████████████████████████████████████████████████████████████████       | 186742/205454 [01:48<00:10, 1784.26it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=192126, LEN_MAPPINGS=5384
 94%|███████████████████████████████████████████████████████████████████████     | 192126/205454 [01:51<00:07, 1836.40it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=197656, LEN_MAPPINGS=5530
 96%|█████████████████████████████████████████████████████████████████████████   | 197656/205454 [01:53<00:04, 1885.97it/s]FINAL DB FLUSH DURING FILE FLUSH AT CNT=203023, LEN_MAPPINGS=5367
100%|████████████████████████████████████████████████████████████████████████████| 205454/205454 [01:56<00:00, 1759.87it/s]
LAST-ROUND FINAL DB FLUSH AT CNT=205454
BULK INSERT (205454) [with data copy] 119.28469467163086
READ AND WRITE FILES CLOSED IN 5.061567544937134s.
[(205454,)]
COUNT NEW 1.4819557666778564
[(4178255084,)]
TOTAL LENGTH OLD PACK 0 ONLY 1.125060796737671
[(4178255084,)]
TOTAL LENGTH NEW 0.03821682929992676

do_db_flush_with_data_flush = True but with a buffer size of 2GB

To show that the efficiency loss does not come from the different buffer size but from the number of commits, we increase the buffer size so that a commit to the DB is forced much less often. Here are the results.

STARTING N=None ONLY_PACK_ZERO=True copy_also_data = True
DELETE-ALL 2.141411781311035
[(0,)]
COUNT DESTINATION EMPTY 0.0011713504791259766
205454it [00:02, 93141.28it/s]
LISTALL 2.215979814529419 205454
IN-MEM-SORT 0.03337550163269043 205454
 48%|████████████████████████████████████▌                                       | 98950/205454 [00:01<00:01, 80122.68it/s]DB FLUSH AT CNT=103063, LEN_MAPPINGS=99999
FINAL DB FLUSH DURING FILE FLUSH AT CNT=103063, LEN_MAPPINGS=3063
100%|███████████████████████████████████████████████████████████████████████████| 205454/205454 [00:12<00:00, 16993.75it/s]
LAST-ROUND DB FLUSH AT CNT=205454, LEN_MAPPINGS=99999
LAST-ROUND FINAL DB FLUSH AT CNT=205454
BULK INSERT (205454) [with data copy] 21.377328395843506
READ AND WRITE FILES CLOSED IN 25.771100997924805s.
[(205454,)]
COUNT NEW 0.20121407508850098
[(4178255084,)]
TOTAL LENGTH OLD PACK 0 ONLY 1.1054370403289795
[(4178255084,)]
TOTAL LENGTH NEW 0.035547494888305664

This is much faster. So, if we need to commit to the DB after every disk flush, as is currently the case in order to keep the interface clean (this is done because the data is passed to add_streamed_objects_to_pack), then we need to use a large buffer to be efficient, if possible.
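To make the bulk-commit pattern concrete, here is a rough sketch of the buffering logic, assuming a SQLAlchemy session; the flush threshold and the column names are illustrative placeholders, not the exact code of the test script.

FLUSH_THRESHOLD = 100000  # commit roughly every 100k new entries (illustrative value)

def store_metadata_in_bulk(session, metadata_iterator):
    """Accumulate row dicts in memory and write each batch with a single bulk insert."""
    mappings = []
    for meta in metadata_iterator:
        mappings.append({
            'hashkey': meta.hashkey, 'pack_id': meta.pack_id, 'offset': meta.offset,
            'length': meta.length, 'compressed': meta.compressed, 'size': meta.size,
        })
        if len(mappings) >= FLUSH_THRESHOLD:
            session.bulk_insert_mappings(Obj, mappings)  # one bulk INSERT instead of many small ones
            session.commit()
            mappings = []
    if mappings:  # final partial batch
        session.bulk_insert_mappings(Obj, mappings)
        session.commit()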

For reference, here is a simpler version of the script that creates the same output pack file:

test-bulk-sql-simple.py.txt

giovannipizzi commented 4 years ago

I also want to report here some benchmarks on copying files via Python, to see the overhead, and also to check the overhead of hashing.

Here is the script. benchmark-copy-file.py.txt

The output on my computer follows; it is an Ubuntu machine with hardware RAID1 (theospc36; the machine name reported in the comment inside the script is wrong), and the test is on a 4GB file.

Note that the first run is performed right after clearing the disk cache; the others instead exploit the disk cache (hence the difference):

4295689889 bytes in 15.879098653793335 s (257.992547795316MB/s); hash: 0893bdd0bf6193499fa748e512acc97fc44d3684ebe4d4f21c25c75eac475423, chunk size (kB): 64.0
4295689889 bytes in 2.0028188228607178 s (2045.4616621456685MB/s); hash: None, chunk size (kB): 64.0
4295689889 bytes in 9.722997426986694 s (421.3401421885331MB/s); hash: 0893bdd0bf6193499fa748e512acc97fc44d3684ebe4d4f21c25c75eac475423, chunk size (kB): 1024.0
4295689889 bytes in 1.9658374786376953 s (2083.9408968966636MB/s); hash: None, chunk size (kB): 1024.0
4295689889 bytes in 9.968029499053955 s (410.982844580679MB/s); hash: 0893bdd0bf6193499fa748e512acc97fc44d3684ebe4d4f21c25c75eac475423, chunk size (kB): 10240.0
4295689889 bytes in 2.258340358734131 s (1814.0264387258417MB/s); hash: None, chunk size (kB): 10240.0

So, using the cache and without hashing, we go at ~2GB/s; with hashing, we go down to a still acceptable ~410MB/s. The first read without the disk cache goes at ~250MB/s, so I think the hashing time is comparable to reading directly from disk.

These results show that it is efficient to copy via Python (in chunks).
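For context, the core of such a benchmark is roughly the following. This is an assumed sketch using hashlib.sha256; the actual benchmark-copy-file.py script may differ in its details.

import hashlib
import time

def copy_with_optional_hash(src_path, dest_path, chunk_size=64 * 1024, do_hash=True):
    """Copy a file in chunks, optionally hashing it on the fly, and report the speed."""
    hasher = hashlib.sha256() if do_hash else None
    copied = 0
    start = time.time()
    with open(src_path, 'rb') as fin, open(dest_path, 'wb') as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            if hasher is not None:
                hasher.update(chunk)
            fout.write(chunk)
            copied += len(chunk)
    elapsed = time.time() - start
    print(f'{copied} bytes in {elapsed} s ({copied / elapsed / 1024 / 1024}MB/s); '
          f'hash: {hasher.hexdigest() if hasher else None}, chunk size (kB): {chunk_size / 1024}')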

giovannipizzi commented 4 years ago

The remaining difficult issue is how to get a large subset of objects from a large DB in an efficient way.

Even with the fixes reported above (bulk commits, reading the pack files in sorted order; see acc2186), the export is still not performed at full speed and there is room for optimisation.

I am running with a tqdm bar, which however gets 'stuck' every time we need to add new objects. One part is clearly the writing and the commit (which is inevitable). But:

  1. At the very beginning, we call get_objects_stream_and_meta. This needs to perform the following call: https://github.com/aiidateam/disk-objectstore/blob/acc2186dd9e9d4ae2113739d2fbcd279a34b2052/disk_objectstore/container.py#L455-L472 This is very efficient if only a few objects are requested, because thanks to the indices it scales well even with a very big DB (the call is very quick even for 6M entries). However, the call is very slow (~4 minutes for a 6M DB, when asking for all nodes), while listing all objects would take only ~20-30s. So there is a threshold, depending on the number of entries in the DB and the number of entries requested, beyond which it's faster to retrieve everything and filter in memory (if there is enough memory).
  2. At the beginning of each call to add_streamed_objects_to_pack, a number of queries have to be performed. If we want export not to call any private method on the other repository, we have to see whether we can optimise these calls in some way. Maybe we could just get the list of UUIDs on the destination, make a set, and only send the new ones, in this case without using no_holes; this would therefore speed up the procedure a lot (see the sketch after this list).
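A hedged sketch of the idea in point 2: list all hashkeys of the destination once, keep them in a set, and filter on the source side. The helper below is purely illustrative and not part of the current API.

def select_new_hashkeys(source_hashkeys, destination_hashkeys):
    """Return only the hashkeys that the destination does not already have."""
    existing = set(destination_hashkeys)  # one full listing of the destination, kept in memory
    # Set membership is O(1), so filtering millions of hashkeys is cheap once listed
    return [hashkey for hashkey in source_hashkeys if hashkey not in existing]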

There might also be other reasons why the export procedure can still be optimised. Some possible reasons, not investigated:

Also, a note on timing: when exporting the full DB (6.7M objects), this is the timing (process ongoing):

Objects:  54%|█████████████████████████████████████████████▌                                      | 3641189/6714808 [45:26<10:05, 5074.11it/s]
(it already exported 12 packs, ~54GB).

So it's exporting at a relatively slow ~20MB/s, ~1400 objects/s (the larger it/s value shown in the bar is because it has some peaks and then stops to perform other operations, probably committing, flushing, querying again, ...).

At the end, it took:

[...]
Objects: 100%|██████████████████████████████████████████████████████████████████████████████████| 6714808/6714808 [1:50:49<00:00, 1009.83it/s]
List existing: 100%|█████████████████████████████████████████████████| 6586911/6586911 [00:29<00:00, 223153.85it/s]
Bulk storing: 100%|██████████████████████████████████████████████████████| 6586911/127897 [01:14<00:00, 1722.82it/s]
Everything re-exported in 6759.558407306671s.

real    113m4.466s
user    23m57.540s
sys     4m37.980s

It wrote packs up to id 23, for a total size of ~94GB. So the final combined speed is ~14MB/s and ~1000 objects/s.

The last two lines are the output of the callbacks of the final add_objects_to_pack call, which flushes the cache containing the last 127897 objects. As we can see, every call starts by listing all existing objects, which takes ~30 seconds towards the end (there are 6586911 objects already in the DB to list, and it's attempting to store the last 127897). At the beginning this cost is smaller, but it already shows one bottleneck.

Also, the last bulk was run at ~1700 objects/s, a bit better but still not ideal.

Some final results to show that all objects and all content were rewritten:

In [4]: cold.get_total_size()
Out[4]: 
{'total_size_packed': 100699171587,
 'total_size_packed_on_disk': 100699171587,
 'total_size_packfiles_on_disk': 168824540013,
 'total_size_packindexes_on_disk': 1247086592,
 'total_size_loose': 0}

In [5]: cnew.get_total_size()
Out[5]: 
{'total_size_packed': 100699171587,
 'total_size_packed_on_disk': 100699171587,
 'total_size_packfiles_on_disk': 100699171587,
 'total_size_packindexes_on_disk': 1246941184,
 'total_size_loose': 0}

In [6]: cnew.count_objects()
Out[6]: {'packed': 6714808, 'loose': 0, 'pack_files': 24}

In [7]: cold.count_objects()
Out[7]: {'packed': 6714808, 'loose': 0, 'pack_files': 40}

So all objects were transferred, and total_size_packed_on_disk and total_size_packed are the same (note that 100699171587/1024/1024/1024 ~ 94GB). The DB is essentially the same size (the content is different because the offsets are different, so it's normal that the two are not exactly identical). And thanks to repacking, the total_size_packfiles_on_disk went down from ~157GB to ~94GB (the rest were duplicate objects, due to the way the original packs had been generated).

IMPORTANT NOTE: while the speed should be improved, this already means that it's possible to export and/or repack a ~100GB container with ~6.7M objects in less than 2 hours, which already seems acceptable for production, even if it can be improved.

As a reference, the read (and hashing) speed during validation of the same (repacked) container is reported in this comment, and it is ~10'000 objects/s and ~145MB/s. [Here, besides the Python logic that can be improved, there is anyway an unavoidable additional cost for storing data both on disk and in the DB, but this should give a sense of how much we can hope to improve.]

giovannipizzi commented 4 years ago

As a note for the future, I tried to use temporary tables and INTERSECT, but it does not seem to help (actually, it's slower). See e.g. this script:

test-sql-intersect-and-temp-table.py.txt
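The gist of what was tried is roughly the following (a sketch only; the table and column names are assumed, and the actual script may differ):

import sqlite3

def hashkeys_present(conn: sqlite3.Connection, hashkeys):
    """Find which of the requested hashkeys exist, via a temporary table and INTERSECT."""
    conn.execute('CREATE TEMPORARY TABLE requested (hashkey TEXT PRIMARY KEY)')
    conn.executemany('INSERT INTO requested (hashkey) VALUES (?)', ((key,) for key in hashkeys))
    # INTERSECT the temporary table with the main table, instead of a large IN (...) filter
    return [row[0] for row in conn.execute(
        'SELECT hashkey FROM requested INTERSECT SELECT hashkey FROM db_object')]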

giovannipizzi commented 4 years ago

To check whether #93 helps (e.g. in export we could try to get only the new items and send only those; maybe in that case we anyway need to put them in a list in memory, but at least this would be efficient).

giovannipizzi commented 4 years ago

Merging the branch fix_92_... in giovannipizzi's fork will already provide important speed enhancements (even if they can probably still be improved) and also fixes #12, for which the full repack functionality has been implemented (and tested to be efficient). The branch also includes a few more callbacks to monitor the speed.