Note that not putting the results into sets saves only a little time (~18 s instead of 20 s); most of the time comes from getting the whole data from the DB.
Note: asking the DB for only the first few characters of the hash key (say 8) instead of the whole key will definitely save memory, but it does not help in speeding up the data retrieval: the time only changes from ~13.0 s to ~12.5 s.
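For reference, the set-based comparison mentioned in the first note could look roughly like this (a sketch only; it assumes the same `session_read`/`session_write` sessions used in the snippet below, and is not the actual benchmark code):

```python
# Load all hash keys from both sides and compare them as Python sets
# (roughly the variant whose ~20 s timing is quoted in the note above).
hashkeys_read = {row[0] for row in session_read.execute("SELECT hashkey FROM db_object")}
hashkeys_write = {row[0] for row in session_write.execute("SELECT hashkey FROM db_objblob")}

# Hash keys present in the "new" container but missing in the "old" one
missing_in_read = hashkeys_write - hashkeys_read
```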
import time

import tqdm

# session_read / session_write: assumed to be already-opened SQLAlchemy sessions
# on the databases of the two containers (the "old" and the "new" one, respectively).

t = time.time()
count_read = list(session_read.execute("SELECT count(*) FROM db_object"))[0][0]
count_write = list(session_write.execute("SELECT count(*) FROM db_objblob"))[0][0]
print(f"COUNT BOTH read={count_read} write={count_write}", time.time() - t)

t = time.time()
all_uuids = list(tqdm.tqdm(
    session_write.execute("SELECT hashkey FROM db_objblob ORDER BY hashkey"), total=count_write))
print("LIST ALL {} NEW UUIDS in".format(len(all_uuids)), time.time() - t)

NUMCHAR = 8
t = time.time()
all_uuids = list(tqdm.tqdm(session_read.execute(
    "SELECT substr(hashkey,1,{}) FROM db_object ORDER BY hashkey".format(NUMCHAR)), total=count_read))
print("LIST ALL {} OLD UUIDS (ONLY FIRST {} CHARS) in".format(len(all_uuids), NUMCHAR), time.time() - t)
print(all_uuids[:10])

# Each returned row is a 1-element tuple with the (truncated) hash key.
assert all(len(uuid[0]) == NUMCHAR for uuid in all_uuids)
Output:
COUNT BOTH read=6714808 write=205454 0.8465695381164551
100%|███████████████████████████████| 205454/205454 [00:01<00:00, 204535.22it/s]
LIST ALL 205454 NEW UUIDS in 1.0100953578948975
100%|█████████████████████████████| 6714808/6714808 [00:12<00:00, 535991.18it/s]
LIST ALL 6714808 OLD UUIDS (ONLY FIRST 8 CHARS) in 12.544926643371582
[('0000000b',), ('0000029e',), ('00000571',), ('000006af',), ('00000815',), ('00000a24',), ('00000c2a',), ('00000c7e',), ('00000ec3',), ('0000113c',)]
100%|█████████████████████████████| 6714808/6714808 [00:12<00:00, 529509.90it/s]
LIST ALL 6714808 OLD UUIDS in 13.139374256134033
[('0000000b2775f652d71b1ec66477627d81c38ec65b4572f7af3de6fe103c4cab',), ('0000029ea9ed78cfa80c9da7f982657d6c4d85fa8646f0413a15c410462f7973',), ('00000571f966ac015fec7704982dbe9d43c390b1e532fada16f15a663f454cc2',), ('000006af96834385912af6cc93abdfa779d1da9173446c70cd56ca9ca17df8a0',), ('000008154aa9885b1b0e4de235c0b4232a0deb0107fdefd58281b8af0cfe0009',), ('00000a24c0fe196d892dc8c0a9930f81093d39902bc41e21266cb6ec8d001549',), ('00000c2a1a3c980a7ff90061abe63f31bc0e16bc18b815962eea151711aa3eac',), ('00000c7e4f1332f601c53ceaddb321d5505865756a591950da3ab683114cbdf8',), ('00000ec3686d0f887831830e15ce2ffe3d64bbb57fa071e2a0bc9afb196a8b15',), ('0000113c539a849568ac8eed03f261a5b8c74c3fdb093f17bc9c61d33907635e',)]
When comparing two containers to decide what to send to the other side, it becomes important to be able to check what is (or is not) already on the destination.
The following code checks the content of two (sorted and unique) iterators and returns, for each item, who has it, iterating only once over both, in alternation.
Also, this shows how `session.execute()` returns a true iterator, without pre-loading everything in memory.

Tasks:

- use `session.execute` instead of going via the ORM, for efficiency (e.g. in `list_all_objects`, see #69);
- `detect_where` should also get the loose objects (loose_left, loose_right), that are also checked (these might be sorted in memory to facilitate the work, and probably one can use recursion), where `merge_sorted_iterators` is a similar (but simpler) function that just iterates on both (assuming they are sorted) and yields the merged sorted list (see the sketch below). It makes the logic more convoluted, but we don't put any limitation. Also, this requires keeping in memory only the loose objects, something we do already, and that should be OK (if there are too many, we should ask the user to pack anyway).

Here is the function, and below the output (IMPORTANT: THE FUNCTION BELOW HAS A COUPLE OF BUGS; A CORRECT IMPLEMENTATION HAS BEEN PUT IN THE CODE IN 38471b6):
OUTPUT (run on a DB of 6.7M nodes, and a subset of it with ~200k nodes):