alexwlchan opened this issue 4 years ago
There's a proof of concept here: https://github.com/wellcometrust/archivematica/commit/8ff5af0e56daa9bcb6f96511c3207b9fdc8abb97. It still needs tests, comments, and probably to be applied to transferfiles as well (although I’m not sure when that index gets populated).
I had an AIP with 4381 files to index.
Without that patch (indexing one-by-one), it took 4862 seconds (~81 minutes). With that patch (indexing in bulk), it took 42 seconds.
Hi @alexwlchan, thank you. We made similar changes as part of https://github.com/artefactual/archivematica/pull/1463, but the commits in it don't seem to have made it to your fork. I'm pretty sure they are part of Archivematica 1.10. I'd be curious to hear your thoughts on whether they're equivalent or there is still some room for improvement.
Sorry, one correction! Our work in https://github.com/artefactual/archivematica/pull/1463 has not been included in a public release yet. It should be part of Archivematica 1.11.
Ah, I think we’ve pulled in changes from Archivematica 1.10, but not the development branch, which is why we haven’t seen this. Great minds think alike, eh? 😅
I know Helen is planning to bring in newer changes to our fork (and in general I want to start reducing the gap between the two) – I’ll have a look at #1463 and see how that compares to what I’ve done.
**Please describe the problem you'd like to be solved**

The "Index AIP" step is quite slow when you have a lot of files. I have a transfer with 4381 files (16000 files in the AIP with the Archivematica logs), and it's taken 81 minutes just on this step, and it's not done yet. I'd like it to be faster!
**Describe the solution you'd like to see implemented**

AFAICT, Archivematica is sending a single index request to Elasticsearch for every document it wants to store (i.e., every file in an AIP):
https://github.com/artefactual/archivematica/blob/862c4cdd780f1a235f20716d8472d34d9b56094f/src/archivematicaCommon/lib/elasticSearchFunctions.py#L770-L789
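For illustration, the per-document pattern looks roughly like this (a hedged sketch, not the actual Archivematica code; the function and index names here are invented):

```python
def index_files_one_by_one(index_fn, file_docs):
    """Send one index request per document.

    ``index_fn`` stands in for a call like ``client.index(...)`` from
    elasticsearch-py: every document costs a full HTTP round trip, so
    an AIP with 4381 files issues 4381 separate requests.
    """
    for doc in file_docs:
        index_fn(index="aipfiles", body=doc)
```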
I suspect this is a major source of the slowness.
By using the Elasticsearch bulk APIs, I think there'd be a significant speedup in this step – it means fewer network requests and less traffic, and Elasticsearch can handle bulk ingests very efficiently.
See: https://elasticsearch-py.readthedocs.io/en/master/helpers.html#bulk-helpers
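A minimal sketch of the bulk approach using `elasticsearch.helpers.bulk` (the index name and document shape are assumptions for illustration, not Archivematica's actual schema):

```python
def generate_actions(file_docs, index_name="aipfiles"):
    """Yield one bulk "action" dict per file document.

    helpers.bulk() consumes this generator and batches the actions
    into a handful of large HTTP requests, instead of one request
    per document.
    """
    for doc in file_docs:
        yield {"_index": index_name, "_source": doc}

def bulk_index(client, file_docs, index_name="aipfiles"):
    # elasticsearch-py's bulk helper; returns (success_count, errors)
    from elasticsearch import helpers
    return helpers.bulk(client, generate_actions(file_docs, index_name))

# Usage (assumes a reachable cluster):
#   from elasticsearch import Elasticsearch
#   client = Elasticsearch("http://localhost:9200")
#   bulk_index(client, docs_for_aip)
```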
**Describe alternatives you've considered**

None.
**Additional context**

We went through this problem in a different context at Wellcome a year or so back – we were indexing a large number of documents into Elasticsearch one at a time, and started to hit speed issues. Switching to bulk indexing made the ingests faster and more reliable.