alexwlchan opened this issue 4 years ago
There's a proof of concept here: https://github.com/wellcometrust/archivematica/commit/8ff5af0e56daa9bcb6f96511c3207b9fdc8abb97. It still needs tests, comments, and probably to be applied to transferfiles as well (although I’m not sure when that index gets populated).
I had an AIP with 4381 files to index.
Without that patch (indexing one-by-one), it took 4862 seconds (~81 minutes). With that patch (indexing in bulk), it took 42 seconds.
Hi @alexwlchan, thank you. We made similar changes as part of https://github.com/artefactual/archivematica/pull/1463, but the commits in it don't seem to have made it to your fork. I'm pretty sure they are part of Archivematica 1.10. I'd be curious to hear your thoughts on whether they're equivalent or there is still some room for improvement.
Sorry, one correction! Our work in https://github.com/artefactual/archivematica/pull/1463 has not been included in a public release yet. It should be part of Archivematica 1.11.
Ah, I think we’ve pulled in changes from Archivematica 1.10, but not the development branch, which is why we haven’t seen this. Great minds think alike, eh? 😅
I know Helen is planning to bring in newer changes to our fork (and in general I want to start reducing the gap between the two) – I’ll have a look at #1463 and see how that compares to what I’ve done.
**Please describe the problem you'd like to be solved**

The "Index AIP" step is quite slow when you have a lot of files. I have a transfer with 4381 files (16000 files in the AIP with the Archivematica logs), and it's taken 81 minutes just on this step, and it's not done yet. I'd like it to be faster!
**Describe the solution you'd like to see implemented**

AFAICT, Archivematica is sending a single index request to Elasticsearch for every document it wants to store (i.e., every file in an AIP):
https://github.com/artefactual/archivematica/blob/862c4cdd780f1a235f20716d8472d34d9b56094f/src/archivematicaCommon/lib/elasticSearchFunctions.py#L770-L789
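For illustration, the per-document pattern looks roughly like this (a hedged sketch, not the actual Archivematica code; the function and index names here are invented):

```python
def index_files_one_by_one(index_fn, file_docs):
    """Send one index request per document.

    ``index_fn`` stands in for a call like ``client.index(...)`` from
    elasticsearch-py: every document costs a full HTTP round trip, so
    an AIP with 4381 files issues 4381 separate requests.
    """
    for doc in file_docs:
        index_fn(index="aipfiles", body=doc)
```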
I suspect this is a major source of the slowness.
By using the Elasticsearch bulk APIs, I think there'd be a significant speedup in this step – it means fewer network requests and less traffic, and Elasticsearch can handle bulk ingests very efficiently.
See: https://elasticsearch-py.readthedocs.io/en/master/helpers.html#bulk-helpers
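A minimal sketch of the bulk approach using `elasticsearch.helpers.bulk` (the index name and document shape are assumptions for illustration, not Archivematica's actual schema):

```python
def generate_actions(file_docs, index_name="aipfiles"):
    """Yield one bulk "action" dict per file document.

    helpers.bulk() consumes this generator and batches the actions
    into a handful of large HTTP requests, instead of one request
    per document.
    """
    for doc in file_docs:
        yield {"_index": index_name, "_source": doc}

def bulk_index(client, file_docs, index_name="aipfiles"):
    # elasticsearch-py's bulk helper; returns (success_count, errors)
    from elasticsearch import helpers
    return helpers.bulk(client, generate_actions(file_docs, index_name))

# Usage (assumes a reachable cluster):
#   from elasticsearch import Elasticsearch
#   client = Elasticsearch("http://localhost:9200")
#   bulk_index(client, docs_for_aip)
```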
**Describe alternatives you've considered**

None.
**Additional context**

We went through this problem in a different context at Wellcome a year or so back – we were indexing a large number of documents into Elasticsearch one at a time, and started to hit speed issues. Switching to bulk indexing made the ingests faster and more reliable.