Initial pass of OpenSearch document and index

sfisher commented 6 months ago

IDK how we want to handle this since it isn't a fully finished feature, but merging into develop (or even main) shouldn't affect other working code right now.

This essentially replicates the search information that is currently in the database, except for a couple of things I didn't think were used for search. There may be some changes and optimizations we want to make, which I think will become more obvious when working through the UI and API areas that use search, and we may do some more revisions to the doc format.

Things in this PR:

Basic generation of search document information

Manual test example

import impl.open_search as os
from ezidapp.models.identifier import Identifier
open_s = os.OpenSearch(identifier=Identifier.objects.get(identifier='doi:10.25338/B8JG7X'))
my_dict = open_s.dict_for_identifier()  # this gives a dict version of the document

open_s.index_document()  # this indexes the document (i.e. adds or updates it in the OpenSearch index)

Automated basic unit tests

pytest  --ds=settings.tests tests/test_open_search.py

Script to update search index based on database information

The script will go through and add/update all items from the database into OpenSearch. It uses OpenSearch bulk update functionality and I had to add some workarounds for loading all records from

You can give it a primary id as an argument and it will start with the documents after that.

python manage.py opensearch-update

In the future we may want to add another argument that reindexes everything after a certain date instead, so that only new items are updated.
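As a rough illustration of the "start after a primary id, then bulk update in batches" idea, here is a minimal sketch that does not touch a database or an OpenSearch server. The function names, the index name, the batch size, and the choice of the identifier string as the document `_id` are all assumptions for illustration, not what the script actually does. The action dicts are only shaped like the ones opensearch-py's bulk helper accepts.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def records_after(records: Iterable[dict], start_after_id: int = 0) -> Iterator[dict]:
    """Yield records in primary-key order, skipping ids <= start_after_id."""
    for rec in sorted(records, key=lambda r: r["id"]):
        if rec["id"] > start_after_id:
            yield rec


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group items into lists of at most `size` so memory use stays bounded."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_actions(batch: List[dict], index_name: str = "example-index") -> List[dict]:
    """Build one 'index' action per record (index name and _id choice are hypothetical)."""
    return [
        {"_op_type": "index", "_index": index_name, "_id": rec["identifier"], "_source": rec}
        for rec in batch
    ]


# Placeholder rows standing in for database records.
rows = [{"id": i, "identifier": f"ark:/99999/fk4{i}"} for i in range(1, 8)]

# Resume after primary id 2 and build batches of two actions each.
batches = [bulk_actions(b) for b in batched(records_after(rows, start_after_id=2), size=2)]
```

Each list in `batches` could then be handed to a bulk call one at a time, so a failure partway through only requires restarting from the last primary id that was processed.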

Adds things to SSM (already done on dev/stg and has placeholder values on production which we'll need to update once we get an OpenSearch server for that environment).
Fixed a problem where the update script would run out of memory at around 150,000 records. It turns out that Python's functools.lru_cache does not work like a typical memoization library: the cache holds references to cached arguments and results, so in some circumstances memory isn't freed when those objects are otherwise no longer in use. I had to revert to the default params rather than the Copilot-suggested settings, which filled up memory. :-(
The script seems to be running fast on dev now and doesn't have memory problems.
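To make the lru_cache point concrete, here is a small standalone sketch comparing an unbounded cache with the default parameters (`maxsize=128`). The unbounded setting is only an assumed example of a memory-hungry configuration. It is not necessarily what was suggested or used in the script, and the function names are made up.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded: every distinct argument and result stays referenced
def lookup_unbounded(pk: int) -> dict:
    return {"pk": pk, "payload": "x" * 100}


@lru_cache()  # default parameters: maxsize=128, least recently used entries are evicted
def lookup_default(pk: int) -> dict:
    return {"pk": pk, "payload": "x" * 100}


# Simulate walking through many records with distinct keys.
for pk in range(1000):
    lookup_unbounded(pk)
    lookup_default(pk)

unbounded_size = lookup_unbounded.cache_info().currsize  # grows with every new key
default_size = lookup_default.cache_info().currsize      # capped at maxsize

# An unbounded cache only lets go of its entries when it is cleared explicitly.
lookup_unbounded.cache_clear()
cleared_size = lookup_unbounded.cache_info().currsize
```

With a full pass over the database, nearly every call has a new key, so an unbounded cache mostly just accumulates entries, while the bounded default keeps the footprint flat.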

These changes are related to tickets #590, #591, #592.

sfisher commented 5 months ago

I updated the whole document structure, as you can see in the tests. It now has more nesting levels -- a nested resource instead of flat fields like resource_creators and resource_title -- so it's not repetitive.

I'm also including a tiny bit of metadata from some of the other relations, like owner, ownergroup, and profile, instead of just the foreign key id. These are also nested. I don't think most of these are used in search now, but I can imagine it being useful to search for an email or name in OpenSearch.
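For anyone who wants to picture the change in shape, here is a minimal sketch of going from flat, prefixed fields to a nested object. Only the field names resource_creators and resource_title come from the description above. The helper name, the other keys, and all values are placeholders, and the real document layout is the one in the tests.

```python
def nest_prefixed(flat: dict, prefixes: tuple = ("resource",)) -> dict:
    """Move keys like 'resource_title' under a nested 'resource' dict as 'title'."""
    nested: dict = {}
    for key, value in flat.items():
        for prefix in prefixes:
            if key.startswith(prefix + "_"):
                # Strip the 'prefix_' part and store the value one level down.
                nested.setdefault(prefix, {})[key[len(prefix) + 1:]] = value
                break
        else:
            # Keys without a known prefix stay at the top level.
            nested[key] = value
    return nested


# Placeholder values, not real EZID data.
flat_doc = {
    "identifier": "doi:10.25338/B8JG7X",
    "resource_creators": ["Placeholder, A."],
    "resource_title": "Placeholder title",
    "owner_id": 1,
}

nested_doc = nest_prefixed(flat_doc)
```

Here `nested_doc["resource"]` holds `creators` and `title`, while `identifier` and `owner_id` are left at the top level because only the `resource` prefix was requested.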

sfisher commented 4 months ago

I think everything in here is in #649 so I can close this one.

CDLUC3 / ezid

Initial pass of OpenSearch document and index #604
