CDLUC3 / ezid


Initial pass of OpenSearch document and index #604

Closed sfisher closed 4 months ago

sfisher commented 6 months ago

IDK how we want to handle this since it isn't a fully finished feature, but merging into develop (or even main) shouldn't affect other working code right now.

This essentially replicates the search information that is currently in the database, except for a couple of things I didn't think were used for search. There may be some changes and optimizations we want to make, which I think will become more obvious when working through the UI and API areas that use search, and we may do some more revisions to the doc format at that point.

Things in this PR:

Basic generation of search document information

Manual test example

import impl.open_search as os
from ezidapp.models.identifier import Identifier
open_s = os.OpenSearch(identifier=Identifier.objects.get(identifier='doi:10.25338/B8JG7X'))
my_dict = open_s.dict_for_identifier()  # this gives a dict version of the document

open_s.index_document()  # this indexes the document (i.e., adds or updates it in the OpenSearch index)

Automated basic unit tests

pytest --ds=settings.tests tests/test_open_search.py

Script to update search index based on database information

The script will go through and add/update all items from the database into OpenSearch. It uses OpenSearch bulk update functionality and I had to add some workarounds for loading all records from

You can give it a primary id as an argument and it will start with the documents after that.

python manage.py opensearch-update
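As a rough illustration of the "start after a given id, then bulk update" idea described above, here is a minimal, self-contained sketch. It is not the script in this PR: `batches_after`, `bulk_body`, the index name `identifiers-sketch`, and the record fields are all hypothetical, and the records are plain dicts rather than database rows. The one thing taken from OpenSearch itself is the bulk request shape, where each document is sent as an action line followed by a source line in newline-delimited JSON.

```python
import json
from typing import Dict, Iterable, Iterator, List


def batches_after(
    records: Iterable[Dict], start_after_id: int = 0, batch_size: int = 2
) -> Iterator[List[Dict]]:
    """Yield records whose id is greater than start_after_id, in id order,
    grouped into batches of at most batch_size.

    Hypothetical helper: a real script would more likely page through an
    ordered database query than sort an in-memory list.
    """
    batch: List[Dict] = []
    for record in sorted(records, key=lambda r: r["id"]):
        if record["id"] <= start_after_id:
            continue  # skip everything up to and including the starting id
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly short, batch


def bulk_body(batch: List[Dict], index_name: str = "identifiers-sketch") -> str:
    """Build a newline-delimited bulk request body for one batch.

    The "index" action adds the document or replaces an existing one with
    the same _id. Using the identifier string as _id is an assumption.
    """
    lines = []
    for record in batch:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": record["identifier"]}}))
        lines.append(json.dumps({"id": record["id"], "identifier": record["identifier"]}))
    return "\n".join(lines) + "\n"  # bulk bodies end with a newline


# Made-up records standing in for database rows.
records = [{"id": i, "identifier": f"ark:/99999/fk4sketch{i}"} for i in range(1, 6)]
all_batches = list(batches_after(records, start_after_id=2))
first_body = bulk_body(all_batches[0])
```

Working in fixed-size batches keeps memory use bounded, and because the batches are ordered by id, an interrupted run could in principle be resumed by passing the last processed id as the starting point.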

In the future we may want to add another argument that reindexes only the items changed after a certain date, so that a run updates new items only.
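One possible shape for such an argument is sketched below with plain `argparse`. The flag name `--updated-since`, the positional `starting_id`, the `update_time` field, and the helper names are all hypothetical; nothing here exists in the script yet, and the records are stand-in dicts rather than model instances.

```python
import argparse
from datetime import datetime
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argument layout: an optional starting id plus an
    optional cutoff date given in ISO format (for example 2024-01-15)."""
    parser = argparse.ArgumentParser(prog="opensearch-update-sketch")
    parser.add_argument(
        "starting_id", nargs="?", type=int, default=0,
        help="only process records with an id greater than this value",
    )
    parser.add_argument(
        "--updated-since", type=datetime.fromisoformat, default=None,
        help="only process records updated on or after this date",
    )
    return parser


def select_records(records: List[Dict], args: argparse.Namespace) -> List[Dict]:
    """Apply the id filter, then the optional date filter."""
    selected = [r for r in records if r["id"] > args.starting_id]
    if args.updated_since is not None:
        selected = [r for r in selected if r["update_time"] >= args.updated_since]
    return selected


# Made-up records with an assumed update_time field.
records = [
    {"id": 1, "update_time": datetime(2023, 12, 1)},
    {"id": 2, "update_time": datetime(2024, 2, 1)},
    {"id": 3, "update_time": datetime(2024, 3, 1)},
]

by_date = select_records(records, build_parser().parse_args(["--updated-since", "2024-01-15"]))
by_id = select_records(records, build_parser().parse_args(["2"]))
```

Keeping the two filters independent means the existing "start after this id" behavior would be unchanged when the date argument is omitted.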


These changes are related to tickets #590, #591, #592.

sfisher commented 5 months ago

I updated the whole document structure, as you can see in the tests. It now has more nesting levels: a nested resource object replaces flat fields like resource_creators and resource_title, so it's not repetitive.

I'm also including a tiny bit of metadata from some of the other relations, like owner, ownergroup, and profile, instead of just the foreign key id. These are also nested. I don't think most of these are used in search now, but I can imagine it might be useful to search for an email or name or something in OpenSearch.
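To make the flat-versus-nested difference concrete, here is a small sketch. Only the names resource_creators, resource_title, and owner come from the description above; the helper `nest_by_prefix`, the `id` field, and the contents of the nested owner object are assumptions for illustration and may not match the actual document format in the tests.

```python
from typing import Any, Dict, Tuple


def nest_by_prefix(flat: Dict[str, Any], prefixes: Tuple[str, ...] = ("resource",)) -> Dict[str, Any]:
    """Group keys such as resource_title under a nested "resource" object.

    Hypothetical helper, shown only to illustrate the structural change.
    """
    nested: Dict[str, Any] = {}
    for key, value in flat.items():
        for prefix in prefixes:
            marker = prefix + "_"
            if key.startswith(marker):
                # resource_title -> nested["resource"]["title"]
                nested.setdefault(prefix, {})[key[len(marker):]] = value
                break
        else:
            nested[key] = value  # keys without a known prefix stay at top level
    return nested


# Flat form: repeated prefix, and the owner is only a foreign key id.
flat_doc = {
    "id": "doi:10.5072/FK2SKETCH",
    "resource_creators": ["Doe, Jane"],
    "resource_title": "An example title",
    "owner": 7,
}

nested_doc = nest_by_prefix(flat_doc)

# Replace the bare foreign key with a small nested object (assumed fields).
nested_doc["owner"] = {"id": 7, "username": "example_user", "email": "user@example.org"}
```

With this shape, a search on something like the owner's email or name can target a nested field directly instead of requiring a lookup by foreign key.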

sfisher commented 4 months ago

I think everything in here is in #649 so I can close this one.