CDLUC3 / ezid

MIT License

OR on both ResourceType.General and ResourceType (free text) #690

Closed sfisher closed 1 week ago

sfisher commented 1 month ago

For a query such as "Image", the current system returns records that match on ResourceTypeGeneral and also records that match on the free-text ResourceType. The OpenSearch system only checks ResourceTypeGeneral.

This isn't strictly wrong, IMO, but we can make it behave the same by doing an OR across the two fields, so that a match on either one returns the record.
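
As a rough illustration of that OR, here is a minimal sketch of an OpenSearch `bool` query body with two `should` clauses. The field names `resource_type_general` and `resource_type` are assumptions made for the example, not the names used in the EZID index, and the helper `build_resource_type_query` is hypothetical.

```python
# Hypothetical field names; the real EZID index mapping may differ.
GENERAL_FIELD = "resource_type_general"
FREE_TEXT_FIELD = "resource_type"


def build_resource_type_query(term):
    """Build a query body that matches a record if EITHER field matches `term`."""
    return {
        "query": {
            "bool": {
                # `should` clauses are ORed together
                "should": [
                    {"match": {GENERAL_FIELD: term}},
                    {"match": {FREE_TEXT_FIELD: term}},
                ],
                # require at least one of the clauses to match
                "minimum_should_match": 1,
            }
        }
    }


body = build_resource_type_query("Image")
```

The resulting dict could be passed as the request body of a search call. Setting `minimum_should_match` explicitly keeps the OR semantics even if the `bool` query later gains `must` or `filter` clauses.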

Jing mentioned that records whose ResourceType text matches a ResourceTypeGeneral value are often old records created under an older DataCite schema, before ResourceTypeGeneral was the preferred way to indicate the category of resource.

If these old records are ever migrated forward, EZID may want to consider also setting ResourceTypeGeneral to the correct value, where it makes sense to do so.
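
A minimal sketch of what such a backfill could look like, assuming records are plain dicts with `resourceType` and `resourceTypeGeneral` keys. Both the record shape and the `backfill_general` helper are hypothetical, and `GENERAL_VALUES` is only a small subset of the DataCite controlled list, chosen for illustration.

```python
# Illustrative subset of DataCite resourceTypeGeneral values, not the full list.
GENERAL_VALUES = ("Dataset", "Image", "Software", "Text")


def backfill_general(record):
    """Return a copy of `record` with resourceTypeGeneral filled in when it is
    empty and the free-text resourceType exactly matches a controlled value
    (ignoring case and surrounding whitespace)."""
    if record.get("resourceTypeGeneral"):
        return dict(record)  # already set; leave it alone
    text = (record.get("resourceType") or "").strip().lower()
    for value in GENERAL_VALUES:
        if text == value.lower():
            return {**record, "resourceTypeGeneral": value}
    return dict(record)  # no confident match; do not guess


migrated = backfill_general({"resourceType": "image"})
```

Only exact matches are promoted here, so free text that merely mentions a type is left untouched rather than guessed at.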

sfisher commented 1 month ago

I believe some legacy data in the database differs from how it was translated into the searchable resource types in the past. I'm not sure how these differences come about, since both systems use "searchable_resource_type", the odd one- or two-letter code that gets assigned (for example, "Im").

I thought maybe the old search system used a different field, but it uses the same one. https://github.com/CDLUC3/ezid/blob/7d233d3b0122ea80bc359761f7b71c20f38ac176/impl/search_util.py#L388

The function I use to populate it reads validatedType from the Kernel metadata. It is in the OpenSearchDoc class (you can see it in my PR).

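    def searchable_resource_type(self):
        # validatedType comes from the Kernel metadata; it may be None
        t = self.km.validatedType
        # look up the short code for the part before any "/", or '' if no type is set
        return validation.resourceTypes[t.split("/")[0]] if t is not None else ''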

I believe the search table's field is populated by this code, which looks like it uses the same logic as what I'm using for OpenSearch.

https://github.com/CDLUC3/ezid/blob/7d233d3b0122ea80bc359761f7b71c20f38ac176/ezidapp/models/identifier.py#L969
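
To make the shared lookup easier to reason about, here is a standalone sketch of the same logic outside the class. `RESOURCE_TYPE_CODES` is a stand-in for `validation.resourceTypes`. Only the "Im" code is taken from this thread, and the "Image/Photograph" input is a made-up example of a type with a "/" in it.

```python
# Stand-in for validation.resourceTypes; only the "Im" code comes from this thread.
RESOURCE_TYPE_CODES = {"Image": "Im"}


def searchable_resource_type(validated_type):
    """Map a validated type string to its short searchable code.

    Only the part before the first "/" is used for the lookup.
    Returns '' when no type is set.
    """
    if validated_type is None:
        return ""
    general = validated_type.split("/")[0]
    # Like the original, this raises KeyError for a general type with no code.
    return RESOURCE_TYPE_CODES[general]


code = searchable_resource_type("Image/Photograph")
```

Because the lookup depends only on validatedType, a record whose type information lives solely in the free-text field gets an empty code from this function, whichever system calls it.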

So I'm not sure why this is different or how to fix this when the code is the same. Maybe some old process was different and populated some things into the search table differently in the past?

If it's important to try and capture these items with odd resource types I can add additional logic.

I thought it was weird they were slightly different when the logic was the same.

adambuttrick commented 1 week ago

Search was updated to be more general so that it imitates the database behavior across record versions, combining multiple ORs to cover as wide a range as possible.