avalonmediasystem / avalon

Avalon Media System – Samvera Application
http://www.avalonmediasystem.org/
Apache License 2.0

Create Timestamp Value in Solr for Object Updates #5580

Open joncameron opened 5 months ago

joncameron commented 5 months ago

Description

There is not currently a value in Solr that reflects a "last updated" time for the record. For OAI-PMH harvesting or any data feed (like Avalon's Atom feed), it would be advantageous to have some type of date modified field in Solr that covers new records, records with updated resource description metadata, and records that have become public (that is, changes to access control). A timestamp in Solr that covers updates to any of the Fedora objects that make up the Avalon object would be a more elegant solution than relying on the current timestamp value, because the Solr timestamp value is updated every time the record gets re-indexed. Looking only at descriptive metadata updates, and not at updates to other parts of the record, could be a simpler way to implement this that may meet most use cases.

Reported by Josh Westgard of UMD during their project to do OAI-PMH harvesting with their Avalon instance.

Done Looks Like

elynema commented 4 months ago

This is also relevant for RMD crawls of MCO. RMD is using the same updated field, so it is going to try to crawl the entire site whenever a reindex is performed.

joncameron commented 4 months ago

The issue is that the media object has an update time, but that only happens when the metadata is updated. Changes for a section wouldn't change a media_object update timestamp. It could get tricky if there are changes to something but it doesn't change in the parent object. Could be hard.

joncameron commented 4 months ago

Where does this need to be surfaced? If just in OAI-PMH or Atom, there could be a way to do this with Solr queries—joins and other more complicated queries could make working through the object tree easier.

elynema commented 4 months ago

Could we implement a first-pass solution for RMD, to prevent a mass re-crawl when we reindex, where we only update the field when a new record is created or the metadata is updated? Will RMD need to know about changes to a section, etc.?

elynema commented 3 months ago

In discussing with Andrew re: RMD, he is monitoring for a change in title or other identifiers to update records in RMD. He uses the list of MDPI barcodes in 'other identifiers' to create a list of files in RMD. If it's too difficult for the next release to create a last updated date that covers any Fedora change, we could implement a simpler solution.

joncameron commented 3 months ago

How much effort would it be to update the timestamp when any aspect of the record has changed, beyond just a change to descriptive metadata?

cjcolvar commented 3 months ago

I made a variety of edits via the UI and each caused a reindex of the media object and thus an update of its timestamp field. Is the problem more that it gets updated every time the item is reindexed, so it might get updated even when the item itself hasn't been updated?

joncameron commented 2 months ago

This needs more discussion: a 2-3 person conversation, with technical time devoted to it, to figure out what the shape of the change should be. Can be discussed in swarm time next week (5/8).

elynema commented 1 month ago

Yes, the issue with the timestamp field is that it gets updated any time the record gets reindexed in Solr, including full reindexes. There is not another 'last updated' field in Solr.

There is a Fedora last updated field in the MODS document. Other RDF fields on the media object would be covered by the media object's last updated time in Fedora. Because this issue is focused on the descriptive metadata, surfacing the timestamp from Fedora for the MODS document should be sufficient.
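
A minimal sketch of the distinction described above, not Avalon's actual indexing code. `MediaObjectStub`, `to_solr_doc`, and the field name `descriptive_metadata_modified_dtsi` are hypothetical names chosen for illustration. The point is only that a value stamped at index time moves on every reindex, while a value copied from the stored modification time does not.

```ruby
require 'time'

# Hypothetical stand-in for a media object; this is not Avalon's model.
# metadata_modified_at plays the role of the Fedora last-updated time
# of the descriptive metadata.
MediaObjectStub = Struct.new(:id, :title, :metadata_modified_at)

# Build a Solr document as a plain hash.
# 'timestamp' is stamped with the time of indexing, so it changes on reindex.
# 'descriptive_metadata_modified_dtsi' (assumed field name) is copied from
# the object, so it only changes when the metadata itself changes.
def to_solr_doc(object, indexed_at: Time.now.utc)
  {
    'id' => object.id,
    'title_tesi' => object.title,
    'timestamp' => indexed_at.utc.iso8601,
    'descriptive_metadata_modified_dtsi' => object.metadata_modified_at.utc.iso8601
  }
end

# Illustrative values only.
obj = MediaObjectStub.new('abc123', 'Example title', Time.utc(2024, 1, 15, 12, 0, 0))

# Index the same, unchanged object twice, one month apart.
first  = to_solr_doc(obj, indexed_at: Time.utc(2024, 2, 1))
second = to_solr_doc(obj, indexed_at: Time.utc(2024, 3, 1))
```

Here `first['timestamp']` and `second['timestamp']` differ, while the hypothetical modified field is identical in both documents, which is the behavior a harvester would need.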

If we add an RDF field for whether an object is MDPI, changing that would not trigger the update date for the MODS document. We don't yet have a firm recommendation for disambiguating between MDPI and non-MDPI content in MCO, so we may have to iterate back to this work in the next release to address that.

joncameron commented 1 month ago
elynema commented 1 month ago

Looks like UMD is querying Solr directly to build an OAI-PMH response, which would require that this be available in the Solr response, not just in the Atom feed.

If the MODS doc has its own Solr document, just need to do a join query to get the date. If it's not in Solr already, can make a new field on the media object. Would require reindexing all media objects to get that value populated. Might be able to let the values be populated naturally when descriptive metadata is updated, but then consuming systems would have to ignore null value items as being not updated recently.
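
As a rough illustration of how a consuming system might ask Solr for recently modified records if the value were a field on the media object, the sketch below builds a select URL with a date range filter. The base URL, core name, and field name are all assumptions, not values taken from Avalon or the UMD server. A range filter only matches documents that have a value in the range, so records whose field has not been populated yet would simply be left out of the result.

```ruby
require 'uri'
require 'time'

# Assumed values for illustration; adjust to the real Solr core and field.
SOLR_SELECT    = 'http://localhost:8983/solr/avalon/select'
MODIFIED_FIELD = 'descriptive_metadata_modified_dtsi'

# Build a Solr select URL for records modified at or after `since`.
def modified_since_url(since, rows: 100)
  params = {
    'q'    => '*:*',
    # Date range filter: documents without the field never match.
    'fq'   => "#{MODIFIED_FIELD}:[#{since.utc.iso8601} TO *]",
    'fl'   => "id,#{MODIFIED_FIELD}",
    'sort' => "#{MODIFIED_FIELD} asc",
    'rows' => rows,
    'wt'   => 'json'
  }
  "#{SOLR_SELECT}?#{URI.encode_www_form(params)}"
end

url = modified_since_url(Time.utc(2024, 5, 1))
puts url
```

A real harvester would likely add a second filter to restrict results to media objects; that is omitted here because the model field name is not given in this thread.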

https://github.com/umd-lib/umd-oaipmh-server
https://github.com/umd-lib/umd-oaipmh-server/blob/main/src/oaipmh/solr.py

elynema commented 1 week ago

@joncameron I'm not sure how to access the Solr API directly to take a look at the output and make sure the new field is there and updated properly.

I did edit a resource description this morning and it passed through to the field in the Atom feed, but that's not a very thorough test. @cjcolvar Is there a way to reindex a specific record in Solr from the console and then we can double-check it doesn't change the field in the Atom feed?

cjcolvar commented 3 days ago

@elynema The only way I know of reindexing without saving is to do it from the rails console. MediaObject.find('id').update_index
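
One possible way to check the result, sketched under assumptions: capture the record's Solr document before and after running update_index, then compare the two. The helper `changed_fields` below is hypothetical, and the sample documents and the field name `descriptive_metadata_modified_dtsi` are made up for illustration; how the documents are fetched from Solr is left out.

```ruby
# Return the names of fields whose values differ between two Solr documents,
# including fields present in only one of them.
def changed_fields(before, after)
  (before.keys | after.keys).reject { |key| before[key] == after[key] }
end

# Illustrative documents, as they might look before and after a reindex
# with no metadata edit in between.
before = {
  'id' => 'abc123',
  'timestamp' => '2024-06-01T10:00:00Z',
  'descriptive_metadata_modified_dtsi' => '2024-05-20T09:30:00Z'
}
after = {
  'id' => 'abc123',
  'timestamp' => '2024-06-03T14:45:00Z',
  'descriptive_metadata_modified_dtsi' => '2024-05-20T09:30:00Z'
}

changed = changed_fields(before, after)
puts changed.inspect
```

If the new field behaves as intended, only the index-time timestamp should show up in the list of changed fields after a reindex without edits.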

elynema commented 2 days ago

@joncameron Are you able to take a look at Solr directly to verify the fields? I'm not sure what field exactly is being populated. Do we need to ask Chris or Mason or Dananji to update an item in the index so that we can test what happens in Solr?

joncameron commented 2 days ago

I can look directly at the Solr index and finish QA for this.