Open joncameron opened 5 months ago
This is also relevant for RMD crawls of MCO. RMD is using the same updated field, so it is going to try to crawl the entire site whenever a reindex is performed.
The issue is that the media object has an update time, but that only changes when the metadata is updated. Changes to a section wouldn't change the media_object update timestamp. It could get tricky when a child object changes but nothing changes on the parent object. Could be hard.
Where does this need to be surfaced? If just in OAI-PMH or Atom, there could be a way to do this with Solr queries—joins and other more complicated queries could make working through the object tree easier.
Could we implement a first-pass solution for RMD that prevents a mass re-crawl when we reindex, by updating the field only when a new record is created or the metadata is updated? Will RMD need to know about changes to a section, etc.?
In discussing with Andrew re: RMD, he is monitoring for a change in title or other identifiers to update records in RMD. He uses the list of MDPI barcodes in 'other identifiers' to create a list of files in RMD. If it's too difficult for next release to create a last updated date that covers any Fedora change, we could implement a simpler solution.
How much effort would it be to update the timestamp when any aspect of the record has changed, beyond just a change to descriptive metadata?
I made a variety of edits via the UI and each caused a reindex of the media object and thus an update of its timestamp field. Is the problem more that it gets updated every time the item is reindexed, so it might get updated even when the item itself hasn't been updated?
This needs more discussion and could be a 2-3 person conversation and technical time devoted to it to figure out what the shape of the change should be. Can be discussed in swarm time next week (5/8).
Yes, the issue with the timestamp field is that it gets updated any time the record gets reindexed in Solr, including full reindexes.
There is not another 'last updated' field in Solr.
There is a Fedora last updated field in the MODS document. Other RDF fields on the media object would be covered by the media object's last updated time in Fedora. Because this issue is focused on the descriptive metadata, surfacing the timestamp from Fedora for the MODS document should be sufficient.
If we add an RDF field for whether an object is MDPI, changing that would not trigger the update date for the MODS document. We don't yet have a firm recommendation for disambiguating between MDPI and non-MDPI content in MCO, so we may have to iterate back to this work in the next release to address that.
Looks like UMD is querying Solr directly to build an OAI-PMH response, which would require that this be available in the Solr response, not just in the Atom feed.
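For reference, a harvester that queries Solr directly would presumably filter on the new date field with a standard Solr range query. Below is a rough sketch of building that filter from OAI-PMH style from/until bounds. The helper name and the field name `descMetadata_modified_dtsi` are made up for illustration; the actual field name has not been decided in this thread.

```ruby
require 'time'

# Hypothetical field name, used only for illustration.
MODIFIED_FIELD = 'descMetadata_modified_dtsi'

# Build a Solr range filter such as "field:[2024-04-01T00:00:00Z TO *]".
# A missing bound becomes "*", which Solr treats as open-ended.
def modified_range_fq(field, from: nil, until_time: nil)
  lower = from ? from.utc.iso8601 : '*'
  upper = until_time ? until_time.utc.iso8601 : '*'
  "#{field}:[#{lower} TO #{upper}]"
end

puts modified_range_fq(MODIFIED_FIELD, from: Time.utc(2024, 4, 1))
```

Because the filter targets a field that only changes on real metadata updates, a full reindex would not push every record back into the harvest window the way the current timestamp field does.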
If the MODS doc has its own Solr document, just need to do a join query to get the date. If it's not in Solr already, can make a new field on the media object. Would require reindexing all media objects to get that value populated. Might be able to let the values be populated naturally when descriptive metadata is updated, but then consuming systems would have to ignore null value items as being not updated recently.
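If we go the "populate naturally" route, the consuming side would need something like the sketch below, which treats records with no value as not recently updated. This assumes Solr documents arrive as plain hashes and uses a made-up field name (`descMetadata_modified_dtsi`); neither is confirmed by anything in this thread.

```ruby
require 'time'

# Hypothetical field name, used only for illustration.
MODIFIED_FIELD = 'descMetadata_modified_dtsi'

# Keep only the docs whose descriptive metadata changed at or after `since`.
# Docs without a value (not yet repopulated) count as "not updated recently".
def recently_updated(docs, since)
  docs.select do |doc|
    value = doc[MODIFIED_FIELD]
    next false if value.nil? || value.empty?

    Time.iso8601(value) >= since
  end
end

docs = [
  { 'id' => 'a', MODIFIED_FIELD => '2024-05-01T12:00:00Z' },
  { 'id' => 'b' }, # never populated
  { 'id' => 'c', MODIFIED_FIELD => '2024-03-01T12:00:00Z' }
]
since = Time.iso8601('2024-04-01T00:00:00Z')
puts recently_updated(docs, since).map { |d| d['id'] }.inspect
```

The trade-off is that a record edited before the field existed stays invisible to harvesters until its metadata is saved again, which is why a one-time full reindex may still be the safer option.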
https://github.com/umd-lib/umd-oaipmh-server https://github.com/umd-lib/umd-oaipmh-server/blob/main/src/oaipmh/solr.py
@joncameron I'm not sure how to access the Solr API directly to take a look at the output and make sure the new field is there and updated properly.
I did edit a resource description this morning and it passed through to the
@elynema The only way I know of reindexing without saving is to do it from the rails console.
MediaObject.find('id').update_index
@joncameron Are you able to take a look at Solr directly to verify the fields? I'm not sure what field exactly is being populated. Do we need to ask Chris or Mason or Dananji to update an item in the index so that we can test what happens in Solr?
I can look directly at Solr index and finish QA for this.
Description
There is not currently a value in Solr that reflects a "last updated" time for updates to the record. For OAI-PMH harvesting or any data feed (like Avalon's Atom feed), it would be advantageous to have some type of date modified field in Solr that will cover new records, records with updated resource description metadata, or records that have become public (so changes to access control). Having a timestamp in Solr to cover the updates to any of the Fedora objects that make up the Avalon object would be a more elegant solution than relying on the current timestamp value, as every time the record gets re-indexed the Solr timestamp value is updated. Looking only at descriptive metadata updates, and not updates to other parts of the record, could be a simpler way to implement this that may meet most use cases.
Reported by Josh Westgard of UMD during their project to do OAI-PMH harvesting with their Avalon instance.
Done Looks Like