OregonDigital / OD2

Next generation of Oregon Digital ( https://oregondigital.org ) digital collections platform, built on Samvera Hyrax ( https://github.com/samvera/hyrax/ )

Indexing works should attempt to fetch from cache before starting async fetch job #1207

Open CGillen opened 4 years ago

CGillen commented 4 years ago

Descriptive summary

When a work is saved, the Solr document is reindexed as a blocking process. We initially broke the reindex into two steps: a blocking step for all metadata that is readily available from the form, and an asynchronous step for fetching labels for controlled vocabulary metadata. This resulted in a temporary phase where the work had SOME but not ALL of its metadata visible.

The asynchronous part was fine for works being ingested and reviewed, and for the occasional save while fixing something. However, saving also occurs when a work is added to a collection, and this can be done at any time by any user, including non-admins, which meant there could be extended periods where metadata would just disappear from works.

The solution has two parts: #1205, to speed up fetching and incrementally add metadata labels as we fetch them, and THIS ticket.

When a work is saved, the graph cache (currently Blazegraph) should be queried first, during the blocking phase. If the cache does not contain a label, we should move that fetch off to the asynchronous job. We examined the possibility of using Solr Atomic Updates, which might have allowed us to fire off one small asynchronous job per miss, but our setup does not allow this. Our solution is instead to move the responsibility for firing off the job a little higher up and batch all of the misses for a save together.

Expected behavior

When a work is saved, it attempts to fetch metadata labels from Blazegraph FIRST, during the blocking phase. For any labels that are not in Blazegraph, those fetches are batched together for an asynchronous job.
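A minimal sketch of the expected behavior, in plain Ruby. This is not OD2 code: `LabelResolver`, the hash standing in for Blazegraph, and the `enqueue` callable standing in for the asynchronous job are all hypothetical names used only for illustration.

```ruby
# Hypothetical sketch: cache-first label lookup with batched misses.
# None of these names come from OD2 or Hyrax.
class LabelResolver
  Result = Struct.new(:labels, :misses)

  # cache:   any object responding to #[] (URI => label or nil); stands in for Blazegraph
  # enqueue: a callable that receives an array of missed URIs; stands in for the async job
  def initialize(cache:, enqueue:)
    @cache = cache
    @enqueue = enqueue
  end

  # Blocking phase: look up every URI in the cache, collect the misses,
  # and hand them off once as a single batch.
  def resolve(uris)
    labels = {}
    misses = []
    uris.uniq.each do |uri|
      label = @cache[uri]
      if label
        labels[uri] = label
      else
        misses << uri
      end
    end
    @enqueue.call(misses) unless misses.empty?
    Result.new(labels, misses)
  end
end

# Example usage with made-up URIs.
cache = { 'http://example.org/a' => 'Label A' }
queued = []
resolver = LabelResolver.new(cache: cache, enqueue: ->(batch) { queued << batch })
result = resolver.resolve(
  ['http://example.org/a', 'http://example.org/b', 'http://example.org/c']
)
```

Here the hit is available immediately for the blocking index, and the two misses reach the job as one batch rather than as two separate jobs.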

Related work

#1206 is step one fix

#1198 Pre-discussion

#1196 temporary fix

#1277 enhancement for dup indexes

Accessibility Concerns

CGillen commented 4 years ago

This issue was discussed in the #od2-developers channel on 07/19/2020 and in stand-ups with the metadata team present. An ADR is to come.

CGillen commented 4 years ago

From the #1198 pre-discussion, @straleyb said:

If there is any info we would like to catalog, this is the place to do it. My proposed fix was to keep a second copy of the metadata around while an object is being updated, added to a collection, or similar. Any time a reindex occurs, the data won't show up until some time afterwards. But by keeping a second record in Solr around, we can use that until the update finishes, then add the new data to the old Solr record.

And @CGillen replied:

An idea I had was to merge the current Solr document and the one being generated by the indexer (before the fetch job is started). While generating the replacement Solr document, we can look at the current Solr document and match up any URIs that are in both, then pull down the corresponding label.
This would get us temporary labels while the authorities/Blazegraph are being hit, and any new or updated labels should come through when the async job finishes.

https://github.com/samvera/hyrax/blob/v2.7.2/app/indexers/hyrax/work_indexer.rb#L7
This is basically where we would want to do that work.
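A rough sketch of that merge idea, using plain hashes in place of Solr documents. The field names (`creator_uri`, `creator_label`) and the assumption that URIs and labels are stored as parallel arrays are hypothetical and do not reflect the actual OD2 schema.

```ruby
# Hypothetical sketch: carry labels over from the current Solr document
# for any URI that also appears in the document being generated.
# field_pairs maps a URI field name to its label field name (both made up here).
def carry_over_labels(current_doc, new_doc, field_pairs)
  merged = new_doc.dup
  field_pairs.each do |uri_field, label_field|
    # Build a URI => label lookup from the document currently in Solr.
    known = Array(current_doc[uri_field]).zip(Array(current_doc[label_field])).to_h
    # Only fill in labels the new document does not already have.
    merged[label_field] ||= Array(new_doc[uri_field]).map { |uri| known[uri] }.compact
  end
  merged
end

# Example usage with made-up values.
current = { 'creator_uri' => %w[u1 u2], 'creator_label' => %w[One Two] }
incoming = { 'creator_uri' => %w[u2 u3] }
merged = carry_over_labels(current, incoming, { 'creator_uri' => 'creator_label' })
```

In this example `u2` keeps its old label as a temporary value, while `u3` has no label until the async fetch completes.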

CGillen commented 4 years ago

https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-OptimisticConcurrency