avalonmediasystem / avalon

Avalon Media System – Samvera Application
http://www.avalonmediasystem.org/
Apache License 2.0

Check that section_list migration persisted to Fedora correctly #6033

Closed cjcolvar closed 2 days ago

cjcolvar commented 1 month ago

Description

While looking at the Fedora 6 migrated OCFL data on disk on avalon-dev as part of #5978, I noticed that the section list triple was missing from the item I spot checked. It appears that the item had been indexed with the section list, but the section list hadn't been persisted to Fedora. However, I spot checked a couple of items on MCO and they both had the section list in Fedora. We should quickly investigate why section lists are missing on avalon-dev and whether it is a problem with the section list migration.

Done Looks Like

joncameron commented 1 month ago

Could be a data issue on avalon-dev rather than a widespread issue, but worth spot checking.

cjcolvar commented 2 weeks ago

I checked avalon-staging, ijccr-staging, archivo-staging, ijccr, and archivo, and the section list migration ran fine on all of them. On avalon-dev, it looks like over half of the items did not get migrated. I think this was either because the migration was never run or because an early version of the migration, which didn't skip validations, was run. I manually ran saves (skipping validations) on all of the MediaObjects that didn't have section_list triples (effectively running the migration again) and everything looks good. I think this points to the migration being fine and avalon-dev having bad data. I could run this check on MCO-staging or MCO, but it would probably take a day and might need some changes to ensure it can handle the scale.

Here is what I ran:

ids = ActiveFedora::SolrService.instance.conn.get("select", params: {rows: 1000, q: "has_model_ssim:MediaObject", fl: ['id']})["response"]["docs"].pluck("id")
conn = ActiveFedora.fedora.connection.http
ids_missing_section_list = ids.select {|id| !conn.get(MediaObject.id_to_uri(id)).body.include? "section_list" }

cjcolvar commented 2 weeks ago

@elynema @joncameron Do you think I should try running this check on MCO too?

elynema commented 2 weeks ago

@cjcolvar My preference is yes, but I'm curious about Jon's opinion.

joncameron commented 1 week ago

Yes for me too. Seems worth the extra overhead just in case bad data is present in MCO environments.

cjcolvar commented 4 days ago

I ran the following script on mco-staging and MCO. mco-staging didn't report any ids as Not Found or missing. MCO reported 92 ids as missing the section list and none as Not Found, but upon further investigation all of those items had been deleted since the section_list migration.
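
For anyone summarizing the output of a run like this, here is a minimal sketch of tallying the log by outcome. It assumes each log line has the form `<id>: <message>`, with successful lines starting with `section_ids:`. The `tally_log` helper is hypothetical and not part of Avalon.

```ruby
# Hypothetical helper (not part of Avalon): counts log lines by outcome.
# Assumes each line looks like "<id>: <message>".
def tally_log(lines)
  counts = Hash.new(0)
  lines.each do |line|
    id, message = line.chomp.split(": ", 2)
    next if id.nil? || message.nil? # skip blank or malformed lines

    # Collapse every successful line into one bucket; keep other messages as-is.
    key = message.start_with?("section_ids:") ? "ok" : message
    counts[key] += 1
  end
  counts
end
```

It could be called with `tally_log(File.readlines("log/section_list_dump.fedora.txt"))` to get a hash of counts per outcome.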

cjcolvar commented 2 days ago

This is the code I ran. Note that it hasn't been fixed to handle deleted objects, which it misidentifies as missing section lists instead of reporting them as not found/deleted.

#!/usr/bin/env ruby

conn = ActiveFedora.fedora.connection.http

out = File.open("log/section_list_dump.fedora.txt", "a+")
last_id = out.readlines&.last&.split(':')&.first
puts "Resuming from #{last_id}" if last_id.present?

# Read from log/section_list_dump.post_migration.txt
File.readlines("log/section_list_dump.post_migration.txt").each do |line|
  id = line.split(':')&.first
  next unless id.present? && id.length == 9
  next if last_id.present? && id <= last_id

  # Request the object from Fedora as JSON-LD and capture the section_list triple
  response = conn.get(MediaObject.id_to_uri(id), {}, {"Accept" => "application/ld+json", "Prefer" => "return=minimal"})
  if response.status == 404
    out.puts "#{id}: Not Found"
    next
  end

  section_list = JSON.parse(response.body).dig(0, "http://avalonmediasystem.org/rdf/vocab/media_object#section_list", 0, "@value") rescue nil
  unless section_list.present?
    out.puts "#{id}: Section list missing from fedora"
    next
  end

  out.puts "#{id}: section_ids: #{section_list}"
end

out.close
puts "Completed"
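
One way the misidentification could be addressed is to classify the response by status before parsing the body. The sketch below is untested against a real Fedora and rests on an assumption not confirmed in this thread: that deleted objects answer with 410 Gone rather than 404, which would explain why they fell through to the "missing" branch. The `classify_response` helper is hypothetical; only the predicate URI and log messages come from the script above.

```ruby
require "json"

# Predicate URI copied from the script above.
SECTION_LIST_PREDICATE = "http://avalonmediasystem.org/rdf/vocab/media_object#section_list"

# Hypothetical helper: turns an HTTP status and body into one log message.
# Assumption: deleted objects return 410 Gone; verify against your Fedora.
def classify_response(status, body)
  return "Not Found" if status == 404
  return "Deleted" if status == 410
  return "Unexpected status #{status}" unless status == 200

  begin
    doc = JSON.parse(body)
  rescue JSON::ParserError
    return "Unparseable response"
  end

  # Expanded JSON-LD is an array of node objects; guard against other shapes.
  node = doc.is_a?(Array) ? doc.first : nil
  values = node.is_a?(Hash) ? node[SECTION_LIST_PREDICATE] : nil
  first = values.is_a?(Array) ? values.first : nil
  section_list = first.is_a?(Hash) ? first["@value"] : nil

  if section_list.nil? || section_list.to_s.empty?
    "Section list missing from fedora"
  else
    "section_ids: #{section_list}"
  end
end
```

Inside the loop, this would reduce the body to something like `out.puts "#{id}: #{classify_response(response.status, response.body)}"`, keeping the existing log format while separating deleted objects from ones that are truly missing the triple.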