buda-base / ao-workflows

Use DAG platform to define and orchestrate workflows

Crawl the IIIFPRES repo for disordered json #1

Closed (jimk-bdrc closed 1 year ago)

jimk-bdrc commented 2 years ago

...Instead of regenerating the jsons in option 3, perhaps we can have some code that checks the order of the existing dimensions.json files and when not in order, compares it with the list on exide?

Originally posted by @eroux in https://github.com/buda-base/archive-ops/issues/608#issuecomment-1036378712

eroux commented 2 years ago

Yes, note that it could also be done by reading the file on s3 directly (iiifpres is not a point of failure then, and there's no performance impact for other users)
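
A minimal sketch of the check discussed above, assuming a dimensions.json is a JSON list of objects with a `filename` field (optionally gzip-compressed), and that "in order" means ascending filename order. The function names, the `filename` key, and the bucket/key arguments are illustrative assumptions, not the code actually used; only `boto3.client("s3").get_object` is a real library call.

```python
import gzip
import json


def load_dimensions(raw: bytes) -> list:
    """Decode a dimensions.json payload (assumed: JSON list, optionally gzipped)."""
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


def is_ordered(entries: list, key: str = "filename") -> bool:
    """True when entries appear in ascending order of the (assumed) filename key."""
    names = [entry[key] for entry in entries]
    return names == sorted(names)


def fetch_from_s3(bucket: str, object_key: str) -> bytes:
    """Read one object straight from S3, bypassing iiifpres (hypothetical bucket/key)."""
    import boto3  # imported lazily so the checks above run without AWS access

    response = boto3.client("s3").get_object(Bucket=bucket, Key=object_key)
    return response["Body"].read()
```

Keeping the order check separate from the S3 read means the check can be run against local copies or test payloads without touching the archive.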

jimk-bdrc commented 2 years ago

@eroux @JannTibetan found these:

eroux commented 2 years ago

oh perhaps the description of the issue should be updated? these two are not disordered

jimk-bdrc commented 2 years ago

@eroux writes:

> Yes, note that it could also be done by reading the file on s3 directly (iiifpres is not a point of failure then, and there's no performance impact for other users)

Yes, that's what I'm going to do - we're looking at not having anything other than image files in image group folders in the archive - just adding them to distributions as needed - they really get in the way.

jimk-bdrc commented 2 years ago

> oh perhaps the description of the issue should be updated? these two are not disordered

OK, then I'll look at them on the archive and see what's wrong.

jimk-bdrc commented 2 years ago

W29628 is all set. It took a while to get to it, sorry. They usually can be fixed right after request.

JannTibetan commented 2 years ago

Works like a charm. Thanks!

JannTibetan commented 2 years ago

Oops? Did I prematurely close this issue? I'm not in a position to know if the resolution of yesterday's problem entailed the resolution of the overall issue. Please re-close it if all is well now.

jimk-bdrc commented 2 years ago

> Oops? Did I prematurely close this issue? I'm not in a position to know if the resolution of yesterday's problem entailed the resolution of the overall issue. Please re-close it if all is well now.

You did the right thing. I was going to reopen it because it does stand for fixing them all. (On the plus side, while I was distracting myself during meditation, I figured out how to manage this task - and others. Let the hacking begin!)

JannTibetan commented 2 years ago

Great. Hack away!

jimk-bdrc commented 2 years ago

Still very much open.

jimk-bdrc commented 2 years ago

Running dagster (see repository buda-base/ao-workflows) on bodhi (DAGSTER_HOME=/vmpool/data/dagster), I've implemented phase 1, which is to scan all 32000 works that @eroux identified as being on the server (scans.lst.zip).

jimk-bdrc commented 2 years ago

Running on bodhi. See http://10.0.8.121:8000

Processing 5 dimensions.json files per second (200 ms each). 32000 / 5 = 6400 sec, or about 1.8 hours, to process all works. (NB: this is single-threaded.)
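
The estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes one 200 ms read per work and perfectly even scaling across workers, and the `scan_seconds` helper and its `workers` parameter are hypothetical, not part of the dagster job.

```python
def scan_seconds(items: int, seconds_per_item: float, workers: int = 1) -> float:
    """Wall-clock seconds to process `items`, assuming ideal scaling over workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return items * seconds_per_item / workers


single = scan_seconds(32_000, 0.2)
print(f"single-threaded: {single:.0f} s = {single / 3600:.1f} h")
# prints "single-threaded: 6400 s = 1.8 h"
```

Under the same assumptions, `scan_seconds(32_000, 0.2, workers=4)` comes to about 1600 seconds, which shows how much a multi-threaded scan could save.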

eroux commented 2 years ago

Just a small thing: the list is the works that have at least one volume with a total number of pages > 2 recorded in the database (reported from the database, there might be a few missing)

jimk-bdrc commented 1 year ago

Scan found 2188 failed image groups across 1428 works.

JannTibetan commented 1 year ago

Thank you. What kinds of actions are required to repair these image groups?

jimk-bdrc commented 1 year ago

> Thank you. What kinds of actions are required to repair these image groups?

We only need to regenerate the dimensions.json, using components in the existing volume manifest builder.

jimk-bdrc commented 1 year ago

The fix run is underway. Inside the VPN, go to Fix-igs Run.

We started at 9:44 AM, and at 12:15 PM (about 2.5 hours in) 220 image groups out of 2188 (10%) are complete. Estimated finish: 10 Aug 2022, around 10 AM.
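
The projection is a straight linear extrapolation from progress so far. A small sketch, assuming the run started on 9 Aug 2022 (the comment gives only the clock times and the finish date) and that throughput stays constant; `projected_finish` is a hypothetical helper, not part of the fix job.

```python
from datetime import datetime


def projected_finish(start: datetime, now: datetime, done: int, total: int) -> datetime:
    """Extrapolate the finish time, assuming the rate observed so far holds."""
    if done <= 0:
        raise ValueError("need at least one completed item to extrapolate")
    elapsed = now - start
    return start + elapsed * (total / done)


start = datetime(2022, 8, 9, 9, 44)   # assumed start date
now = datetime(2022, 8, 9, 12, 15)
finish = projected_finish(start, now, done=220, total=2188)
print(finish.strftime("%Y-%m-%d %H:%M"))  # prints "2022-08-10 10:45"
```

That is about 25 hours in total, in line with the mid-morning 10 Aug estimate.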

jimk-bdrc commented 1 year ago

ao610-fix-logs.zip This zip file contains the console logs of the jobs which:

jimk-bdrc commented 1 year ago

A quick scan of some of the works in the log zip shows they are OK. Transferring to @eroux for review and closure if needed. The implementing code is in this Git repository.