buda-base / ao-workflows

Use DAG platform to define and orchestrate workflows

First FPL to bring in to archive #23

Closed jimk-bdrc closed 4 months ago

jimk-bdrc commented 6 months ago

@eroux needs these works to be brought in to send to SCAM.

W1FPL2080 to W1FPL3800

jimk-bdrc commented 6 months ago

On five samples, the download-debag-sync cycle takes about 11 minutes. There are approximately 1600 works in the set.

I will reset the DAG to have four instances running, and request 100 works. The restores will take about a day or so, so the downloads should start 19 May (ish). By 20 May I should have an idea of the velocity of running 4 simultaneous processes.
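As a rough check on those numbers, here is a back-of-the-envelope sketch. It assumes the 11-minute cycle time holds under load and that four instances scale perfectly, neither of which has been measured yet; `estimated_hours` is a hypothetical helper, not part of the DAG code.

```python
def estimated_hours(n_works: int, minutes_per_work: float, parallel: int) -> float:
    """Wall-clock hours to process n_works, assuming perfect parallelism."""
    return n_works * minutes_per_work / parallel / 60


# Figures from the comment above: ~11 minutes per work, 4 instances.
print(f"100-work batch: {estimated_hours(100, 11, 4):.1f} h")   # 4.6 h
print(f"1600-work set:  {estimated_hours(1600, 11, 4):.1f} h")  # 73.3 h
```

This ignores the Glacier restore latency (about a day per request batch), so it is a lower bound on elapsed time.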

jimk-bdrc commented 6 months ago

Aaaaand, we're off! (the rest of the 100 is the same....)

create_time update_time id object_name restore_requested_on restore_complete_on download_complete_on debag_complete_on sync_complete_on user_data
2024-05-13 15:28:54 2024-05-17 15:43:11 5344 W1FPL2173 2024-05-17 15:43:11 null null null null [{"user_data": {"aws_s3_key": "Archive1/73/W1FPL2173/W1FPL2173.bag.zip", "aws_s3_bucket": "glacier.staging.fpl.bdrc.org"}, "time_stamp": "2024-05-13T15:28:54.674280-04:00"}, {"user_data": {"restore_request_results": {"ResponseMetadata": {"HostId": "e4NJL30KAXvKbFULep2rmlKxXCOZ4d1cvlKqtby8KpEF+gZ736iCzUq1L50jrPecgVxuxSDgZf0=", "RequestId": "0MWZZQ45DMCNKRT8", "HTTPHeaders": {"date": "Fri, 17 May 2024 19:43:11 GMT", "server": "AmazonS3", "x-amz-id-2": "e4NJL30KAXvKbFULep2rmlKxXCOZ4d1cvlKqtby8KpEF+gZ736iCzUq1L50jrPecgVxuxSDgZf0=", "content-length": "0", "x-amz-request-id": "0MWZZQ45DMCNKRT8"}, "RetryAttempts": 0, "HTTPStatusCode": 202}}}, "time_stamp": "2024-05-17T15:43:11.017214-04:00"}]
2024-05-13 15:28:56 2024-05-17 15:43:11 5395 W1FPL2174 2024-05-17 15:43:12 null null null null [{"user_data": {"aws_s3_key": "Archive1/74/W1FPL2174/W1FPL2174.bag.zip", "aws_s3_bucket": "glacier.staging.fpl.bdrc.org"}, "time_stamp": "2024-05-13T15:28:56.602798-04:00"}, {"user_data": {"restore_request_results": {"ResponseMetadata": {"HostId": "9Z9MAfpNsBAPR5CkC8UROYClvU7ORu0wRgBpUa40/rhrtvVWuW7PdxDBV1njfurFkq6d4vupSwE=", "RequestId": "F4R53K60E192HRJZ", "HTTPHeaders": {"date": "Fri, 17 May 2024 19:43:12 GMT", "server": "AmazonS3", "connection": "close", "x-amz-id-2": "9Z9MAfpNsBAPR5CkC8UROYClvU7ORu0wRgBpUa40/rhrtvVWuW7PdxDBV1njfurFkq6d4vupSwE=", "content-length": "0", "x-amz-request-id": "F4R53K60E192HRJZ"}, "RetryAttempts": 0, "HTTPStatusCode": 202}}}, "time_stamp": "2024-05-17T15:43:11.651524-04:00"}]
2024-05-13 15:28:52 2024-05-17 15:43:10 5291 W1FPL2172 2024-05-17 15:43:10 null null null null [{"user_data": {"aws_s3_key": "Archive1/72/W1FPL2172/W1FPL2172.bag.zip", "aws_s3_bucket": "glacier.staging.fpl.bdrc.org"}, "time_stamp": "2024-05-13T15:28:52.303906-04:00"}, {"user_data": {"restore_request_results": {"ResponseMetadata": {"HostId": "Qx2pz+sGcGT7MhzDzdIvgj7FIjZmlv1WMnh6gvQYAdh1LdK+7PNEcG32BmsjQQEwa+TtE3sbJlw=", "RequestId": "0MWRCVNT8NCK2HRC", "HTTPHeaders": {"date": "Fri, 17 May 2024 19:43:11 GMT", "server": "AmazonS3", "x-amz-id-2": "Qx2pz+sGcGT7MhzDzdIvgj7FIjZmlv1WMnh6gvQYAdh1LdK+7PNEcG32BmsjQQEwa+TtE3sbJlw=", "content-length": "0", "x-amz-request-id": "0MWRCVNT8NCK2HRC"}, "RetryAttempts": 0, "HTTPStatusCode": 202}}}, "time_stamp": "2024-05-17T15:43:10.428279-04:00"}]
jimk-bdrc commented 6 months ago

Production notes: Scheduling 4 instances, with each instance scheduled 1 hour apart. Each instance takes 12 or so minutes to process, with most of the work being in the debagging.

This really bogs down sattva - sometimes the scheduler loses its heartbeat.

But anyway, to compress the latency, I restarted Airflow with 4 instances scheduled 10 minutes apart. This should be processing roughly 1 (or 1.2) works at a time.
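The "1.2" figure is just run time divided by the spacing between starts. A minimal sketch of that arithmetic, assuming each run takes the 12 minutes quoted above (`average_concurrency` is a hypothetical name used only for illustration):

```python
def average_concurrency(minutes_per_run: float, minutes_between_starts: float) -> float:
    """Average number of runs in flight when a new run starts at a fixed interval."""
    return minutes_per_run / minutes_between_starts


print(average_concurrency(12, 60))  # starts 1 hour apart
print(average_concurrency(12, 10))  # starts 10 minutes apart
```

With hourly starts the machine is idle most of the time (0.2 runs in flight on average). With 10-minute starts, runs overlap slightly, which is where the extra load on sattva would come from.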

jimk-bdrc commented 6 months ago

Query to get the running list

select Workname from `drs`.dip_activity_work 
                where WorkName like 'W1FPL%' 
                  and dip_activity_types_label = 'ARCHIVE' 
                  and dip_activity_finish > '2024-05-10' 
                  and dip_activity_result_code = 0;

@eroux, results are attached. 247 works

Get_running_FPL_archive_syncs.csv

Update query with current date (`dip_activity_finish` > '2024-05-22 13:00:00')
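Since the cutoff date changes on every run, the query could be parameterized rather than edited by hand. This is only a sketch: it assumes a MySQL driver that uses `%s` placeholders (as pymysql and mysqlclient do), and `running_list_query` is a hypothetical helper, not something in ao-workflows.

```python
from datetime import datetime

# Same query as above, with the work-name prefix and cutoff as parameters.
QUERY = """
select WorkName from `drs`.dip_activity_work
 where WorkName like %s
   and dip_activity_types_label = 'ARCHIVE'
   and dip_activity_finish > %s
   and dip_activity_result_code = 0
"""


def running_list_query(cutoff: datetime, prefix: str = "W1FPL"):
    """Return (sql, params), ready to pass to cursor.execute(sql, params)."""
    return QUERY, (prefix + "%", cutoff.strftime("%Y-%m-%d %H:%M:%S"))


sql, params = running_list_query(datetime(2024, 5, 22, 13, 0, 0))
print(params)
```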

eroux commented 6 months ago

Hi @jimk-bdrc could you send me an updated list so I can start running them through SCAM over the week-end?

jimk-bdrc commented 6 months ago

Status (of the 937 works that need to be sent to SCAM):

Successful syncs: 223
Incomplete syncs:
- restore not requested yet: 437
- restore requested but waiting to process: 285
- restore processed: sync failed 3, underway 2

Of the 285 restore requested and waiting to process, the oldest one had the sync requested 5-23-24, and the longest that the restore notification remains in the queue is two weeks. However, I only set the restore duration for 5 days. So I need to keep an eye on objects that were actually restored, but we haven't been notified of, and re-restore them. But to do that, I need get_or_create functionality in the db_phase.
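One way to spot objects whose restore window has lapsed is to look at the `Restore` value S3 returns from a HEAD request on the object (boto3 exposes it as the `Restore` key of the `head_object` response). The sketch below only parses that string, so it runs without AWS access. `restore_state` and its three state names are hypothetical, and treating a missing header as "needs restore" is an assumption.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def restore_state(restore_header, now):
    """Classify an object from its S3 Restore header value.

    Returns "in_progress", "available", or "needs_restore".
    """
    if not restore_header:
        # Assumption: no header means no restored copy exists right now.
        return "needs_restore"
    if 'ongoing-request="true"' in restore_header:
        return "in_progress"
    m = re.search(r'expiry-date="([^"]+)"', restore_header)
    if m and parsedate_to_datetime(m.group(1)) > now:
        return "available"
    return "needs_restore"


now = datetime(2024, 6, 7, tzinfo=timezone.utc)
hdr = 'ongoing-request="false", expiry-date="Wed, 29 May 2024 00:00:00 GMT"'
print(restore_state(hdr, now))
```

Anything classified as "needs_restore" that already has a restore_requested_on timestamp would be a candidate for re-requesting.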

jimk-bdrc commented 6 months ago

> Of the 285 restore requested and waiting to process, the oldest one had the sync requested 5-23-24, and the longest that the restore notification remains in the queue is two weeks. However, I only set the restore duration for 5 days. So I need to keep an eye on objects that were actually restored, but we haven't been notified of, and re-restore them. But to do that, I need get_or_create functionality in the db_phase.

Here's how to track them. In the web UI, Browse -> task instances

image

Add filter -> Task Id, Equal to: download_from_messages
Add filter -> State, Equal to: failed

image
jimk-bdrc commented 6 months ago

Last 285 started today. The prior batch, 153 works:

jimk-bdrc commented 5 months ago

@eroux Last batch sync completed, with 34 works left to do manually. The reasons:

| Cause | count | fix action |
| --- | --- | --- |
| download failed | 5 | manual download |
| sync failed | 23 | audit tool failures |

I will advise when the 34 are complete.

PS. The sync process is suspended, so no more downloads. @TBRC-Travis has synced some through the VPN. Will check on Monday.

jimk-bdrc commented 5 months ago

Failed sync notes. (all paths on sattva:~homer/prod/aow23) When the sync succeeds, the original source is cleaned up, so what's failed is left on ~/dev/tmp/Projects/debag-sync/AO-staging-Incoming/bag-download/work

The sync and audit tool logs are in the standard locations:

jimk-bdrc commented 5 months ago

Repair notes:

| Work | problem | fix |
| --- | --- | --- |
| W1FPL2284 | sequence 129 missing from sources, archive, images | resequenced it away |
| W1FPL2087 | non-duplicate image files ending in (1) | resequenced (renames listed below) |
| W1FPL2159 | extra file I1FPL21590003-2.xxx in archive, images. Is a looser cropped version of I1FPL21590003.xxx | Moved to ../../../backups/... . Original is in sources |
| W1FPL2245 | sequence 2 missing from sources, archive, images | Resequenced archive and images |

W1FPL2087 renames:
I1FPL20870128(1).tif ==> I1FPL20870128.tif
I1FPL20870128.tif ==> I1FPL20870129.tif
I1FPL20870129(1).tif ==> I1FPL20870130.tif
I1FPL20870129.tif ==> I1FPL20870131.tif
I1FPL20870130(1).tif ==> I1FPL20870132.tif
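The resequencing rule used for W1FPL2087 (a "(1)" file sorts ahead of the plain file with the same number, then everything is renumbered consecutively) can be sketched as below. This is an illustration only, not the script that was actually run. `plan_resequence` is a hypothetical helper, and it assumes names of the form stem + 4-digit sequence + optional "(n)" + extension.

```python
import re

# Assumed layout: stem, 4-digit sequence, optional "(n)" copy marker, extension.
NAME = re.compile(r"^(?P<stem>.+?)(?P<seq>\d{4})(?:\((?P<copy>\d+)\))?(?P<ext>\.\w+)$")


def plan_resequence(filenames):
    """Return [(old_name, new_name), ...] that renumbers files consecutively."""
    parsed = []
    for name in filenames:
        m = NAME.match(name)
        if m is None:
            raise ValueError(f"unrecognized image file name: {name}")
        copy = m["copy"]
        # "(n)" copies sort ahead of the plain file with the same sequence number.
        key = (int(m["seq"]), copy is None, int(copy or 0))
        parsed.append((key, name, m["stem"], m["ext"]))
    parsed.sort(key=lambda p: p[0])

    plan = []
    next_seq = parsed[0][0][0] if parsed else 0  # start at the lowest sequence seen
    for _, name, stem, ext in parsed:
        new_name = f"{stem}{next_seq:04d}{ext}"
        if new_name != name:
            plan.append((name, new_name))
        next_seq += 1
    return plan


files = ["I1FPL20870128.tif", "I1FPL20870128(1).tif", "I1FPL20870129(1).tif",
         "I1FPL20870129.tif", "I1FPL20870130(1).tif"]
for old, new in plan_resequence(files):
    print(f"{old} ==> {new}")
```

The function only produces the plan. Applying it in place needs care, because a target name can still be occupied by a file that has not been moved yet; renaming into a scratch directory first avoids overwriting. The same plan also closes a gap left by a missing sequence number.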
eroux commented 5 months ago

Let's wait for Travis maybe before resequencing, it might be the symptom of a missing file

eroux commented 5 months ago

Thanks a lot for that, SCAM is running day and night on these images!

jimk-bdrc commented 5 months ago

> Let's wait for Travis maybe before resequencing, it might be the symptom of a missing file

@TBRC-Travis I'll let you know when I've analyzed all these, and if you could check to see the ones that have missing files when I'm done, we can see if they need resync.

jimk-bdrc commented 5 months ago

Analysis done, syncing of works that passed is underway.

jimk-bdrc commented 5 months ago

@TBRC-Travis - I've fixed up what can be fixed up, and begun the sync.

The home for all the works mentioned here is sattva:~__me__/dev/tmp/Projects/debag-sync/AO-staging-Incoming/bag-download/work

The Google Sheet FPL reconciliation lists what I've found in missing files.

I basically only did two things, depending on whether the files missing from archive/ and images/ were also missing in sources:

If 'missing in sources' for the missing files is Y, I left sources as I found it, but resequenced the image files in archive/ and images/.

If 'missing in sources' was N, I created the archive/ and images/ versions from the RAW file.

There are three works, highlighted in red, that were impossible to patch. One of them required a lot of reprocessing, and while I don't mind doing my own hand reprocessing for 1 or 2 files, I felt that trying to make 20 files look just like the rest would have been a time-consuming failure. Thom can reapply his usual presets and redo them.

The works I couldn't reprocess were:

| work | problem |
| --- | --- |
| 3111 | Cannot convert. The missing file is in sources, but ImagingEdge cannot open it ("file format not supported/The image may be corrupted"). In GraphicConverter, the preview shows two folios, but the detailed image only shows one. |
| 3139 | Different missing ranges in archive and images, 100 pages need processing |
| 3239 | Huge number of missing files in all three media |
jimk-bdrc commented 5 months ago

Syncs that failed 2024-06-20. All are audit-tool failures.

The following works could not be published. See log file: /mnt/processing/logs/sync-logs/2024-06-20/2024-06-20_13.56.44/sync-2024-06-20_13.56.44.log

| w | cause | fix |
| --- | --- | --- |
| W1FPL2087 | result filenames with () | sync - delete |
| W1FPL2482, W1FPL3166 | result filenames with 000nn | sync -delete |
| W1FPL2543 | images missing 3 | plain resync |
| W1FPL3191 | 240..241 missing | overlooked in first processing, just resync no delete |
| W1FPL3305, W1FPL3477 | 000nn | resync -delete |
jimk-bdrc commented 4 months ago

These are all archived and deep archived.