On five samples, the download-debag-sync cycle takes about 11 minutes. There are approximately 1600 works in the set.
I will reset the DAG to have four instances running, and request 100 works. The restores will take about a day or so, so the downloads should start 19 May (ish). By 20 May I should have an idea of the velocity of doing 4 simultaneous processes.
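For rough scale, ignoring the Glacier restore latency, the numbers above work out as follows (a back-of-the-envelope sketch):

```python
# Back-of-the-envelope throughput, using the figures above.
works = 1600
minutes_per_work = 11
instances = 4

serial_days = works * minutes_per_work / 60 / 24   # ~12.2 days single-threaded
parallel_days = serial_days / instances            # ~3.1 days with 4 instances
print(f"serial: {serial_days:.1f} days, 4-way parallel: {parallel_days:.1f} days")
```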
Aaaaand, we're off! (the rest of the 100 is the same....)
create_time | update_time | id | object_name | restore_requested_on | restore_complete_on | download_complete_on | debag_complete_on | sync_complete_on | user_data |
---|---|---|---|---|---|---|---|---|---|
2024-05-13 15:28:54 | 2024-05-17 15:43:11 | 5344 | W1FPL2173 | 2024-05-17 15:43:11 | null | null | null | null | [{"user_data": {"aws_s3_key": "Archive1/73/W1FPL2173/W1FPL2173.bag.zip", "aws_s3_bucket": "glacier.staging.fpl.bdrc.org"}, "time_stamp": "2024-05-13T15:28:54.674280-04:00"}, {"user_data": {"restore_request_results": {"ResponseMetadata": {"HostId": "e4NJL30KAXvKbFULep2rmlKxXCOZ4d1cvlKqtby8KpEF+gZ736iCzUq1L50jrPecgVxuxSDgZf0=", "RequestId": "0MWZZQ45DMCNKRT8", "HTTPHeaders": {"date": "Fri, 17 May 2024 19:43:11 GMT", "server": "AmazonS3", "x-amz-id-2": "e4NJL30KAXvKbFULep2rmlKxXCOZ4d1cvlKqtby8KpEF+gZ736iCzUq1L50jrPecgVxuxSDgZf0=", "content-length": "0", "x-amz-request-id": "0MWZZQ45DMCNKRT8"}, "RetryAttempts": 0, "HTTPStatusCode": 202}}}, "time_stamp": "2024-05-17T15:43:11.017214-04:00"}] |
2024-05-13 15:28:56 | 2024-05-17 15:43:11 | 5395 | W1FPL2174 | 2024-05-17 15:43:12 | null | null | null | null | [{"user_data": {"aws_s3_key": "Archive1/74/W1FPL2174/W1FPL2174.bag.zip", "aws_s3_bucket": "glacier.staging.fpl.bdrc.org"}, "time_stamp": "2024-05-13T15:28:56.602798-04:00"}, {"user_data": {"restore_request_results": {"ResponseMetadata": {"HostId": "9Z9MAfpNsBAPR5CkC8UROYClvU7ORu0wRgBpUa40/rhrtvVWuW7PdxDBV1njfurFkq6d4vupSwE=", "RequestId": "F4R53K60E192HRJZ", "HTTPHeaders": {"date": "Fri, 17 May 2024 19:43:12 GMT", "server": "AmazonS3", "connection": "close", "x-amz-id-2": "9Z9MAfpNsBAPR5CkC8UROYClvU7ORu0wRgBpUa40/rhrtvVWuW7PdxDBV1njfurFkq6d4vupSwE=", "content-length": "0", "x-amz-request-id": "F4R53K60E192HRJZ"}, "RetryAttempts": 0, "HTTPStatusCode": 202}}}, "time_stamp": "2024-05-17T15:43:11.651524-04:00"}] |
2024-05-13 15:28:52 | 2024-05-17 15:43:10 | 5291 | W1FPL2172 | 2024-05-17 15:43:10 | null | null | null | null | [{"user_data": {"aws_s3_key": "Archive1/72/W1FPL2172/W1FPL2172.bag.zip", "aws_s3_bucket": "glacier.staging.fpl.bdrc.org"}, "time_stamp": "2024-05-13T15:28:52.303906-04:00"}, {"user_data": {"restore_request_results": {"ResponseMetadata": {"HostId": "Qx2pz+sGcGT7MhzDzdIvgj7FIjZmlv1WMnh6gvQYAdh1LdK+7PNEcG32BmsjQQEwa+TtE3sbJlw=", "RequestId": "0MWRCVNT8NCK2HRC", "HTTPHeaders": {"date": "Fri, 17 May 2024 19:43:11 GMT", "server": "AmazonS3", "x-amz-id-2": "Qx2pz+sGcGT7MhzDzdIvgj7FIjZmlv1WMnh6gvQYAdh1LdK+7PNEcG32BmsjQQEwa+TtE3sbJlw=", "content-length": "0", "x-amz-request-id": "0MWRCVNT8NCK2HRC"}, "RetryAttempts": 0, "HTTPStatusCode": 202}}}, "time_stamp": "2024-05-17T15:43:10.428279-04:00"}] |
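For reference, each `restore_request_results` entry in `user_data` is the response to a boto3 restore request roughly like this (a minimal sketch; the bucket and key come from the first row above, the 5-day window matches the restore duration mentioned later, and the retrieval tier is an assumption):

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to temporarily restore a Glacier-tier bag; S3 answers 202 Accepted
# and the object becomes downloadable once the restore job finishes.
response = s3.restore_object(
    Bucket="glacier.staging.fpl.bdrc.org",
    Key="Archive1/73/W1FPL2173/W1FPL2173.bag.zip",
    RestoreRequest={
        "Days": 5,  # keep the restored copy around for 5 days
        "GlacierJobParameters": {"Tier": "Bulk"},  # assumption: the actual tier may differ
    },
)
print(response["ResponseMetadata"]["HTTPStatusCode"])  # 202, as recorded in user_data
```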
Production notes: Scheduling 4 instances, with each instance scheduled 1 hour apart. Each instance takes 12 or so minutes to process, with most of the work being in the debagging.
This really bogs down sattva - sometimes the scheduler loses its heartbeat.
But anyway, to compress the latency, I restarted Airflow with instances scheduled 10 minutes apart (4 instances). This should be processing roughly 1 (or 1.2) works at a time.
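In DAG terms the change looks roughly like this (a hypothetical Airflow 2.x sketch; only the `download_from_messages` task id is taken from this issue, everything else is a placeholder):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical sketch: runs every 10 minutes with at most 4 active at once.
# The real DAG id, task list, and start date differ.
with DAG(
    dag_id="download_debag_sync",
    start_date=datetime(2024, 5, 20),
    schedule_interval=timedelta(minutes=10),
    max_active_runs=4,
    catchup=False,
) as dag:
    download = EmptyOperator(task_id="download_from_messages")
    debag = EmptyOperator(task_id="debag")
    sync = EmptyOperator(task_id="sync")

    download >> debag >> sync
```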
Query to get the running list:
select WorkName from `drs`.dip_activity_work
where WorkName like 'W1FPL%'
and dip_activity_types_label = 'ARCHIVE'
and dip_activity_finish > '2024-05-10'
and dip_activity_result_code = 0;
@eroux, results are attached. 247 works
Get_running_FPL_archive_syncs.csv
Update query with the current date (`dip_activity_finish` > '2024-05-22 13:00:00')
Hi @jimk-bdrc could you send me an updated list so I can start running them through SCAM over the week-end?
Status (of the 937 works that need to be sent to SCAM):
- Successful syncs: 223
- Incomplete syncs:
  - restore not requested yet: 437
  - restore requested but waiting to process: 285
  - restore processed:
    - sync failed: 3
    - underway: 2
Of the 285 works with a restore requested and waiting to process, the oldest had its sync requested on 5-23-24, and a restore notification can remain in the queue for up to two weeks. However, I only set the restore duration to 5 days. So I need to keep an eye on objects that were actually restored but that we were never notified about, and re-restore them. But to do that, I need get_or_create functionality in the db_phase.
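One way to spot works that were actually restored but never produced a notification is to ask S3 directly; `head_object` returns a `Restore` field that says whether a restore is in progress or complete (a minimal boto3 sketch, reusing the bucket and key from the table above):

```python
import boto3

s3 = boto3.client("s3")

# For a Glacier object with a restore in progress or completed, head_object
# returns a "Restore" field, e.g.
#   'ongoing-request="true"'                      -> still restoring
#   'ongoing-request="false", expiry-date="..."'  -> restored and downloadable
head = s3.head_object(
    Bucket="glacier.staging.fpl.bdrc.org",
    Key="Archive1/73/W1FPL2173/W1FPL2173.bag.zip",
)
restore = head.get("Restore")
if restore and 'ongoing-request="false"' in restore:
    print("restored, downloadable until expiry:", restore)
elif restore:
    print("restore still in progress")
else:
    print("no active restore (never requested, or the restored copy expired)")
```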
Here's how to track them. In the web UI, Browse -> Task Instances:
- Add filter -> Task Id equal to `download_from_messages`
- Add filter -> State equal to `failed`
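The same filter can be run from a script against the Airflow REST API (a hypothetical sketch, assuming the stable REST API with basic auth is enabled; host, port, and credentials are placeholders):

```python
import requests

AIRFLOW_API = "http://sattva:8080/api/v1"  # placeholder host/port

# List failed task instances across all DAGs and DAG runs, then keep only
# the download_from_messages task, mirroring the web-UI filter above.
resp = requests.get(
    f"{AIRFLOW_API}/dags/~/dagRuns/~/taskInstances",
    params={"state": "failed"},
    auth=("user", "password"),  # placeholder credentials
)
resp.raise_for_status()
for ti in resp.json()["task_instances"]:
    if ti["task_id"] == "download_from_messages":
        print(ti["dag_id"], ti["execution_date"], ti["state"])
```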
Last 285 started today. The prior batch, 153 works:
@eroux Last batch sync completed, with 34 works left to do manually. The reasons:
Cause | count | fix action |
---|---|---|
download failed | 5 | manual download |
sync failed | 23 | audit tool failures |
I will advise when the 34 are complete.
PS. The sync process is suspended, so no more downloads. @TBRC-Travis has synced some through the VPN. Will check on Monday.
Failed sync notes. (all paths on sattva:~homer/prod/aow23)
When the sync succeeds, the original source is cleaned up, so whatever failed is left in
~/dev/tmp/Projects/debag-sync/AO-staging-Incoming/bag-download/work
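A quick way to see what is still sitting there (a minimal sketch; it just assumes each failed work is a directory directly under that path):

```python
from pathlib import Path

# List leftover (failed) works in the debag-sync work directory.
WORK_DIR = Path.home() / "dev/tmp/Projects/debag-sync/AO-staging-Incoming/bag-download/work"

leftover = sorted(p.name for p in WORK_DIR.iterdir() if p.is_dir())
print(f"{len(leftover)} works left over")
for name in leftover:
    print(" ", name)
```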
The sync and audit tool logs are in the standard locations:
/mnt/processing/logs/sync-logs/
\<date> Repair notes:

Work | problem | fix |
---|---|---|
W1FPL2284 | sequence 129 missing from sources, archive, images | resequenced it away |
W1FPL2087 | non-duplicate image files ending in (1) | resequenced: I1FPL20870128(1).tif ==> I1FPL20870128.tif, I1FPL20870128.tif ==> I1FPL20870129.tif, I1FPL20870129(1).tif ==> I1FPL20870130.tif, I1FPL20870129.tif ==> I1FPL20870131.tif, I1FPL20870130(1).tif ==> I1FPL20870132.tif |
W1FPL2159 | extra file I1FPL21590003-2.xxx in archive, images; it is a looser-cropped version of I1FPL21590003.xxx | moved to ../../../backups/... ; original is in sources |
W1FPL2245 | sequence 2 missing from sources, archive, images | resequenced archive and images |
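For the record, the W1FPL2087 resequencing is just a chain of renames applied from the highest sequence number down so nothing gets overwritten (a hypothetical sketch; the image-group directory layout is an assumption, and archive/ would get the same treatment):

```python
from pathlib import Path

# Hypothetical replay of the W1FPL2087 resequencing from the table above.
# Applying the renames highest-target-first means no destination file exists
# yet when its rename runs.
renames = [
    ("I1FPL20870130(1).tif", "I1FPL20870132.tif"),
    ("I1FPL20870129.tif",    "I1FPL20870131.tif"),
    ("I1FPL20870129(1).tif", "I1FPL20870130.tif"),
    ("I1FPL20870128.tif",    "I1FPL20870129.tif"),
    ("I1FPL20870128(1).tif", "I1FPL20870128.tif"),
]
image_group_dir = Path("W1FPL2087/images/W1FPL2087-I1FPL2087")  # assumed layout
for src, dst in renames:
    (image_group_dir / src).rename(image_group_dir / dst)
```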
Let's wait for Travis maybe before resequencing, it might be the symptom of a missing file
Thanks a lot for that, SCAM is running day and night on these images!
> Let's wait for Travis maybe before resequencing, it might be the symptom of a missing file
@TBRC-Travis I'll let you know when I've analyzed all these; if you could then check the ones that have missing files, we can see if they need a resync.
Analysis done, syncing of works that passed is underway.
@TBRC-Travis - I've fixed up what can be fixed up and have begun the sync.
The home for all the works mentioned here is sattva:~__me__/dev/tmp/Projects/debag-sync/AO-staging-Incoming/bag-download/work
The Google Sheet FPL reconciliation lists what I've found in the way of missing files.
I basically did only two things, depending on whether the files missing from `archive/` and `images/` were also missing from `sources`:
- If `missing in sources` for the missing files is Y, I left `sources` as I found it, but resequenced the image files in `archive/` and `images/`.
- If `missing in sources` was N, I created the `archive/` and `images/` versions from the RAW file.
There are three works, highlighted in red, that were impossible to patch. One of them required a lot of reprocessing, and while I don't mind doing my own hand reprocessing for 1 or 2 files, I felt that trying to make 20 files look just like the rest would have been a time-consuming failure. Thom can reapply his usual presets and redo them.
The works I couldn't reprocess were:

work | problem |
---|---|
3111 | Cannot convert. The missing file is in sources, but ImagingEdge cannot open it ("file format not supported / The image may be corrupted"). In GraphicConverter, the preview shows two folios, but the detailed image only shows one. |
3139 | Different missing ranges in archive and images; 100 pages need processing |
3239 | Huge number of missing files in all three media |
Syncs that failed 2024-06-20. All are audit-tool failures.
The following works could not be published. See log file: /mnt/processing/logs/sync-logs/2024-06-20/2024-06-20_13.56.44/sync-2024-06-20_13.56.44.log

work | cause | fix |
---|---|---|
W1FPL2087 | result filenames with () | sync - delete |
W1FPL2482, W1FPL3166 | result filenames with 000nn | sync - delete |
W1FPL2543 | images missing 3 | plain resync |
W1FPL3191 | 240..241 missing | overlooked in first processing, just resync, no delete |
W1FPL3305, W1FPL3477 | 000nn | resync - delete |
These are all archived and deep archived.
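To catch the two filename problems above (parenthesised duplicates and 000nn sequence numbers) before a future sync, a pre-check along these lines could help (a hypothetical sketch; the work path, image-group name, and the 4-digit padding convention are assumptions based on the filenames in this issue):

```python
import re
from pathlib import Path

PAREN_DUP = re.compile(r"\(\d+\)")  # e.g. I1FPL20870128(1).tif

def suspicious(image_dir: Path, image_group: str):
    """Yield (file, reason) for names that don't fit <image_group> + 4-digit sequence."""
    for path in sorted(image_dir.glob("*")):
        if not path.is_file():
            continue
        if PAREN_DUP.search(path.name):
            yield path, "contains (n)"
            continue
        seq = path.stem[len(image_group):] if path.stem.startswith(image_group) else ""
        if not (seq.isdigit() and len(seq) == 4):
            yield path, f"unexpected sequence part {seq!r}"

# Usage (the directory layout and group name are assumptions):
for path, reason in suspicious(Path("W1FPL2482/images/W1FPL2482-I1FPL2482"), "I1FPL2482"):
    print(path.name, "->", reason)
```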
@eroux needs these works to be brought in to send to SCAM.
W1FPL2080 to W1FPL3800