CDLUC3 / ezid

MIT License

[EPIC] Cleanup EZID queues #723

Open jsjiang opened 1 month ago

jsjiang commented 1 month ago

We run the proc-cleanup-async-queues.py script to delete successfully processed or not-applicable identifiers from the binder, crossref, datacite, and search queues and from the RefIdentifier table. The script filters records by queue status (SUCCESS or IGNORED) and by update timestamp (updated in the past week). We started running proc-cleanup-async-queues.py as a background job sometime in 2023 (most likely June 29, 2023). There are some data issues with these queues:

  1. There might have been a status definition change with the June 29, 2023 deployment:

    • the successful status changed from "S" to "O"
    • the earliest DataCite queue record with status="O" has timestamp "2023-06-29 05:08:25" (Thursday 5am) and seq=7993833
    • there are over 7.9M records in the DataCite queue that were created before the seq=7993833 record; most have status='S', a small group (1,550 records) has status='F', and no records were logged with any other status
    • the status change from "S" to "O" applied to the searchindex and binder queues as well; the latest searchindex queue record with status="S" has timestamp "2023-06-29 05:03:43" and seq=9337226; the earliest record with status="O" has timestamp "2023-06-29 05:08:20" and seq=9337227
    • if status='S' was used to indicate successfully processed, then those records should be deleted from the queue
  2. The proc-cleanup-async-queues job might have missed records with SUCCESS and IGNORED status in its processing window (now - 2 weeks):

    • in the DataCite queue there are over 4M records with success or ignore status that were skipped by the proc-cleanup-async-queues job; these records should have been deleted once they entered the "past week" window
    • the QuerySet is filtered by time range and then sliced to the batch size (100); when more than 100 records fall within the time range in one run, the surplus records are missed
  3. We stopped the proc-binder queue on Aug 12, 2024. However, we have not stopped writing records into the ezidapp_binderqueue table; there are over 560K records with UNSUBMITTED = "U" status in the binder queue since then (seq > 14941881). All binder queue records (over 12M) can be deleted.

The change-over point for the datacite queue is seq=7993833:

 select *, from_unixtime(enqueueTime), from_unixtime(submitTime) from ezidapp_datacitequeue 
 where seq between 7993833-1000 and 7993833+1000;

Records that had passed the "past week" window but had not been deleted by proc-cleanup-async-queues:

select *, from_unixtime(enqueueTime), from_unixtime(submitTime) from ezidapp_datacitequeue 
 where from_unixtime(enqueueTime) > '2023-06-28'
 and from_unixtime(enqueueTime) <  '2024-08-10'
 and status in ('O', 'I');

The QuerySet with time range and batch size filters that caused records to be missed:

refIdsQS = self.refIdentifier.objects.filter(
                updateTime__lte=currentTime,
                updateTime__gte=currentTime - timeDelta
            ).order_by("-pk")[: django.conf.settings.DAEMONS_MAX_BATCH_SIZE]
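The gap caused by that slice can be reproduced outside Django. Below is a minimal plain-Python sketch (all names hypothetical; `(pk, updateTime)` tuples stand in for RefIdentifier rows) showing why processing only one batch per cycle leaves records behind once they age out of the window, and how draining the window in a loop within the same run avoids it:

```python
BATCH_SIZE = 100   # mimics django.conf.settings.DAEMONS_MAX_BATCH_SIZE
WINDOW = 7         # days; the job only considers records updated in this window

def one_batch(records, now):
    """One daemon cycle as currently written: slice off a single batch and delete it."""
    in_window = [(pk, t) for pk, t in records if now - WINDOW <= t <= now]
    # sorted(..., reverse=True)[:BATCH_SIZE] mimics order_by("-pk")[:BATCH_SIZE]
    doomed = {pk for pk, _ in sorted(in_window, reverse=True)[:BATCH_SIZE]}
    return [(pk, t) for pk, t in records if pk not in doomed]

def drain(records, now):
    """Fixed cycle: keep taking batches until no record is left in the window."""
    while True:
        remaining = one_batch(records, now)
        if len(remaining) == len(records):   # nothing left in the window
            return remaining
        records = remaining

# 250 rows all updated at day 0, cleaned up at day 3 (inside the window).
records = [(pk, 0) for pk in range(250)]
left = one_batch(records, now=3)     # single batch: 150 rows left behind
assert len(left) == 150              # ... and they age out of the window unseen
assert drain(records, now=3) == []   # draining removes all 250 in one run
```

In the real job the same effect can be achieved by re-evaluating the sliced QuerySet in a loop until it comes back empty, rather than slicing once per cycle.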

EZID-PRD Queue status on 2024-10-14 reported by running the diag-queue-stats command:

{
  "download": {},
  "binder": {
    "F": 6,
    "O": 4015090,
    "S": 7993776,
    "U": 669913
  },
  "datacite": {
    "F": 1565,
    "O": 5192961,
    "S": 7992232
  },
  "crossref": {
    "F": 3863,
    "I": 13155480,
    "O": 16926,
    "W": 76
  },
  "searchindexer": {
    "F": 54,
    "O": 5291658,
    "S": 9337175
  }
}
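A quick arithmetic check that the per-status counts above agree with the figures quoted in the issue description:

```python
# Binder queue counts copied from the 2024-10-14 diag-queue-stats output above.
binder = {"F": 6, "O": 4015090, "S": 7993776, "U": 669913}

total = sum(binder.values())
assert total == 12678785        # matches "over 12M" binder records to delete
assert binder["U"] > 560_000    # matches "over 560K" UNSUBMITTED records
```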
adambuttrick commented 1 month ago

Part of https://github.com/CDLUC3/ezid/issues/707

jsjiang commented 1 month ago

Related tickets: #394 https://github.com/CDLUC3/ezid/issues/371

jsjiang commented 3 weeks ago

Queue data cleanup

Queues:

Associated datasets:

Notes:

Procedures:

Exception occured while processing identifier 'ark:/99999/fk4km0qb78' for 'refId' table
   ERROR ezidapp.management.commands.proc-cleanup-async-queues_v2 7961173056: ("Cannot delete some instances of model 'RefIdentifier' because they are referenced through protected foreign keys: 'BinderQueue.refIdentifier'.", {<BinderQueue: BinderQueue(id=1, refId=ark:/99999/fk4km0qb78, op=CREATE, status="Awaiting submission", owner="admin (admin)", group="f51340b091 ()")>})
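The error above is Django's ProtectedError: a BinderQueue row still references the RefIdentifier through a protected foreign key (on_delete=PROTECT), so the cleanup must exclude (or first delete) any RefIdentifier that a queue still points at. A minimal sketch with plain sets standing in for querysets (all names and values hypothetical; in the ORM this would be something like excluding pks found in `BinderQueue.objects.values_list("refIdentifier", flat=True)`):

```python
# Hypothetical refIdentifier pks still referenced by each queue table
# through a protected foreign key.
binder_refs   = {1, 2}
datacite_refs = {2, 3}

# RefIdentifier pks selected for cleanup by status/time filters.
cleanup_candidates = {1, 2, 3, 4, 5}

# Deleting a referenced RefIdentifier raises ProtectedError, so remove
# every still-referenced pk from the candidate set before deleting.
referenced = binder_refs | datacite_refs
safe_to_delete = cleanup_candidates - referenced
assert safe_to_delete == {4, 5}
```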
jsjiang commented 1 week ago

Implementation procedure:

jsjiang commented 1 week ago

Test on ezid-stg (script v0.0.8rc1):

Batch update:

mysql -h rds-uc3-ezid1-stg.cmcguhglinoa.us-west-2.rds.amazonaws.com -u ezidrw ezid -p < 723_update_queue_status.sql 

Batch delete:

ezid@uc3-ezidui-stg01:15:52:54:~/ezid-ops-scripts/scripts/sql_scripts$ mysql -h rds-uc3-ezid1-stg.cmcguhglinoa.us-west-2.rds.amazonaws.com -u ezidrw ezid -p < 723_delete_binder_queue.sql 

diag-queue-stats before batch update:

{
  "download": {},
  "binder": {
    "F": 1,
    "O": 44111,
    "S": 137359,
    "U": 530
  },
  "datacite": {
    "F": 19769,
    "O": 47132,
    "S": 128571
  },
  "crossref": {
    "F": 568,
    "I": 181521,
    "O": 2439,
    "W": 6
  },
  "searchindexer": {
    "F": 103,
    "O": 47401,
    "S": 752336
  }
}

diag-queue-stats after batch update:

{
  "download": {},
  "binder": {
    "F": 1,
    "O": 44111,
    "S": 137359,
    "U": 530
  },
  "datacite": {
    "F": 19769,
    "O": 175703
  },
  "crossref": {
    "F": 568,
    "I": 181521,
    "O": 2439,
    "W": 6
  },
  "searchindexer": {
    "F": 103,
    "O": 799737
  }
}
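As a consistency check (figures copied from the two stg snapshots above), the batch update appears to have folded every "S" row into "O" while leaving the per-queue totals intact:

```python
# Counts from the stg diag-queue-stats snapshots before/after the batch update.
before = {
    "datacite":      {"F": 19769, "O": 47132,  "S": 128571},
    "searchindexer": {"F": 103,   "O": 47401,  "S": 752336},
}
after = {
    "datacite":      {"F": 19769, "O": 175703},
    "searchindexer": {"F": 103,   "O": 799737},
}
for q in before:
    # after the update, O should equal the old O + S, and F is untouched
    assert after[q]["O"] == before[q]["O"] + before[q]["S"]
    assert after[q]["F"] == before[q]["F"]
```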

diag-queue-stats after batch delete of binder records:

{
  "download": {},
  "binder": {},
  "datacite": {
    "F": 19769,
    "O": 175722
  },
  "crossref": {
    "F": 572,
    "I": 181536,
    "O": 2441,
    "W": 6
  },
  "searchindexer": {
    "F": 103,
    "O": 799756
  }
}

Dump files:

-rw-r--r--. 1 ezid ezid 10626904 Oct 21 14:18 ezidapp_binderqueue_table_dump_20241021_141823.sql
-rw-r--r--. 1 ezid ezid  9851380 Oct 21 14:18 ezidapp_crossrefqueue_table_dump_20241021_141823.sql
-rw-r--r--. 1 ezid ezid 11784727 Oct 21 14:18 ezidapp_datacitequeue_table_dump_20241021_141823.sql
-rw-r--r--. 1 ezid ezid 47093377 Oct 21 14:18 ezidapp_searchindexerqueue_table_dump_20241021_141823.sql
jsjiang commented 1 week ago

Production implementation:

Binder queue

Crossref queue:

DataCite queue:

SearchIndexer queue:

RefIdentifier table:

Queue stats before:

ezid@uc3-ezidui-prd01:05:08:11:~/ezid$ python manage.py diag-queue-stats
{
  "download": {},
  "binder": {
    "F": 6,
    "O": 4015090,
    "S": 7993776,
    "U": 669913
  },
  "datacite": {
    "F": 1565,
    "O": 5367787,
    "S": 7992232,
    "U": 28
  },
  "crossref": {
    "F": 3878,
    "I": 13330253,
    "O": 16969,
    "U": 28,
    "W": 99
  },
  "searchindexer": {
    "F": 54,
    "O": 5466911,
    "S": 9337175
  }
}

Dump files:

ezid@uc3-ezidui-prd01:05:17:53:~/tmp/ezid_sql_dump_files$ ls -l
total 3107532
-rw-r--r--. 1 ezid ezid 759329912 Oct 24 05:15 ezidapp_binderqueue_table_dump_20241024_051443.sql
-rw-r--r--. 1 ezid ezid 725560660 Oct 24 05:15 ezidapp_crossrefqueue_table_dump_20241024_051443.sql
-rw-r--r--. 1 ezid ezid 805112285 Oct 24 05:16 ezidapp_datacitequeue_table_dump_20241024_051443.sql
-rw-r--r--. 1 ezid ezid 892086020 Oct 24 05:16 ezidapp_searchindexerqueue_table_dump_20241024_051443.sql

Delete binder queue records:

Update queue status:

jsjiang commented 1 week ago

Queue stats after:

{
  "download": {},
  "binder": {},
  "datacite": {
    "F": 1565,
    "O": 13350075
  },
  "crossref": {
    "F": 3878,
    "I": 13320253,
    "O": 16903,
    "W": 99
  },
  "searchindexer": {
    "F": 54,
    "O": 14793602
  }
}

Binder queue table stats after batch delete:

Binder queue table stats after running the TRUNCATE table ezidapp_binderqueue; command:

Note: it would probably have been better to use "TRUNCATE TABLE" instead of "DELETE FROM table" initially. The delete took about 50 minutes while truncate took only a few seconds, and truncate released disk space immediately.

jsjiang commented 1 week ago

Lessons learned:

  1. "TRUNCATE TABLE" is more efficient than "DELETE FROM table" in our use case, since we do not need the rollback capability that DELETE provides.
  2. Stopping services on the EZID instance with the sudo cdlsysctl stop service command might have triggered the daemon agent to restart them. Check with Ashley on the relationships among the Nagios settings, the deployment configuration, and the cdlsysctl command.
jsjiang commented 1 week ago

Questions:

jsjiang commented 2 days ago

Deployed proc-cleanup-async-queues_v2.py on ezid-prd on Oct 30 with release v3.2.27.