Open jsjiang opened 4 months ago
RDS performance captured for 2024-08-18 with avg latency=761051.65 and rows examined= 34701719.00
Sample query:
SELECT `ezidapp_identifier` . `id` , `ezidapp_identifier` . `identifier` FROM `ezidapp_identifier` WHERE ( ( `ezidapp_identifier` . `identifier` LIKE BINARY ? OR `ezidapp_identifier` . `identifier` LIKE BINARY ? OR `ezidapp_identifier` . `identifier` LIKE BINARY ? ) AND `ezidapp_identifier` . `createTime` <= ? ) LIMIT ?
SELECT `ezidapp_identifier`.`id`, `ezidapp_identifier`.`identifier` FROM `ezidapp_identifier` WHERE ((`ezidapp_identifier`.`identifier` LIKE BINARY 'ark:/99999/fk4%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.5072/FK2%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.15697/%') AND `ezidapp_identifier`.`createTime` <= 1722806703) LIMIT 100
Performance improvement options:
Estimated rows to scan without ID range limit: 17,307,222
EXPLAIN SELECT `ezidapp_identifier`.`id`, `ezidapp_identifier`.`identifier`
FROM `ezidapp_identifier`
WHERE
((`ezidapp_identifier`.`identifier` LIKE BINARY 'ark:/99999/fk4%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.5072/FK2%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.15697/%')
AND `ezidapp_identifier`.`createTime` <= 1729203858)
LIMIT 100;
Estimated rows to scan with ID range limit: 312,734
EXPLAIN SELECT `ezidapp_identifier`.`id`, `ezidapp_identifier`.`identifier`
FROM `ezidapp_identifier`
WHERE id > 49800000
and
((`ezidapp_identifier`.`identifier` LIKE BINARY 'ark:/99999/fk4%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.5072/FK2%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.15697/%')
AND `ezidapp_identifier`.`createTime` <= 1729203858)
LIMIT 100;
Created proc-expunge_v2.py
in branch 677_refactor_expunge
def get_id_range_by_time(self, time_range: Q):
first_id = last_id = None
if time_range is not None:
queryset = (
ezidapp.models.identifier.Identifier.objects
.filter(time_range).only("id").order_by("pk")
)
first_record = queryset.first()
last_record = queryset.last()
if first_record is not None:
first_id = first_record.id
if last_record is not None:
last_id = last_record.id
Testing refactored script on ezid-stg in branch 677_refactor_expunge
in the ~/tmp/ezid-cleanup-async-queue
folder:
# between 2024-09-01 and 2024-10-10
# 37939481 - 37942174
# 37939481
select *, from_unixtime(createtime) from ezidapp_identifier
where createTime between unix_timestamp('2024-09-01')
and unix_timestamp('2024-10-10')
order by id asc limit 1;
select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-09-01') and unix_timestamp('2024-10-10') order by id desc limit 1;
select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-10-10') and unix_timestamp('2024-11-01') order by id asc limit 1;
select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-10-10') and unix_timestamp('2024-11-01') order by id desc limit 1;
select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-11-01') and unix_timestamp('2024-11-04') order by id asc limit 1;
select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-11-01') and unix_timestamp('2024-11-04') order by id desc limit 1;
select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-11-04') and unix_timestamp('2024-11-13') order by id asc limit 1;
SELECT ezidapp_identifier
.id
, ezidapp_identifier
.identifier
, from_unixtime(createtime)
FROM ezidapp_identifier
WHERE (ezidapp_identifier
.identifier
LIKE BINARY 'ark:/99999/fk4%'
OR ezidapp_identifier
.identifier
LIKE BINARY 'doi:10.5072/FK2%'
OR ezidapp_identifier
.identifier
LIKE BINARY 'doi:10.15697/%')
AND (id between 37939481 and 37942174 );
SELECT ezidapp_identifier
.id
, ezidapp_identifier
.identifier
, from_unixtime(createtime)
FROM ezidapp_identifier
WHERE (ezidapp_identifier
.identifier
LIKE BINARY 'ark:/99999/fk4%'
OR ezidapp_identifier
.identifier
LIKE BINARY 'doi:10.5072/FK2%'
OR ezidapp_identifier
.identifier
LIKE BINARY 'doi:10.15697/%')
AND (id between 37942238 and 37942241 );
SELECT ezidapp_identifier
.id
, ezidapp_identifier
.identifier
, from_unixtime(createtime)
FROM ezidapp_identifier
WHERE (ezidapp_identifier
.identifier
LIKE BINARY 'ark:/99999/fk4%'
OR ezidapp_identifier
.identifier
LIKE BINARY 'doi:10.5072/FK2%'
OR ezidapp_identifier
.identifier
LIKE BINARY 'doi:10.15697/%')
AND (id between 37942271 and 37942286 );
SELECT ezidapp_identifier
.id
, ezidapp_identifier
.identifier
, from_unixtime(createtime)
FROM ezidapp_identifier
WHERE (ezidapp_identifier
.identifier
LIKE BINARY 'ark:/99999/fk4%'
OR ezidapp_identifier
.identifier
LIKE BINARY 'doi:10.5072/FK2%'
OR ezidapp_identifier
.identifier
LIKE BINARY 'doi:10.15697/%')
AND (id between 37942287 and 37942343);
Test output for date range from 2024-09-01 to 2024-10-10 :
ezid@uc3-ezidui-stg01:12:52:18:~/tmp/ezid-cleanup-async-queue$ python manage.py proc-expunge --created_range_from 2024-09-01 --created_range_to 2024-10-10 --debug DEBUG ezidapp.management.commands.proc-expunge 139909123888960: Entering run loop... INFO ezidapp.management.commands.proc-expunge 139909123888960: Initial time range: between: 2024-09-01 and 2024-10-10, (AND: ('createTimegte', 1725174000), ('createTimelte', 1728543600)) INFO ezidapp.management.commands.proc-expunge 139909123888960: Initial ID range: 37939481 : 37942174 INFO ezidapp.management.commands.proc-expunge 139909123888960: filter: (AND: (OR: ('identifierstartswith', 'ark:/99999/fk4'), ('identifierstartswith', 'doi:10.5072/FK2'), ('identifierstartswith', 'doi:10.15697/')), ('idgte', 37939481), ('idlte', 37942174)) INFO ezidapp.management.commands.proc-expunge 139909123888960: Query returned 0 records. INFO ezidapp.management.commands.proc-expunge 139909123888960: Finished time range: between: 2024-09-01 and 2024-10-10, (AND: ('createTimegte', 1725174000), ('createTime__lte', 1728543600))
Test output for date range from 2024-10-10 to 2024-11-01:
ezid@uc3-ezidui-stg01:12:53:29:~/tmp/ezid-cleanup-async-queue$ python manage.py proc-expunge --created_range_from 2024-10-10 --created_range_to 2024-11-01 --debug DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Entering run loop... INFO ezidapp.management.commands.proc-expunge 140262250100544: Initial time range: between: 2024-10-10 and 2024-11-01, (AND: ('createTimegte', 1728543600), ('createTimelte', 1730444400)) INFO ezidapp.management.commands.proc-expunge 140262250100544: Initial ID range: 37942238 : 37942245 INFO ezidapp.management.commands.proc-expunge 140262250100544: filter: (AND: (OR: ('identifierstartswith', 'ark:/99999/fk4'), ('identifierstartswith', 'doi:10.5072/FK2'), ('identifierstartswith', 'doi:10.15697/')), ('idgte', 37942238), ('idlte', 37942245)) INFO ezidapp.management.commands.proc-expunge 140262250100544: Query returned 4 records. DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Creating new RefIdentifier: DEBUG ezidapp.management.commands.proc-expunge 140262250100544: id = 37942242 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: datacenter = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: profile = erc DEBUG ezidapp.management.commands.proc-expunge 140262250100544: owner = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: ownergroup = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: createTime = 1730248653 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: updateTime = 1730248653 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: status = P DEBUG ezidapp.management.commands.proc-expunge 140262250100544: unavailableReason = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: exported = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefStatus = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefMessage = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: target = https://google.com DEBUG ezidapp.management.commands.proc-expunge 140262250100544: cm = b'' DEBUG ezidapp.management.commands.proc-expunge 140262250100544: metadata = {'erc.who': 'test ark who', 'erc.what': 'test ark what', 'erc.when': '2024'} DEBUG ezidapp.management.commands.proc-expunge 140262250100544: agentRole = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: isTest = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: identifier = ark:/99999/fk4xh14z30 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Creating new RefIdentifier: DEBUG ezidapp.management.commands.proc-expunge 140262250100544: id = 37942243 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: datacenter = CDL.CDL DEBUG ezidapp.management.commands.proc-expunge 140262250100544: profile = datacite DEBUG ezidapp.management.commands.proc-expunge 140262250100544: owner = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: ownergroup = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: createTime = 1730248662 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: updateTime = 1730248662 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: status = P DEBUG ezidapp.management.commands.proc-expunge 140262250100544: unavailableReason = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: exported = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefStatus = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefMessage = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: target = https://google.com DEBUG ezidapp.management.commands.proc-expunge 140262250100544: cm = b'' DEBUG ezidapp.management.commands.proc-expunge 140262250100544: metadata = {'datacite.title': 'test title', 'datacite.creator': 'test creator', 'datacite.publisher': 'test publisher', 'datacite.resourcetype': 'Book', 'datacite.publicationyear': '2024'} DEBUG ezidapp.management.commands.proc-expunge 140262250100544: agentRole = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: isTest = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: identifier = doi:10.5072/FK2HT2SF25 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Creating new RefIdentifier: DEBUG ezidapp.management.commands.proc-expunge 140262250100544: id = 37942244 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: datacenter = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: profile = erc DEBUG ezidapp.management.commands.proc-expunge 140262250100544: owner = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: ownergroup = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: createTime = 1730248711 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: updateTime = 1730248711 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: status = P DEBUG ezidapp.management.commands.proc-expunge 140262250100544: unavailableReason = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: exported = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefStatus = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefMessage = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: target = https://google.com DEBUG ezidapp.management.commands.proc-expunge 140262250100544: cm = b'' DEBUG ezidapp.management.commands.proc-expunge 140262250100544: metadata = {'erc.who': 'test ark who', 'erc.what': 'test ark what', 'erc.when': '2024'} DEBUG ezidapp.management.commands.proc-expunge 140262250100544: agentRole = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: isTest = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: identifier = ark:/99999/fk4sr0f48w DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Creating new RefIdentifier: DEBUG ezidapp.management.commands.proc-expunge 140262250100544: id = 37942245 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: datacenter = CDL.CDL DEBUG ezidapp.management.commands.proc-expunge 140262250100544: profile = datacite DEBUG ezidapp.management.commands.proc-expunge 140262250100544: owner = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: ownergroup = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: createTime = 1730248718 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: updateTime = 1730248718 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: status = P DEBUG ezidapp.management.commands.proc-expunge 140262250100544: unavailableReason = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: exported = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefStatus = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefMessage = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: target = https://google.com DEBUG ezidapp.management.commands.proc-expunge 140262250100544: cm = b'' DEBUG ezidapp.management.commands.proc-expunge 140262250100544: metadata = {'datacite.title': 'test title', 'datacite.creator': 'test creator', 'datacite.publisher': 'test publisher', 'datacite.resourcetype': 'Book', 'datacite.publicationyear': '2024'} DEBUG ezidapp.management.commands.proc-expunge 140262250100544: agentRole = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: isTest = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: identifier = doi:10.5072/FK2D222M9D INFO ezidapp.management.commands.proc-expunge 140262250100544: Finished time range: between: 2024-10-10 and 2024-11-01, (AND: ('createTimegte', 1728543600), ('createTime__lte', 1730444400))
Test with default date range (started on 11/18, 2:20pm)
ezid@uc3-ezidui-stg01:14:20:27:~/tmp/ezid-cleanup-async-queue$ screen python manage.py proc-expunge
INFO ezidapp.management.commands.proc-expunge 139877519812416: Initial time range: between: 2024-11-01T15:20:44 and 2024-11-04T14:20:44, (AND: ('createTime__gte', 1730499644), ('createTime__lte', 1730758844))
INFO ezidapp.management.commands.proc-expunge 139877519812416: Initial ID range: 37942271 : 37942286
INFO ezidapp.management.commands.proc-expunge 139877519812416: filter: (AND: (OR: ('identifier__startswith', 'ark:/99999/fk4'), ('identifier__startswith', 'doi:10.5072/FK2'), ('identifier__startswith', 'doi:10.15697/')), ('id__gte', 37942271), ('id__lte', 37942286))
INFO ezidapp.management.commands.proc-expunge 139877519812416: Query returned 16 records.
INFO ezidapp.management.commands.proc-expunge 139877519812416: Finished time range: between: 2024-11-01T15:20:44 and 2024-11-04T14:20:44, (AND: ('createTime__gte', 1730499644), ('createTime__lte', 1730758844))
INFO ezidapp.management.commands.proc-expunge 139877519812416: Sleep 86400 sec before running next batch.
The
proc-expunge
background job is designed to expunge expired test identifier. The core SQL query runs very slow. Refactor the process to improve performance.Options:
proc-expunge
job starts to delete expired identifier when the job is started on EZID. It takes about an hour for the job to finish one round of check. The job then goes to sleep for 24 hours before it starts another round of check. So the start time for deleting expired identifiers shifts about an hour each day.