CDLUC3 / ezid

CDLUC3 ezid
MIT License
10 stars 4 forks source link

Refactor proc-expunge #677

Open jsjiang opened 4 months ago

jsjiang commented 4 months ago

The proc-expunge background job is designed to expunge expired test identifier. The core SQL query runs very slow. Refactor the process to improve performance.

Options:

jsjiang commented 3 months ago

RDS performance captured for 2024-08-18 with avg latency=761051.65 and rows examined= 34701719.00

expunge-2024-08-18

Sample query:


SELECT `ezidapp_identifier` . `id` , `ezidapp_identifier` . `identifier` FROM `ezidapp_identifier` WHERE ( ( `ezidapp_identifier` . `identifier` LIKE BINARY ? OR `ezidapp_identifier` . `identifier` LIKE BINARY ? OR `ezidapp_identifier` . `identifier` LIKE BINARY ? ) AND `ezidapp_identifier` . `createTime` <= ? ) LIMIT ?

SELECT `ezidapp_identifier`.`id`, `ezidapp_identifier`.`identifier` FROM `ezidapp_identifier` WHERE ((`ezidapp_identifier`.`identifier` LIKE BINARY 'ark:/99999/fk4%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.5072/FK2%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.15697/%') AND `ezidapp_identifier`.`createTime` <= 1722806703) LIMIT 100
jsjiang commented 2 weeks ago

Performance improvement options:

Estimated rows to scan without ID range limit: 17,307,222

EXPLAIN SELECT `ezidapp_identifier`.`id`, `ezidapp_identifier`.`identifier` 
FROM `ezidapp_identifier`
WHERE
((`ezidapp_identifier`.`identifier` LIKE BINARY 'ark:/99999/fk4%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.5072/FK2%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.15697/%') 
AND `ezidapp_identifier`.`createTime` <= 1729203858) 
LIMIT 100;

Estimated rows to scan with ID range limit: 312,734

EXPLAIN SELECT `ezidapp_identifier`.`id`, `ezidapp_identifier`.`identifier` 
FROM `ezidapp_identifier`
WHERE  id > 49800000 
and 
((`ezidapp_identifier`.`identifier` LIKE BINARY 'ark:/99999/fk4%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.5072/FK2%' OR `ezidapp_identifier`.`identifier` LIKE BINARY 'doi:10.15697/%') 
AND `ezidapp_identifier`.`createTime` <= 1729203858) 
LIMIT 100;
jsjiang commented 1 week ago

Created proc-expunge_v2.py in branch 677_refactor_expunge

def get_id_range_by_time(self, time_range: Q):
        first_id = last_id = None

        if time_range is not None:
            queryset = (
                ezidapp.models.identifier.Identifier.objects
                .filter(time_range).only("id").order_by("pk")
            )

            first_record = queryset.first()
            last_record = queryset.last()

            if first_record is not None:
                first_id = first_record.id

            if last_record is not None:
                last_id = last_record.id
jsjiang commented 1 day ago

Testing refactored script on ezid-stg in branch 677_refactor_expunge in the ~/tmp/ezid-cleanup-async-queue folder:

37942174

select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-09-01') and unix_timestamp('2024-10-10') order by id desc limit 1;

2024-10-10 - 2024-11-01

37942238 - 37942241

37942238

select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-10-10') and unix_timestamp('2024-11-01') order by id asc limit 1;

37942241

select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-10-10') and unix_timestamp('2024-11-01') order by id desc limit 1;

2024-11-01 - 2024-11-04

37942271 - 37942286

37942271

select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-11-01') and unix_timestamp('2024-11-04') order by id asc limit 1;

37942286

select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-11-01') and unix_timestamp('2024-11-04') order by id desc limit 1;

37942287 - 37942343, 2024-11-04 - 2024-11-13

37942287

select *, from_unixtime(createtime) from ezidapp_identifier where createTime between unix_timestamp('2024-11-04') and unix_timestamp('2024-11-13') order by id asc limit 1;

9/1 - 10/10: 0

SELECT ezidapp_identifier.id, ezidapp_identifier.identifier, from_unixtime(createtime) FROM ezidapp_identifier WHERE (ezidapp_identifier.identifier LIKE BINARY 'ark:/99999/fk4%' OR ezidapp_identifier.identifier LIKE BINARY 'doi:10.5072/FK2%' OR ezidapp_identifier.identifier LIKE BINARY 'doi:10.15697/%') AND (id between 37939481 and 37942174 );

10/10 - 11/1: 4

SELECT ezidapp_identifier.id, ezidapp_identifier.identifier, from_unixtime(createtime) FROM ezidapp_identifier WHERE (ezidapp_identifier.identifier LIKE BINARY 'ark:/99999/fk4%' OR ezidapp_identifier.identifier LIKE BINARY 'doi:10.5072/FK2%' OR ezidapp_identifier.identifier LIKE BINARY 'doi:10.15697/%') AND (id between 37942238 and 37942241 );

11/1 - 11/4: 16

SELECT ezidapp_identifier.id, ezidapp_identifier.identifier, from_unixtime(createtime) FROM ezidapp_identifier WHERE (ezidapp_identifier.identifier LIKE BINARY 'ark:/99999/fk4%' OR ezidapp_identifier.identifier LIKE BINARY 'doi:10.5072/FK2%' OR ezidapp_identifier.identifier LIKE BINARY 'doi:10.15697/%') AND (id between 37942271 and 37942286 );

57, 11/4 - 11/13

SELECT ezidapp_identifier.id, ezidapp_identifier.identifier, from_unixtime(createtime) FROM ezidapp_identifier WHERE (ezidapp_identifier.identifier LIKE BINARY 'ark:/99999/fk4%' OR ezidapp_identifier.identifier LIKE BINARY 'doi:10.5072/FK2%' OR ezidapp_identifier.identifier LIKE BINARY 'doi:10.15697/%') AND (id between 37942287 and 37942343);


Test output for date range  from 2024-09-01  to 2024-10-10 :

ezid@uc3-ezidui-stg01:12:52:18:~/tmp/ezid-cleanup-async-queue$ python manage.py proc-expunge --created_range_from 2024-09-01 --created_range_to 2024-10-10 --debug DEBUG ezidapp.management.commands.proc-expunge 139909123888960: Entering run loop... INFO ezidapp.management.commands.proc-expunge 139909123888960: Initial time range: between: 2024-09-01 and 2024-10-10, (AND: ('createTimegte', 1725174000), ('createTimelte', 1728543600)) INFO ezidapp.management.commands.proc-expunge 139909123888960: Initial ID range: 37939481 : 37942174 INFO ezidapp.management.commands.proc-expunge 139909123888960: filter: (AND: (OR: ('identifierstartswith', 'ark:/99999/fk4'), ('identifierstartswith', 'doi:10.5072/FK2'), ('identifierstartswith', 'doi:10.15697/')), ('idgte', 37939481), ('idlte', 37942174)) INFO ezidapp.management.commands.proc-expunge 139909123888960: Query returned 0 records. INFO ezidapp.management.commands.proc-expunge 139909123888960: Finished time range: between: 2024-09-01 and 2024-10-10, (AND: ('createTimegte', 1725174000), ('createTime__lte', 1728543600))


Test output for date range  from 2024-10-10  to 2024-11-01:

ezid@uc3-ezidui-stg01:12:53:29:~/tmp/ezid-cleanup-async-queue$ python manage.py proc-expunge --created_range_from 2024-10-10 --created_range_to 2024-11-01 --debug DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Entering run loop... INFO ezidapp.management.commands.proc-expunge 140262250100544: Initial time range: between: 2024-10-10 and 2024-11-01, (AND: ('createTimegte', 1728543600), ('createTimelte', 1730444400)) INFO ezidapp.management.commands.proc-expunge 140262250100544: Initial ID range: 37942238 : 37942245 INFO ezidapp.management.commands.proc-expunge 140262250100544: filter: (AND: (OR: ('identifierstartswith', 'ark:/99999/fk4'), ('identifierstartswith', 'doi:10.5072/FK2'), ('identifierstartswith', 'doi:10.15697/')), ('idgte', 37942238), ('idlte', 37942245)) INFO ezidapp.management.commands.proc-expunge 140262250100544: Query returned 4 records. DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Creating new RefIdentifier: DEBUG ezidapp.management.commands.proc-expunge 140262250100544: id = 37942242 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: datacenter = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: profile = erc DEBUG ezidapp.management.commands.proc-expunge 140262250100544: owner = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: ownergroup = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: createTime = 1730248653 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: updateTime = 1730248653 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: status = P DEBUG ezidapp.management.commands.proc-expunge 140262250100544: unavailableReason = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: exported = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefStatus = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefMessage = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: target = https://google.com DEBUG ezidapp.management.commands.proc-expunge 140262250100544: cm = b'' DEBUG ezidapp.management.commands.proc-expunge 140262250100544: metadata = {'erc.who': 'test ark who', 'erc.what': 'test ark what', 'erc.when': '2024'} DEBUG ezidapp.management.commands.proc-expunge 140262250100544: agentRole = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: isTest = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: identifier = ark:/99999/fk4xh14z30 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Creating new RefIdentifier: DEBUG ezidapp.management.commands.proc-expunge 140262250100544: id = 37942243 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: datacenter = CDL.CDL DEBUG ezidapp.management.commands.proc-expunge 140262250100544: profile = datacite DEBUG ezidapp.management.commands.proc-expunge 140262250100544: owner = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: ownergroup = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: createTime = 1730248662 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: updateTime = 1730248662 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: status = P DEBUG ezidapp.management.commands.proc-expunge 140262250100544: unavailableReason = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: exported = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefStatus = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefMessage = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: target = https://google.com DEBUG ezidapp.management.commands.proc-expunge 140262250100544: cm = b'' DEBUG ezidapp.management.commands.proc-expunge 140262250100544: metadata = {'datacite.title': 'test title', 'datacite.creator': 'test creator', 'datacite.publisher': 'test publisher', 'datacite.resourcetype': 'Book', 'datacite.publicationyear': '2024'} DEBUG ezidapp.management.commands.proc-expunge 140262250100544: agentRole = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: isTest = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: identifier = doi:10.5072/FK2HT2SF25 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Creating new RefIdentifier: DEBUG ezidapp.management.commands.proc-expunge 140262250100544: id = 37942244 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: datacenter = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: profile = erc DEBUG ezidapp.management.commands.proc-expunge 140262250100544: owner = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: ownergroup = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: createTime = 1730248711 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: updateTime = 1730248711 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: status = P DEBUG ezidapp.management.commands.proc-expunge 140262250100544: unavailableReason = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: exported = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefStatus = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefMessage = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: target = https://google.com DEBUG ezidapp.management.commands.proc-expunge 140262250100544: cm = b'' DEBUG ezidapp.management.commands.proc-expunge 140262250100544: metadata = {'erc.who': 'test ark who', 'erc.what': 'test ark what', 'erc.when': '2024'} DEBUG ezidapp.management.commands.proc-expunge 140262250100544: agentRole = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: isTest = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: identifier = ark:/99999/fk4sr0f48w DEBUG ezidapp.management.commands.proc-expunge 140262250100544: Creating new RefIdentifier: DEBUG ezidapp.management.commands.proc-expunge 140262250100544: id = 37942245 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: datacenter = CDL.CDL DEBUG ezidapp.management.commands.proc-expunge 140262250100544: profile = datacite DEBUG ezidapp.management.commands.proc-expunge 140262250100544: owner = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: ownergroup = None DEBUG ezidapp.management.commands.proc-expunge 140262250100544: createTime = 1730248718 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: updateTime = 1730248718 DEBUG ezidapp.management.commands.proc-expunge 140262250100544: status = P DEBUG ezidapp.management.commands.proc-expunge 140262250100544: unavailableReason = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: exported = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefStatus = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: crossrefMessage = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: target = https://google.com DEBUG ezidapp.management.commands.proc-expunge 140262250100544: cm = b'' DEBUG ezidapp.management.commands.proc-expunge 140262250100544: metadata = {'datacite.title': 'test title', 'datacite.creator': 'test creator', 'datacite.publisher': 'test publisher', 'datacite.resourcetype': 'Book', 'datacite.publicationyear': '2024'} DEBUG ezidapp.management.commands.proc-expunge 140262250100544: agentRole = DEBUG ezidapp.management.commands.proc-expunge 140262250100544: isTest = True DEBUG ezidapp.management.commands.proc-expunge 140262250100544: identifier = doi:10.5072/FK2D222M9D INFO ezidapp.management.commands.proc-expunge 140262250100544: Finished time range: between: 2024-10-10 and 2024-11-01, (AND: ('createTimegte', 1728543600), ('createTime__lte', 1730444400))


Test with default date range (started on 11/18, 2:20pm)

ezid@uc3-ezidui-stg01:14:20:27:~/tmp/ezid-cleanup-async-queue$ screen python manage.py proc-expunge

INFO ezidapp.management.commands.proc-expunge 139877519812416: Initial time range: between: 2024-11-01T15:20:44 and 2024-11-04T14:20:44, (AND: ('createTime__gte', 1730499644), ('createTime__lte', 1730758844))
INFO ezidapp.management.commands.proc-expunge 139877519812416: Initial ID range: 37942271 : 37942286
INFO ezidapp.management.commands.proc-expunge 139877519812416: filter: (AND: (OR: ('identifier__startswith', 'ark:/99999/fk4'), ('identifier__startswith', 'doi:10.5072/FK2'), ('identifier__startswith', 'doi:10.15697/')), ('id__gte', 37942271), ('id__lte', 37942286))
INFO ezidapp.management.commands.proc-expunge 139877519812416: Query returned 16 records.
INFO ezidapp.management.commands.proc-expunge 139877519812416: Finished time range: between: 2024-11-01T15:20:44 and 2024-11-04T14:20:44, (AND: ('createTime__gte', 1730499644), ('createTime__lte', 1730758844))
INFO ezidapp.management.commands.proc-expunge 139877519812416: Sleep 86400 sec before running next batch.