Open jsjiang opened 2 months ago
The proc-link-checker-update job runs everyday starting from 1am and lasts 2-3 hours. This process is resource intensive on the RDS. We have noticed resource bottlenecks everyday between 1:30-3:30am
The core SQL statements which is created by the _harvest
function to retrieve records from the searchIdentifier
and the linkChecker
table may need refactoring to improve performance.
Refactor options:
list()
- use list will read in DB output to local (EC2) memorylen()
functionCurrent code:
def _harvest(self, model, only=None, filter=None):
lastIdentifier = ""
while not self.terminated():
qs = model.objects.filter(identifier__gt=lastIdentifier).order_by("identifier")
if only is not None:
qs = qs.only(*only)
qs = list(qs[: self.opt.pagesize])
if len(qs) == 0:
break
for o in qs:
if filter is None or filter(o):
yield o
lastIdentifier = qs[-1].identifier
yield None
started proc-link-checker-update a few times on ezid-dev with default page size (100,000)
started proc-link-checker-update on ezid-dev with a smaller page size (10,000)
screen ./manage.py proc-link-checker-update --pagesize 10000 --debug >/ezid/logs/background_jobs/link_checker_update_size_10000.log 2>&1
Testing updated proc-link-checker-update on ezid-dev with a smaller default page size (10,000)
Testing updated proc-link-checker-update on ezid-dev with page size removed
Testing updated proc-link-checker-update on ezid-dev pagesize=10K
pagesize=1K
pagesize=100K
Update proc-link-checker-update
Conclusion:
Updated code with default pagesize=10K and without list:
def _harvest(self, model, only=None, filter=None):
lastIdentifier = ""
batch = 0
while not self.terminated():
qs = model.objects.filter(identifier__gt=lastIdentifier).order_by("identifier")
if only is not None:
qs = qs.only(*only)
qs = qs[: self.opt.pagesize]
identifier = None
for o in qs:
if filter is None or filter(o):
identifier = o.identifier
yield o
if identifier is None:
break
else:
lastIdentifier = identifier
batch += 1
print(f"#### batch: {batch}, lastID: {lastIdentifier}")
yield None
Run log with pagesize=10k,
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7941/S9NK89
Processing %s doi:10.7941/S9NK89
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7941/S9T92T
Processing %s doi:10.7941/S9T92T
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7941/S9Z03B
Processing %s doi:10.7941/S9Z03B
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/320620140320
Processing %s doi:10.7942/320620140320
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2008W
Processing %s doi:10.7942/C2008W
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C21T0F
Processing %s doi:10.7942/C21T0F
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C25H2B
Processing %s doi:10.7942/C25H2B
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2894Z
Processing %s doi:10.7942/C2894Z
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C29660
Processing %s doi:10.7942/C29660
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2D08M
Processing %s doi:10.7942/C2D08M
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2F08X
Processing %s doi:10.7942/C2F08X
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2H59N
Processing %s doi:10.7942/C2H59N
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2JQ02
Processing %s doi:10.7942/C2JQ02
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2MW25
Processing %s doi:10.7942/C2MW25
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2PH20
Processing %s doi:10.7942/C2PH20
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2RP43
Processing %s doi:10.7942/C2RP43
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2WC7C
Processing %s doi:10.7942/C2WC7C
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/GOSHIP-EASYOCEAN
Processing %s doi:10.7942/GOSHIP-EASYOCEAN
#### batch: 2763, lastID: doi:10.7942/GOSHIP-EASYOCEAN
DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Closing DB connections and sleeping for 36341.09s...
Run log with pagesize=1K:
Processing %s doi:10.7942/C29660
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Processing doi:10.7942/C2D08M
Processing %s doi:10.7942/C2D08M
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Processing doi:10.7942/C2F08X
Processing %s doi:10.7942/C2F08X
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Processing doi:10.7942/C2H59N
Processing %s doi:10.7942/C2H59N
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Processing doi:10.7942/C2JQ02
Processing %s doi:10.7942/C2JQ02
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Processing doi:10.7942/C2MW25
Processing %s doi:10.7942/C2MW25
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Processing doi:10.7942/C2PH20
Processing %s doi:10.7942/C2PH20
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Processing doi:10.7942/C2RP43
Processing %s doi:10.7942/C2RP43
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Processing doi:10.7942/C2WC7C
Processing %s doi:10.7942/C2WC7C
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Processing doi:10.7942/GOSHIP-EASYOCEAN
Processing %s doi:10.7942/GOSHIP-EASYOCEAN
#### batch: 27624, lastID: doi:10.7942/GOSHIP-EASYOCEAN
DEBUG ezidapp.management.commands.proc-link-checker-update 140500543045632: Closing DB connections and sleeping for 72261.23s...
Run log with pagesize=100k:
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C2894Z
Processing %s doi:10.7942/C2894Z
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C29660
Processing %s doi:10.7942/C29660
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C2D08M
Processing %s doi:10.7942/C2D08M
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C2F08X
Processing %s doi:10.7942/C2F08X
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C2H59N
Processing %s doi:10.7942/C2H59N
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C2JQ02
Processing %s doi:10.7942/C2JQ02
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C2MW25
Processing %s doi:10.7942/C2MW25
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C2PH20
Processing %s doi:10.7942/C2PH20
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C2RP43
Processing %s doi:10.7942/C2RP43
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/C2WC7C
Processing %s doi:10.7942/C2WC7C
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Processing doi:10.7942/GOSHIP-EASYOCEAN
Processing %s doi:10.7942/GOSHIP-EASYOCEAN
#### batch: 277, lastID: doi:10.7942/GOSHIP-EASYOCEAN
DEBUG ezidapp.management.commands.proc-link-checker-update 140031651635200: Closing DB connections and sleeping for 14965.73s...
Retesting the original script proc-link-checker-update on ezid-dev with default page size (100,000) pagesize=100K
pagesize=10K
pagesize=1K
Run log pagesize=100K
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C21T0F
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C25H2B
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C2894Z
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C29660
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C2D08M
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C2F08X
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C2H59N
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C2JQ02
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C2MW25
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C2PH20
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C2RP43
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/C2WC7C
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Processing doi:10.7942/GOSHIP-EASYOCEAN
DEBUG ezidapp.management.commands.proc-link-checker-update-org 139893872689152: Closing DB connections and sleeping for 83420.41s...
Run log for pagesize=10K
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C29660
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2D08M
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2F08X
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2H59N
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2JQ02
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2MW25
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2PH20
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2RP43
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2WC7C
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/GOSHIP-EASYOCEAN
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Closing DB connections and sleeping for 49098.98s...
page=1K log:
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C21T0F
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C25H2B
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2894Z
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C29660
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2D08M
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2F08X
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2H59N
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2JQ02
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2MW25
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2PH20
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2RP43
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2WC7C
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/GOSHIP-EASYOCEAN
DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Closing DB connections and sleeping for 24267.43s...
Summary of findings: specs were taken from the .librato RDS metrics current script: extract pagesize (default=100K) records once into memory, then iterating through each record; update script: use RDS and Django default setting to handle records retrieving and iterating
Sample Queries used by the proc-link-checker-update job :
SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`owner_id`, `ezidapp_searchidentifier`.`createTime`, `ezidapp_searchidentifier`.`isTest`, `ezidapp_searchidentifier`.`hasMetadata` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/pqww0100' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 10000
SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`linkIsBroken` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/zxwg0138' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 100000
Link-checker-update performance on 2024-08-21 (with default pagesize 100K):
The 1st Query group with Avg latency 3145.41, examined rows 9996.81:
SELECT `ezidapp_searchidentifier` . `id` , `ezidapp_searchidentifier` . `identifier` , `ezidapp_searchidentifier` . `owner_id` , `ezidapp_searchidentifier` . `createTime` , `ezidapp_searchidentifier` . `isTest` , `ezidapp_searchidentifier` . `hasMetadata` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier` . `identifier` > ? ORDER BY `ezidapp_searchidentifier` . `identifier` ASC LIMIT ?
SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`owner_id`, `ezidapp_searchidentifier`.`createTime`, `ezidapp_searchidentifier`.`isTest`, `ezidapp_searchidentifier`.`hasMetadata` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/qrjk0127' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 10000
The 2nd query group with Ave latency 30154.94, examined rows 99493.59
SELECT `ezidapp_searchidentifier` . `id` , `ezidapp_searchidentifier` . `identifier` , `ezidapp_searchidentifier` . `linkIsBroken` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier` . `identifier` > ? ORDER BY `ezidapp_searchidentifier` . `identifier` ASC LIMIT ?
SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`linkIsBroken` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/yxcl0265' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 100000
Active sessions are reaching the maximum CPU (Max vCPUs =2) while the proc-link-checker-update job are running. The 1st query group is in purple and the 2nd query group is in orange.
Link-checker-update performance by query (PRD), 8/29 1-4am PDT
Sample queries: link-checker-upd_1_purple.sql
SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`owner_id`, `ezidapp_searchidentifier`.`createTime`, `ezidapp_searchidentifier`.`isTest`, `ezidapp_searchidentifier`.`hasMetadata` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/fpjg0058' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 10000
link-checker-upd_2_org.sql
SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`linkIsBroken` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/yyfp0198' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 100000
link-checker-upd_3_light_purple.sql
SELECT `ezidapp_linkchecker`.`id`, `ezidapp_linkchecker`.`identifier`, `ezidapp_linkchecker`.`numFailures` FROM `ezidapp_linkchecker` WHERE `ezidapp_linkchecker`.`identifier` > 'ark:/88122/nqlb0020' ORDER BY `ezidapp_linkchecker`.`identifier` ASC LIMIT 100000
link-checker-upd_4_pink.sql
UPDATE `ezidapp_linkchecker` SET `identifier` = 'ark:/28722/k2fx7pt6j', `owner_id` = 152, `target` = 'http://opencontext.org/subjects/668a4cb8-e1f0-415c-82a8-f3f6edf1108b', `lastCheckTime` = 1724923248, `numFailures` = 0, `isBad` = 0, `returnCode` = 200, `error` = '', `mimeType` = 'text/html; charset=utf-8', `size` = 476640, `hash` = '7c672ebfc0088e736ba3ee0426b1f647' WHERE `ezidapp_linkchecker`.`id` = 28697528
Smaller pagesize (10K) performed better:
RDS metrics
Read IOPS is lower : 1.4-3k vs 1.3-2.5K
Disk queue depth is lower: 0.8-4.5 vs 0.8-1.5
Database performance:
Calls/sec improved from: 0.01 to 0.14
Avg latency decreased from 47450 to 4519
See detailed log in Google doc: Test Link-checker-update report on EZID-STG
See detailed log in Google Drive: Link-Checker-Update_Prd_Performance_Comparison
Analyzed the Metrics dashboard in RDS Performance Insights on 2024-09-09 with default page size=10K
CUP utilization (percent): total is over 90% (very high)
EBS IO operations (per second): not reached to limit
Note from AWS on CPU utilization: Examine the os.cpuUtilization.nice.avg metric in Performance Insights. If this value is far less than the CPU usage, nondatabase processes are the main contributor to CPU. (https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/apg-waits.cpu.html)
SQL query processing on EC2 might be the nondatabase processes that have contributed to high CPU wait on RDS. Get statistics on ezid-stg for analysis.
New findings:
the purple one is from proc-stats
; here is a sample query:
SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`,
`ezidapp_searchidentifier`.`owner_id`, `ezidapp_searchidentifier`.`createTime`,
`ezidapp_searchidentifier`.`isTest`, `ezidapp_searchidentifier`.`hasMetadata` FROM
`ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/fpjg0058' ORDER
BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 10000
the orange one is from proc-link-checker-update
; here is a sample query:
SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`,
`ezidapp_searchidentifier`.`linkIsBroken` FROM `ezidapp_searchidentifier` WHERE
`ezidapp_searchidentifier`.`identifier` > 'ark:/88122/yyfp0198' ORDER BY
`ezidapp_searchidentifier`.`identifier` ASC LIMIT 100000
Options:
Schedule the link-checker-update report to run from 3pm
The start time is defined by the settings.LINKCHECKER_RESULTS_UPLOAD_CYCLE
which is 3600 sec (1 hour) after midnight. If we want to start the job at 3pm, the LINKCHECKER_RESULTS_UPLOAD_CYCLE
value should be set to 15 x3600=54,000 sec.
The function that used to calculate time to sleep is _sameTimeOfDayDelta
self.resultsUploadCycle = django.conf.settings.LINKCHECKER_RESULTS_UPLOAD_CYCLE
def _sameTimeOfDayDelta(self):
now = datetime.datetime.now()
midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
d = self.resultsUploadCycle - (now - midnight).total_seconds()
if d < 0:
d += 86400
return d
The proc-link-checker-update background job is memory intensive on the RDS server. The core SQL query is identified as one of the long running queries. Refactor the process to make the SQL queries more efficient. Or consider to use OpenSearch as data sources