CDLUC3 / ezid

CDLUC3 ezid
MIT License
11 stars 4 forks source link

Refactor proc-link-checker-update #676

Open jsjiang opened 2 months ago

jsjiang commented 2 months ago

The proc-link-checker-update background job is memory intensive on the RDS server. The core SQL query is identified as one of the long running queries. Refactor the process to make the SQL queries more efficient. Or consider to use OpenSearch as data sources

jsjiang commented 1 month ago

The proc-link-checker-update job runs everyday starting from 1am and lasts 2-3 hours. This process is resource intensive on the RDS. We have noticed resource bottlenecks everyday between 1:30-3:30am

The core SQL statements which is created by the _harvest function to retrieve records from the searchIdentifier and the linkChecker table may need refactoring to improve performance.

Refactor options:

Current code:

    def _harvest(self, model, only=None, filter=None):
        lastIdentifier = ""
        while not self.terminated():
            qs = model.objects.filter(identifier__gt=lastIdentifier).order_by("identifier")
            if only is not None:
                qs = qs.only(*only)
            qs = list(qs[: self.opt.pagesize])
            if len(qs) == 0:
                break
            for o in qs:
                if filter is None or filter(o):
                    yield o
            lastIdentifier = qs[-1].identifier
        yield None
jsjiang commented 1 month ago

started proc-link-checker-update a few times on ezid-dev with default page size (100,000)

started proc-link-checker-update on ezid-dev with a smaller page size (10,000)

screen ./manage.py proc-link-checker-update --pagesize 10000 --debug >/ezid/logs/background_jobs/link_checker_update_size_10000.log 2>&1
jsjiang commented 1 month ago

Testing updated proc-link-checker-update on ezid-dev with a smaller default page size (10,000)

jsjiang commented 1 month ago

Testing updated proc-link-checker-update on ezid-dev with page size removed

jsjiang commented 1 month ago

Testing updated proc-link-checker-update on ezid-dev pagesize=10K

pagesize=1K

pagesize=100K

Update proc-link-checker-update

Conclusion:

Updated code with default pagesize=10K and without list:

   def _harvest(self, model, only=None, filter=None):
        lastIdentifier = ""
        batch = 0
        while not self.terminated():
            qs = model.objects.filter(identifier__gt=lastIdentifier).order_by("identifier")
            if only is not None:
                qs = qs.only(*only)
            qs = qs[: self.opt.pagesize]

            identifier = None
            for o in qs:
                if filter is None or filter(o):
                    identifier = o.identifier
                    yield o
            if identifier is None:
                break
            else:
                lastIdentifier = identifier
                batch += 1
                print(f"#### batch: {batch}, lastID: {lastIdentifier}")
        yield None

Run log with pagesize=10k,

   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7941/S9NK89
Processing %s doi:10.7941/S9NK89
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7941/S9T92T
Processing %s doi:10.7941/S9T92T
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7941/S9Z03B
Processing %s doi:10.7941/S9Z03B
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/320620140320
Processing %s doi:10.7942/320620140320
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2008W
Processing %s doi:10.7942/C2008W
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C21T0F
Processing %s doi:10.7942/C21T0F
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C25H2B
Processing %s doi:10.7942/C25H2B
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2894Z
Processing %s doi:10.7942/C2894Z
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C29660
Processing %s doi:10.7942/C29660
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2D08M
Processing %s doi:10.7942/C2D08M
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2F08X
Processing %s doi:10.7942/C2F08X
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2H59N
Processing %s doi:10.7942/C2H59N
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2JQ02
Processing %s doi:10.7942/C2JQ02
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2MW25
Processing %s doi:10.7942/C2MW25
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2PH20
Processing %s doi:10.7942/C2PH20
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2RP43
Processing %s doi:10.7942/C2RP43
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/C2WC7C
Processing %s doi:10.7942/C2WC7C
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Processing doi:10.7942/GOSHIP-EASYOCEAN
Processing %s doi:10.7942/GOSHIP-EASYOCEAN
#### batch: 2763, lastID: doi:10.7942/GOSHIP-EASYOCEAN
   DEBUG ezidapp.management.commands.proc-link-checker-update 139832254058496: Closing DB connections and sleeping for 36341.09s...

Run log with pagesize=1K:

Run log with pagesize=100k:

jsjiang commented 1 month ago

Retesting the original script proc-link-checker-update on ezid-dev with default page size (100,000) pagesize=100K

pagesize=10K

pagesize=1K

Run log for pagesize=10K

   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C29660
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2D08M
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2F08X
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2H59N
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2JQ02
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2MW25
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2PH20
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2RP43
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/C2WC7C
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Processing doi:10.7942/GOSHIP-EASYOCEAN
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140632845991936: Closing DB connections and sleeping for 49098.98s...

page=1K log:

  DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C21T0F
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C25H2B
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2894Z
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C29660
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2D08M
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2F08X
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2H59N
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2JQ02
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2MW25
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2PH20
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2RP43
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/C2WC7C
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Processing doi:10.7942/GOSHIP-EASYOCEAN
   DEBUG ezidapp.management.commands.proc-link-checker-update-org 140153041125376: Closing DB connections and sleeping for 24267.43s...
jsjiang commented 1 month ago

Summary of findings: specs were taken from the .librato RDS metrics current script: extract pagesize (default=100K) records once into memory, then iterating through each record; update script: use RDS and Django default setting to handle records retrieving and iterating

Link checker update performance analysis

jsjiang commented 3 weeks ago

Sample Queries used by the proc-link-checker-update job :

SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`owner_id`, `ezidapp_searchidentifier`.`createTime`, `ezidapp_searchidentifier`.`isTest`, `ezidapp_searchidentifier`.`hasMetadata` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/pqww0100' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 10000
SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`linkIsBroken` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/zxwg0138' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 100000
jsjiang commented 3 weeks ago

Link-checker-update performance on 2024-08-21 (with default pagesize 100K):

link-checker-update-2024-0821

The 1st Query group with Avg latency 3145.41, examined rows 9996.81:

SELECT `ezidapp_searchidentifier` . `id` , `ezidapp_searchidentifier` . `identifier` , `ezidapp_searchidentifier` . `owner_id` , `ezidapp_searchidentifier` . `createTime` , `ezidapp_searchidentifier` . `isTest` , `ezidapp_searchidentifier` . `hasMetadata` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier` . `identifier` > ? ORDER BY `ezidapp_searchidentifier` . `identifier` ASC LIMIT ?

SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`owner_id`, `ezidapp_searchidentifier`.`createTime`, `ezidapp_searchidentifier`.`isTest`, `ezidapp_searchidentifier`.`hasMetadata` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/qrjk0127' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 10000

The 2nd query group with Ave latency 30154.94, examined rows 99493.59

SELECT `ezidapp_searchidentifier` . `id` , `ezidapp_searchidentifier` . `identifier` , `ezidapp_searchidentifier` . `linkIsBroken` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier` . `identifier` > ? ORDER BY `ezidapp_searchidentifier` . `identifier` ASC LIMIT ?

SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`linkIsBroken` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/yxcl0265' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 100000
jsjiang commented 2 weeks ago

Active sessions are reaching the maximum CPU (Max vCPUs =2) while the proc-link-checker-update job are running. The 1st query group is in purple and the 2nd query group is in orange.

Image

jsjiang commented 2 weeks ago

Link-checker-update performance by query (PRD), 8/29 1-4am PDT

Image

Sample queries: link-checker-upd_1_purple.sql

SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`owner_id`, `ezidapp_searchidentifier`.`createTime`, `ezidapp_searchidentifier`.`isTest`, `ezidapp_searchidentifier`.`hasMetadata` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/fpjg0058' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 10000

link-checker-upd_2_org.sql

SELECT `ezidapp_searchidentifier`.`id`, `ezidapp_searchidentifier`.`identifier`, `ezidapp_searchidentifier`.`linkIsBroken` FROM `ezidapp_searchidentifier` WHERE `ezidapp_searchidentifier`.`identifier` > 'ark:/88122/yyfp0198' ORDER BY `ezidapp_searchidentifier`.`identifier` ASC LIMIT 100000

link-checker-upd_3_light_purple.sql

SELECT `ezidapp_linkchecker`.`id`, `ezidapp_linkchecker`.`identifier`, `ezidapp_linkchecker`.`numFailures` FROM `ezidapp_linkchecker` WHERE `ezidapp_linkchecker`.`identifier` > 'ark:/88122/nqlb0020' ORDER BY `ezidapp_linkchecker`.`identifier` ASC LIMIT 100000

link-checker-upd_4_pink.sql

UPDATE `ezidapp_linkchecker` SET `identifier` = 'ark:/28722/k2fx7pt6j', `owner_id` = 152, `target` = 'http://opencontext.org/subjects/668a4cb8-e1f0-415c-82a8-f3f6edf1108b', `lastCheckTime` = 1724923248, `numFailures` = 0, `isBad` = 0, `returnCode` = 200, `error` = '', `mimeType` = 'text/html; charset=utf-8', `size` = 476640, `hash` = '7c672ebfc0088e736ba3ee0426b1f647' WHERE `ezidapp_linkchecker`.`id` = 28697528
jsjiang commented 2 weeks ago

See detailed log in Google doc: Test Link-checker-update report on EZID-STG

jsjiang commented 1 week ago

Image

See detailed log in Google Drive: Link-Checker-Update_Prd_Performance_Comparison

jsjiang commented 6 days ago

Analyzed the Metrics dashboard in RDS Performance Insights on 2024-09-09 with default page size=10K

Note from AWS on CPU utilization: Examine the os.cpuUtilization.nice.avg metric in Performance Insights. If this value is far less than the CPU usage, nondatabase processes are the main contributor to CPU. (https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/apg-waits.cpu.html)

Image

jsjiang commented 6 days ago

SQL query processing on EC2 might be the nondatabase processes that have contributed to high CPU wait on RDS. Get statistics on ezid-stg for analysis.

jsjiang commented 5 days ago

New findings:

  1. The top 2 queries (purple and orange) are from two back ground jobs which are scheduled to start at the same time ( one hour past middle night)
  1. While these two jobs use different strategies to retrieve all records from the searchIdentifier table, the total processing time is very close. This indicates the size of the SearchIdentifier table (or total records count) might be a big contributor to the total processing time.

Options:

jsjiang commented 5 days ago

Schedule the link-checker-update report to run from 3pm

self.resultsUploadCycle = django.conf.settings.LINKCHECKER_RESULTS_UPLOAD_CYCLE

    def _sameTimeOfDayDelta(self):
        now = datetime.datetime.now()
        midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
        d = self.resultsUploadCycle - (now - midnight).total_seconds()
        if d < 0:
            d += 86400
        return d