Investigate and smooth out load spikes on prod worker nodes

atsign-foundation / at_server

The software implementation of Atsign's core technology

https://docs.atsign.com

BSD 3-Clause "New" or "Revised" License


Investigate and smooth out load spikes on prod worker nodes #482

Open gkc opened 2 years ago

gkc commented 2 years ago

Lead: @murali-shris

Describe the bug

There are periodic large load spikes, corresponding with scheduled jobs (compaction, scans, ...), which are straining our worker nodes.

Expected outcome

Shared documented understanding of which scheduled jobs are driving load
Adjusted job schedules to spread load ~evenly over time

murali-shris commented 2 years ago

@cconstab @cpswan Can you please grant me access to the required monitoring dashboards to help me investigate this issue?

murali-shris commented 2 years ago

Two possible causes of the CPU spikes:

Sec check, which runs every 15 mins. It creates an inbound connection to the secondary and runs a scan.
Hive expiry check, a scheduled job on the server which runs every 10 mins. It scans every key in hive and checks for expiry.
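
To illustrate how two fixed-interval jobs like these can stack up, here is a small Python sketch of the arithmetic. It assumes that both jobs start from the same instant and fire on exact multiples of their interval, which nothing in this thread confirms. The function name `coinciding_minutes` is made up for illustration and is not part of at_server.

```python
from math import gcd


def coinciding_minutes(interval_a: int, interval_b: int, horizon: int) -> list[int]:
    """Return the minutes in [0, horizon) at which both jobs fire together.

    Assumption (illustrative only): both jobs start at minute 0 and run on
    exact multiples of their interval, with no drift or jitter.
    """
    if interval_a <= 0 or interval_b <= 0:
        raise ValueError("intervals must be positive")
    # Two periodic jobs coincide every least-common-multiple minutes.
    period = interval_a * interval_b // gcd(interval_a, interval_b)
    return list(range(0, horizon, period))


# 15 min check and 10 min hive expiry check, over a two hour window.
overlaps = coinciding_minutes(15, 10, 120)
print(overlaps)  # [0, 30, 60, 90]
```

Under that assumption the two jobs would land on the same minute every 30 minutes. If the real start times are offset from each other, the overlap pattern would differ.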

murali-shris commented 2 years ago

Merged PR to randomise the hive expiry check: https://github.com/atsign-foundation/at_server/pull/497
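
For readers unfamiliar with the idea, randomising a scheduled job usually means adding a random offset (jitter) to its interval so that many instances do not all run at the same moment. The Python sketch below is a generic illustration of that technique, not the code from the PR above. The function name, the 10 minute base, and the 5 minute jitter window are all assumptions chosen for the example.

```python
import random


def next_delay_seconds(base_minutes: float, max_jitter_minutes: float,
                       rng: random.Random) -> float:
    """Delay until the next run: base interval plus a uniform random offset.

    Hypothetical helper for illustration; the parameters are not taken
    from at_server.
    """
    if base_minutes <= 0 or max_jitter_minutes < 0:
        raise ValueError("base must be positive and jitter non-negative")
    jitter = rng.uniform(0, max_jitter_minutes)
    return (base_minutes + jitter) * 60


# Seeded generator so the example is repeatable.
rng = random.Random(42)
delays = [next_delay_seconds(10, 5, rng) for _ in range(5)]
for d in delays:
    print(f"next run in {d:.0f} s")
```

Each delay falls between 600 and 900 seconds, so instances that started in lockstep would tend to drift apart over successive runs rather than all scanning at once.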

sitaram-kalluri commented 2 years ago

Moving the task to next sprint (PR-30) to validate the performance once the changes are deployed.

murali-shris commented 2 years ago

@cconstab @cpswan Is this issue still occurring in prod? Is there any work to be done in the upcoming sprint related to load spikes?

cconstab commented 2 years ago

I will take a look, or @cpswan will.

cpswan commented 2 years ago

@murali-shris things are a lot better, but I'm still seeing some hourly spikes, so maybe another scheduled job elsewhere in the secondary?
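
One way to narrow down an hourly spike like this is to fold the load samples onto minute-of-hour and look for the bucket that stands out, then compare that minute against the job schedules. The Python sketch below shows the idea on synthetic data only. The sample values are invented for the example and are not measurements from prod, and `mean_load_by_minute` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def mean_load_by_minute(samples):
    """Average load per minute-of-hour.

    samples: iterable of (minutes_since_start, load) pairs.
    Returns a dict mapping minute-of-hour (0..59) to mean load.
    """
    buckets = defaultdict(list)
    for minute, load in samples:
        buckets[minute % 60].append(load)
    return {m: mean(loads) for m, loads in sorted(buckets.items())}


# Synthetic data, NOT real measurements: one sample every 5 minutes for
# three hours, baseline load 1.0 with a spike of 5.0 at the top of each hour.
samples = [(t, 5.0 if t % 60 == 0 else 1.0) for t in range(0, 180, 5)]

profile = mean_load_by_minute(samples)
peak_minute = max(profile, key=profile.get)
print(peak_minute)  # 0
```

With real dashboard exports in place of the synthetic list, the peak bucket would point at the minute of the hour to check against any remaining scheduled jobs.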

ksanty commented 2 years ago

@murali-shris @cpswan should we move priority up to high for PR43 sprint planning?

murali-shris commented 2 years ago

> @murali-shris @cpswan should we move priority up to high for PR43 sprint planning?

Yes @ksanty, we can revisit whether the prod spike still exists.