atsign-foundation / at_server

The software implementation of Atsign's core technology
https://docs.atsign.com
BSD 3-Clause "New" or "Revised" License

Investigate and smooth out load spikes on prod worker nodes #482

Open gkc opened 2 years ago

gkc commented 2 years ago

Lead: @murali-shris

Describe the bug

There are periodic large load spikes that coincide with scheduled jobs (compaction, scans, ...) and are straining our worker nodes.

Expected outcome

murali-shris commented 2 years ago

@cconstab @cpswan Can you please grant me access to the required monitoring dashboards so that I can investigate this issue?

murali-shris commented 2 years ago

Two possible causes of CPU spike

murali-shris commented 2 years ago

Merged the PR to randomise the Hive expiry check: https://github.com/atsign-foundation/at_server/pull/497
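For readers unfamiliar with the technique, randomising a periodic check can be illustrated with a small sketch. This is not the code from the PR above (at_server is written in Dart, and the PR's actual change is not shown in this thread). It is a minimal Python illustration in which the function name, the one-hour interval and the five-minute jitter window are all hypothetical.

```python
import random
from typing import Optional


def jittered_delay(
    base_interval_s: float,
    max_jitter_s: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the delay before the next run of a periodic job.

    The delay is the base interval plus a random offset drawn uniformly
    from [0, max_jitter_s], so that many processes sharing the same base
    interval do not all start their job at the same moment.
    """
    if base_interval_s <= 0:
        raise ValueError("base_interval_s must be positive")
    if max_jitter_s < 0:
        raise ValueError("max_jitter_s must not be negative")
    source = rng if rng is not None else random
    return base_interval_s + source.uniform(0.0, max_jitter_s)


if __name__ == "__main__":
    # Hypothetical values: an hourly check with up to five minutes of jitter.
    for _ in range(3):
        print(round(jittered_delay(3600, 300), 1))
```

With many processes on the same worker node, each one picks its own offset, so the work is spread across the jitter window instead of landing in a single burst.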

sitaram-kalluri commented 2 years ago

Moving the task to the next sprint (PR-30) to validate performance once the changes are deployed.

murali-shris commented 2 years ago

@cconstab @cpswan Is this issue still occurring in prod? Is there any work related to load spikes to be done in the upcoming sprint?

cconstab commented 2 years ago

Either I or @cpswan will take a look.

cpswan commented 2 years ago

@murali-shris things are a lot better, but I'm still seeing some hourly spikes, so maybe another scheduled job elsewhere in the secondary?

ksanty commented 2 years ago

@murali-shris @cpswan should we move priority up to high for PR43 sprint planning?

murali-shris commented 2 years ago

> @murali-shris @cpswan should we move priority up to high for PR43 sprint planning?

Yes @ksanty, we can revisit whether the prod spike still exists.