bcgov / DITP-DevOps

Digital Identity and Trust Program Team's DevOps Documentation Repository
Apache License 2.0

OrgBook - Investigate periodic spikes in search engine CPU usage #150

Closed WadeBarnes closed 8 months ago

WadeBarnes commented 9 months ago

Starting January 3rd, 2024 at 4:20pm, we began to experience CPU spikes to 60% every four hours. These spikes do not appear to be related to query volume, but rather to a specific query pattern that is different from the known and blocked one.

Image

These spikes are continuing to occur on a periodic basis.

Note that the level of the spikes appears reduced in this graph due to the downsampling over the wider time scale: Image

Image

Investigate and identify the query pattern responsible for these spikes and determine if it's a pattern we should be blocking. For the moment this activity is not adversely affecting performance or service levels.

WadeBarnes commented 8 months ago

A closer look at one of the recent spikes: image image

WadeBarnes commented 8 months ago

The queries during these periods appear to be paging up to the limit we allow. I've tightened the expression filtering to restrict this a bit more to see if that helps.
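As a rough illustration of the kind of rule involved, the sketch below rejects requests that page deeper than a cap. The `page` query parameter, the `MAX_PAGE` value, and the `is_allowed` function are assumptions made for this example; they are not the actual OrgBook filter configuration, which is not shown in this issue.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical cap on paging depth; the real limit is not stated in this issue.
MAX_PAGE = 10


def is_allowed(url: str, max_page: int = MAX_PAGE) -> bool:
    """Return False for requests that page deeper than max_page.

    Assumes the page number travels in a `page` query parameter,
    which is an illustrative guess rather than the real API contract.
    """
    query = parse_qs(urlparse(url).query)
    # A request without a page parameter is treated as page 1.
    raw = query.get("page", ["1"])[0]
    try:
        page = int(raw)
    except ValueError:
        return False  # malformed page values are rejected outright
    return 1 <= page <= max_page
```

A filter like this only limits how deep a single request can page; it does not stop a client from repeating shallow pages.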

WadeBarnes commented 8 months ago

The changes helped narrow the window: Image Image

WadeBarnes commented 8 months ago

The load is definitely related to paging queries.

swcurran commented 8 months ago

Almost certainly, these issues are coming from an entity known to BC Gov (I’ll leave off the name), which is trying to use OrgBook as a way to maintain a full list of all BC registered entities. They are meeting with BC Registries tomorrow to talk about how to get the information in other ways. Since they want to keep their database as accurate as possible, I’m sure they will continue to work on how to scrape the data.

WadeBarnes commented 8 months ago

The IP addresses associated with these particular queries are globally distributed, with repeated queries for the same pages, so I'm skeptical that's the source. I've been tracking another query pattern that I suspect is related to what you're talking about, but that query pattern does not use paging, and therefore does not put much load on the search engine. It looks like this:

image
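The distributed paging pattern described above (many IPs repeating requests for the same pages) can be surfaced from access logs with a simple grouping. This is a minimal sketch: the `(ip, url)` record shape, the `min_ips` threshold, and the `repeated_pages` function are hypothetical and do not reflect the actual log format or tooling used here.

```python
from collections import defaultdict


def repeated_pages(records, min_ips=2):
    """Group requests by URL and keep URLs requested by several distinct IPs.

    `records` is an iterable of (ip, url) pairs, an assumed log shape.
    Returns a dict mapping each qualifying URL to the set of client IPs.
    """
    ips_by_url = defaultdict(set)
    for ip, url in records:
        ips_by_url[url].add(ip)
    return {url: ips for url, ips in ips_by_url.items() if len(ips) >= min_ips}


# Made-up sample data using documentation-reserved IP ranges.
sample = [
    ("192.0.2.1", "/search?q=a&page=9"),
    ("198.51.100.7", "/search?q=a&page=9"),
    ("203.0.113.5", "/search?q=a&page=9"),
    ("192.0.2.1", "/search?q=b"),
]
suspects = repeated_pages(sample, min_ips=3)
```

With the sample data, only the paged URL requested from three different addresses is flagged.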

WadeBarnes commented 8 months ago

OrgBook Traffic over the past week: image

Filtering out the traffic from the sync above and one other query traffic pattern I'm following, our typical week looks like this: image

Even when you factor in all of the queries, the load on the search engine is very low. The only spikes we are seeing are from the paging queries, which make up a tiny fraction of the overall traffic. When used "correctly", OrgBook could handle a much higher volume of synchronization traffic. There are also the notification webhooks that can be used to subscribe to change notifications.

WadeBarnes commented 8 months ago

The adjustments to the expression filtering greatly reduced the duration of the CPU spikes, as you can see in the 1D view here. It shows up as a reduction in CPU use because the data is downsampled. image

Zooming in, you can see this is just a reduction in the duration: image image
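The effect is easy to reproduce with made-up numbers. The sketch below assumes the graph downsamples by averaging fixed-size buckets, which is a guess, since the monitoring tool's aggregation method is not stated here. The sample values and the `downsample_mean` function are purely illustrative.

```python
def downsample_mean(samples, bucket_size):
    """Average consecutive samples in fixed-size buckets."""
    buckets = (
        samples[i:i + bucket_size] for i in range(0, len(samples), bucket_size)
    )
    return [sum(bucket) / len(bucket) for bucket in buckets]


# One hour of per-minute CPU readings (illustrative values, not measured data):
# a 5% baseline with a spike to 60%.
long_spike = [60.0] * 20 + [5.0] * 40   # 20-minute spike
short_spike = [60.0] * 5 + [5.0] * 55   # 5-minute spike, same 60% peak

# Collapse each hour into a single averaged point.
long_hourly = downsample_mean(long_spike, 60)
short_hourly = downsample_mean(short_spike, 60)
```

Both series peak at 60%, but the hourly average is roughly 23% for the longer spike and under 10% for the shorter one, so a shorter spike looks like a lower one once averaged.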

The traffic over this period consisted of 167 requests. All but one set of a few queries (which looked to be legitimate) were paging queries from globally distributed IPs.

Interestingly, the filtering updates appear to have deterred some of the unwanted traffic. Though we've seen this traffic drop off and pick up again before, the reduction correlates quite well with the updates to the expression filtering. image

WadeBarnes commented 8 months ago

Additional query patterns have been identified and added to the expression filtering.

WadeBarnes commented 8 months ago

The updates to the expression filters got things under control.