emory-libraries / blacklight-catalog


Investigate spike in CPU usage for Blacklight Prod #1348

Closed abelemlih closed 1 year ago

abelemlih commented 1 year ago

A spike in CPU usage for Blacklight Prod has been causing AWS alarms. In this ticket, I plan to investigate the root cause of the spike and propose solutions to fix the underlying issue.

For reference, the following are Brad's findings regarding the spike:

I’ve been digging into why we are seeing spikes in CPU usage in Blacklight prod. It seems that we have a daily UpdateCollectionAuthorityService process that completely wipes a table containing 330k objects and rebuilds them. This happens every day at 3am server time.

Possible solutions:

  1. Reduce the number of days we run this per week.
  2. Refactor the method to not delete all of the authorities, but instead check for existing authorities that are no longer listed in the Solr response and delete those. To ensure that new collections have an authority created, we can check if the Solr response doesn’t have a match in the authorities.
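
A rough sketch of what option 2 could look like, reduced to plain Ruby so it runs on its own. The helper name `authority_diff` and the use of bare ID strings are assumptions for illustration; the real UpdateCollectionAuthorityService and its authority records may be shaped differently.

```ruby
require "set"

# Hypothetical helper (not part of the application): given the IDs of the
# authorities currently stored and the collection IDs in the Solr response,
# work out which authorities to delete and which to create, instead of
# wiping and rebuilding the whole table.
def authority_diff(existing_ids, solr_ids)
  existing = existing_ids.to_set
  current = solr_ids.to_set
  {
    # Stored authorities that Solr no longer lists
    to_delete: (existing - current).to_a.sort,
    # Collections in Solr that have no authority yet
    to_create: (current - existing).to_a.sort
  }
end

# Example with made-up IDs: "a" has disappeared from Solr, "d" is new.
diff = authority_diff(%w[a b c], %w[b c d])
puts "delete: #{diff[:to_delete].join(', ')}"
puts "create: #{diff[:to_create].join(', ')}"
```

With this approach, a nightly run would only touch the rows that actually changed, which is usually a small fraction of the 330k objects.
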
abelemlih commented 1 year ago

@rotated8 @kbowaterskelly @jcrompton42 I have been looking into the CPU spike issue, and I don’t think there is an underlying issue in the application. From CloudWatch, it’s clear that the spike began suddenly on April 20th and ended on May 2nd. Looking at Honeybadger, I am seeing various errors tied to an online marketing bot. It could be that the bot was scraping our site for an extended period of time.

As for the daily cron job UpdateCollectionAuthorityService, the stats in CloudWatch indicate that the spike started suddenly and was sustained over an extended period of time, which makes it unlikely that the service is the cause. If it were, I believe we would have seen this issue long ago, and it would not have stopped. Currently, there is no spike in CPU usage, which suggests that the scraping by the bot has concluded.