GSA / data.gov

Main repository for the data.gov service
https://data.gov

O+M 2024-06-24 #4799

Open FuhuXia opened 6 days ago

FuhuXia commented 6 days ago

As part of the day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Rather than having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you will be unavailable, please note when in Slack and ask someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Weekly Checklist

Monthly Checklist

Ad-hoc Checklist

Reference

FuhuXia commented 6 days ago

New Relic shows that catalog-web has performed better since https://github.com/GSA/data.gov/issues/4708.

For example:

Transaction time dropped from ~1100 ms to ~400 ms:

Image

Apdex score:

image

Web request timeout percentage dropped from ~1% to ~0.1%:

image

Postgres DB tracking_summary SELECT query throughput:

image

The tracking_summary SELECT query dropped from the top spot to No. 6 in the ranking of most time-consuming queries:

image
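For scale, the approximate figures quoted in this comment can be turned into relative reductions. This is only a rough sketch using the rounded values above (~1100 ms to ~400 ms, ~1% to ~0.1%), not exact New Relic numbers; the helper name `relative_reduction` is made up for illustration.

```python
def relative_reduction(before, after):
    """Return the fractional decrease going from `before` to `after`."""
    return (before - after) / before

# Rounded values taken from the comment above, not exact measurements.
print(f"transaction time: {relative_reduction(1100, 400):.0%}")  # 64%
print(f"timeout rate:     {relative_reduction(1.0, 0.1):.0%}")   # 90%
```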

FuhuXia commented 6 days ago

Set the following harvest source to a manual schedule until its source URL is fixed.

FuhuXia commented 6 days ago

Since June 5, 2024, our harvest agent has been blocked by the Institute of Museum and Library Services' (IMLS) web server, so the harvest source /harvest/imls-json cannot be harvested.

hkdctol commented 6 days ago

Have reached out to the contact addresses I have for IMLS.

FuhuXia commented 4 days ago

Since 2024-06-25 03:50 EDT, Googlebot has been sending nonsense traffic to catalog, doubling both the total requests catalog receives and the catalog-web CPU usage. If this trend continues, we might have to block certain traffic based on the request pattern. Details are in the Slack discussion.
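One way to see which request pattern would be worth blocking is to group the crawler's requests by path. The sketch below is an illustration only: it assumes access logs in the common "combined" format, which may not match what catalog actually emits, and the sample lines, IP addresses, and the `summarize` helper are all made up.

```python
import re
from collections import Counter

# Assumption: NCSA "combined" log format. The real catalog logs may differ.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def summarize(lines, agent_substring="Googlebot"):
    """Count parsed requests, requests from the matching agent, and that agent's paths."""
    total = 0
    matched = 0
    paths = Counter()
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed format
        total += 1
        if agent_substring.lower() in m["agent"].lower():
            matched += 1
            # Drop the query string so similar requests group together.
            paths[m["path"].split("?", 1)[0]] += 1
    return total, matched, paths

# Made-up sample lines using documentation-only IP addresses.
sample = [
    '192.0.2.10 - - [25/Jun/2024:03:50:01 -0400] "GET /dataset?q=abc&page=9 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [25/Jun/2024:03:50:02 -0400] "GET /dataset?q=xyz HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [25/Jun/2024:03:50:03 -0400] "GET /dataset/some-id HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

total, matched, paths = summarize(sample)
print(f"{matched} of {total} requests matched; top paths: {paths.most_common(3)}")
```

A user-agent string can be spoofed, so a count like this only helps pick a candidate pattern. It does not confirm that the traffic really comes from Google.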

FuhuXia commented 4 days ago

Reduced prod catalog-web instances from 5 to 3, and memory from 850M to 800M. The following two PRs save us 2050M of memory.

FuhuXia commented 3 days ago

Change the harvest sources that were paused due to repeated ParentNotHarvestedException errors back to their original schedules.