GSA / data.gov

Main repository for the data.gov service
https://data.gov

O+M 2024-10-14 #4933

Open btylerburton opened 3 days ago

btylerburton commented 3 days ago

As part of the day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers only East Coast business hours. If you are unavailable, please note in Slack when you will be unavailable and ask someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Weekly Checklist

Monthly Checklist

Ad-hoc Checklist

Reference

FuhuXia commented 8 hours ago

From previous observations, we suspected that the Solr memory leak and restarts are related to harvesting activity. Now we have more evidence.

Comparing a week of data on Solr memory usage and harvest activity (from the catalog-fetch log) shows that a spike in harvesting activity is always followed by an increase in Solr memory usage.
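
For anyone who wants to repeat this comparison, here is a rough sketch of the check, not the analysis actually used for the charts. It assumes both metrics have been resampled to the same interval; the function names, the spike threshold, and the sample numbers are all made up for illustration.

```python
from statistics import mean

def spikes(series, factor=2.0):
    """Return indices where a value exceeds `factor` times the series mean."""
    threshold = factor * mean(series)
    return [i for i, v in enumerate(series) if v > threshold]

def followed_by_increase(harvest, memory, lag=1, factor=2.0):
    """For each harvest spike at index i, report whether memory rose by index i + lag."""
    results = []
    for i in spikes(harvest, factor):
        if i + lag < len(memory):
            results.append((i, memory[i + lag] > memory[i]))
    return results

# Hypothetical samples, invented for illustration only (not real Data.gov data).
harvest_jobs = [1, 0, 9, 1, 0, 1, 12, 0]
solr_mem_pct = [40, 41, 42, 55, 56, 56, 57, 70]

print(followed_by_increase(harvest_jobs, solr_mem_pct))
```

Each tuple pairs a spike's index with whether memory was higher one sample later. A longer `lag` may fit better if memory grows slowly after a harvest run.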

So if we can control harvesting activity and restrict it to off hours, we can control when Solr restarts happen and thereby minimize catalog downtime during business hours.
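
One possible shape for the off-hours gate, as a minimal sketch only. The business window (weekdays 09:00 to 17:00 Eastern) is an assumption, `harvest_allowed` is a hypothetical helper rather than an existing catalog or harvester function, and the caller is assumed to pass a time already expressed in Eastern time.

```python
from datetime import datetime, time

# Assumed business window (not stated in this thread): weekdays 09:00-17:00 Eastern.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)

def harvest_allowed(now_eastern: datetime) -> bool:
    """Return True when `now_eastern` falls outside the assumed business window."""
    if now_eastern.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (BUSINESS_START <= now_eastern.time() < BUSINESS_END)

print(harvest_allowed(datetime(2024, 10, 14, 10, 30)))  # Monday mid-morning
print(harvest_allowed(datetime(2024, 10, 14, 22, 0)))   # Monday evening
```

A scheduler could call a check like this before queueing harvest jobs, so that any resulting Solr restart lands outside business hours.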
