NaturalHistoryMuseum / scratchpads2

Scratchpads 2.0
http://scratchpads.org
GNU General Public License v2.0

Analyse why multiple sites had a temporary outage today - sites recovered by themselves - but please investigate for the future #6566

Closed therobyouknow closed 2 years ago

therobyouknow commented 2 years ago

This also happened on Wed 29 Jun. Ongoing issue.

It looks like the Scratchpads site outage is due to a problem with the database servers.

The logs show:

[Wed Jun 29 11:30:38.791916 2022] [php7:notice] [pid 21714] [client 157.140.2.32:36898] PHP Notice: Undefined index: port in /var/aegir/config/includes/databases.inc on line 13, referer: https://vbrant.scratchpads.org/calendar-date/2021-11-12?destination=forum%2F2

There was also an email from monit saying it is trying to restart mysqld on sp-data-03.nhm.ac.uk.
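
To confirm from a web server whether the database host is accepting connections at all, a plain TCP check is often enough. This is only a minimal sketch: the function name `port_is_open` is hypothetical, and port 3306 is the MySQL default, which is an assumption about how sp-data-03 is configured.

```python
import socket


def port_is_open(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    3306 is the MySQL default port; the real server may listen elsewhere (assumption).
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


# Example call (not executed here):
# port_is_open("sp-data-03.nhm.ac.uk")
```

This only shows that something is listening on the port; it does not prove that mysqld can actually serve queries.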

therobyouknow commented 2 years ago

sites back up. Thanks to Ben for helping here.

Will record some findings to help avoid similar outages in future.

therobyouknow commented 2 years ago

From our infrastructure team (TS):

Yesterday [Tuesday 28 June 2022] there was an issue with the NHM storage solution which caused many of our NFS clients to lose access to storage mounts, and as a result the Scratchpads servers would have errored.

The service came back online around 15:30 yesterday and is stable. Please let me know if you have any recent issues.

The above would explain the issue.

I would expect that this is not a recurring issue, and that there isn't anything we as Scratchpads maintainers need to do.
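
If a similar outage does happen again, one quick way to tell whether a storage mount has stopped responding is to probe it with a timeout, since a plain `stat` on a hung NFS mount can block for a long time. A minimal Python sketch, assuming nothing about the real mount points (the function name `mount_responds` and the example path are hypothetical):

```python
import os
import threading


def mount_responds(path, timeout=5.0):
    """Return True if os.stat(path) completes successfully within `timeout` seconds."""
    result = {"ok": False}

    def probe():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            # Path missing, stale file handle, permission error, etc.
            result["ok"] = False

    # Daemon thread: if the stat call hangs on a dead mount,
    # the thread will not keep the interpreter alive at exit.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)
    return result["ok"] and not worker.is_alive()


# Example call with a placeholder path (not executed here):
# mount_responds("/path/to/nfs/mount")
```

The probe runs in a separate thread rather than a subprocess so that the caller gets an answer after `timeout` seconds even when the underlying call never returns.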