CCI-MOC / ops-issues

Create Communication and Status update for downtime on July 10th #1032

Closed joachimweyl closed 1 year ago

joachimweyl commented 1 year ago

Draft

Subject: MOC Alliance outage July 10th - 11th, 2023

We will be upgrading some of our hardware. Most of our services will be down for a large part of July 10th, 2023, and the work may extend into July 11th.

When MOC services are available again, we will post an update on the MOC status website.

Please be aware of the following outages:

  1. NEU Cache Servers
  2. SSO service; you will be unable to authenticate while it is down
  3. Kaizen HIL/BMI service
    1. If your system is booted using BMI, it will need to be powered off, because the iSCSI servers will be powered off and moved.
  4. ESI development servers will be cut off (kumo environment)
  5. ESI production servers
  6. Research Ceph Cluster
  7. Due to network interdependencies, other systems may also be unreachable

Once services are back online, you will be responsible for restarting any containers or other systems.

If you need access to any MOC-hosted data during this outage, please obtain copies of your data before Monday, July 10th. During the outage, the data center will be entirely without power, and access to MOC-hosted services will be impossible.

As always, if you have questions, open a support ticket. The ticketing system will be available throughout the outage.

naved001 commented 1 year ago

Things that will be affected:

  1. While our OpenShift (moc-prod) will be up and running, the SSO service is on the 129.10.5.0/24 network, so people won't be able to authenticate.

  2. The Kaizen HIL/BMI service will be down because it is on the same network, which also means most of their baremetal nodes will become inaccessible.

  3. Access to ESI development servers will be cut off (kumo environment) as the gateway will be down.

  4. Access to the research Ceph cluster will be down (because people access it via the Kaizen HIL/BMI or kumo gateways).

I don't know what other services we offer to users, but it looks like meaningful access to most services will be cut off.

For us, our IPMI network will go down, so we will have no out-of-band access to anything while we move the networks.
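
The network dependency in item 1 can be checked mechanically. What follows is a minimal sketch, not tooling we actually run: only the 129.10.5.0/24 subnet comes from this thread, and the service names and addresses in `services` are hypothetical placeholders.

```python
import ipaddress

# Subnet named in this thread as the one the SSO service sits on.
AFFECTED_NET = ipaddress.ip_network("129.10.5.0/24")


def is_affected(addr: str) -> bool:
    """Return True if addr falls inside the affected subnet."""
    return ipaddress.ip_address(addr) in AFFECTED_NET


# Hypothetical service addresses, for illustration only.
services = {
    "sso-example": "129.10.5.20",
    "other-example": "192.0.2.10",
}

# Names of the services whose address is inside the affected subnet.
affected = sorted(name for name, addr in services.items() if is_affected(addr))
print(affected)
```

Replacing the placeholder entries with a real inventory would give a first pass at the list of affected services, though it would not catch indirect dependencies such as gateways.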

joachimweyl commented 1 year ago

Status page update for this topic.

msdisme commented 1 year ago

@naved001 Is the assumption that it will just be on the 10th?

naved001 commented 1 year ago

@msdisme It depends. Much of the outage will be caused by the public network becoming unavailable. If we move that first, then things will go down in small batches.

joachimweyl commented 1 year ago

@msdisme Updated to include the possibility of the 11th, and to tell them that the status page will be updated when the outage is resolved.

joachimweyl commented 1 year ago

sent at 4:08