CCI-MOC / ops-issues


Should we relocate to utilize UPS for our storage cluster(s)? #216

Open larsks opened 3 years ago

larsks commented 3 years ago

After the latest MGHPCC power outage, our production Ceph cluster is unhealthy and the OpenStack S3 endpoint is inoperative. This isn't the first time a power outage has taken out our storage. Should we invest in a UPS for the storage cluster?
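For triage after an outage like this, a quick scripted probe of the S3 endpoint confirms whether the RGW side is reachable independently of overall cluster health. A minimal sketch using boto3; the endpoint URL and credentials below are hypothetical placeholders, not values from this thread:

```python
import boto3
from botocore.config import Config

# Hypothetical endpoint and credentials -- substitute the real values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.org",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(connect_timeout=5, retries={"max_attempts": 1}),
)

try:
    buckets = s3.list_buckets()["Buckets"]
    print(f"S3 endpoint reachable; {len(buckets)} bucket(s) visible")
except Exception as exc:
    print(f"S3 endpoint inoperative: {exc}")
```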

larsks commented 3 years ago

From slack discussion:

5:44 PM
pjd Re: UPS, the Ceph cluster is in the NU racks, right?
5:44 PM
I could ask about moving it to one of the racks on the other side of the aisle, which have backup power.
5:44 PM
All even # racks in aisle 3 pod B have backup
6:02 PM
larsks Something to talk about on Thursday, I guess. I'm going to be out of the office tomorrow.
6:22 PM
msd +1 (quoting pjd in #coredev, Mar 30th: "All even # racks in aisle 3 pod B have backup")
6:27 PM
naved001 well, all our stuff is in the odd racks. and our ceph cluster is distributed across all racks.
6:27 PM
msd yep
6:36 PM
msd assuming you saw the info on the research ceph Slack channel?
9:59 PM
naved001 I did see that the research ceph is unhealthy, but that will have to wait.
10:55 PM
msd yup
11:36 PM
pjd @naved001 no guarantees, but I’d be willing to do some politicking to see if I could get us half a rack on the even side.
msdisme commented 3 years ago

@pjd-nu any feedback from NEU about swapping cages?

msdisme commented 3 years ago

Need to find out whether there is a way to add UPS capacity to the existing non-UPS racks, and what the cost structure and space impact would be.

msdisme commented 3 years ago

@pjd-nu and @okrieg is the assumption that in the future the researchers will be responsible for managing the research Ceph clusters?

pjd-nu commented 3 years ago

Northeastern has cleared out an even-side rack with backup power that we can use for production Ceph. Ping me if I don't update this with more details...

larsks commented 3 years ago

@pjd-nu I'm pinging you as requested for more details :smile:

naved001 commented 3 years ago

@pjd-nu pinging you again for more details.

We have 10 OSD servers (2U per server), and 3 monitors (1U each). That's a total of 23U.

We would also need a 10G switch with 40G uplinks, and a 1G IPMI switch.

With that, we should be able to migrate all the Ceph nodes to the new rack.
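For reference, a quick rack-unit tally based on the numbers above; the 1U-per-switch figure is an assumption, not something stated in this thread:

```python
# Rack-unit budget for the even-side rack.
osd_u     = 10 * 2  # 10 OSD servers at 2U each = 20U
monitor_u = 3 * 1   # 3 monitors at 1U each = 3U
switch_u  = 2 * 1   # 10G data switch + 1G IPMI switch, assumed 1U each = 2U

total_u = osd_u + monitor_u + switch_u
print(f"{total_u}U total")  # 25U -- slightly more than half of a 42U rack
```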

naved001 commented 3 years ago

Just noting down what we talked about:

We don't have to do any of this during the shutdown. We want to keep things as-is through the shutdown, since we already have a failing Brocade switch and more things may break during the power-off/power-on cycle.
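When the move does happen, the usual low-risk approach is one host at a time with rebalancing suppressed. A rough sketch of the per-host sequence, driving the standard ceph CLI from Python; this is not a procedure agreed on in this thread, and the physical host-move steps are manual:

```python
import subprocess
import time

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout."""
    result = subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    )
    return result.stdout

# Stop CRUSH from rebalancing data while one host is briefly offline.
ceph("osd", "set", "noout")

# ... power off a single OSD host, physically move it to the new rack,
# cable it up, and power it back on ...

# Allow recovery again, then wait for the cluster to settle before
# touching the next host.
ceph("osd", "unset", "noout")
while "HEALTH_OK" not in ceph("health"):
    time.sleep(30)
```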

pjd-nu commented 3 years ago

I've always wanted to try using dual power supplies to move a server without turning it off :-)

msdisme commented 1 year ago

@naved001 can you work with @hakasapl and team to come up with a proposal for an MOC Ceph cluster that we want to maintain post-kaizen shutdown?

msdisme commented 1 year ago

@pjd-nu finally we are ready to explore moving Ceph to the even side. Is it just a single rack, and if so, can you confirm the location? Do we need to get keys added to the MOC keyring?

joachimweyl commented 1 year ago

Not currently an option; pushing to icebox.