CCI-MOC / ops-issues


Should we relocate to utilize UPS for our storage cluster(s)? #216

Open larsks opened 3 years ago

larsks commented 3 years ago

After the latest MGHPCC power outage, our production Ceph cluster is unhealthy and the OpenStack S3 endpoint is inoperative. This isn't the first time a power outage has taken out our storage. Should we invest in a UPS for the storage cluster?
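For triage after an outage like this, a quick scripted probe of the S3 endpoint confirms whether the RGW side is reachable independently of overall cluster health. A minimal sketch using boto3; the endpoint URL and credentials below are hypothetical placeholders, not values from this thread:

```python
import boto3
from botocore.config import Config

# Hypothetical endpoint and credentials -- substitute the real values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.org",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(connect_timeout=5, retries={"max_attempts": 1}),
)

try:
    buckets = s3.list_buckets()["Buckets"]
    print(f"S3 endpoint reachable; {len(buckets)} bucket(s) visible")
except Exception as exc:
    print(f"S3 endpoint inoperative: {exc}")
```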

larsks commented 3 years ago

From slack discussion:

5:44 PM
pjd Re: UPS, the Ceph cluster is in the NU racks, right?
5:44 PM
I could ask about moving it to one of the racks on the other side of the aisle, which have backup power.
5:44 PM
All even # racks in aisle 3 pod B have backup
6:02 PM
larsks Something to talk about on Thursday, I guess. I'm going to be out of the office tomorrow.
6:22 PM
msd +1 (quoting pjd in #coredev, Mar 30th: "All even # racks in aisle 3 pod B have backup")
6:27 PM
naved001 well, all our stuff is in the odd racks. and our ceph cluster is distributed across all racks.
6:27 PM
msd yep
6:36 PM
msd assuming you saw the info on the research ceph Slack channel?
9:59 PM
naved001 I did see that the research ceph is unhealthy, but that will have to wait.
10:55 PM
msd yup
11:36 PM
pjd @naved001 no guarantees, but I’d be willing to do some politicking to see if I could get us half a rack on the even side.
msdisme commented 3 years ago

@pjd-nu any feedback from NEU about swapping cages?

msdisme commented 3 years ago

Need to find out whether there is a way to add UPS capacity to the existing non-UPS racks, and what the cost structure and space impact would be.

msdisme commented 3 years ago

@pjd-nu and @okrieg is the assumption that in the future the researchers will be responsible for managing the research Ceph clusters?

pjd-nu commented 3 years ago

Northeastern has cleared out an even-side rack with backup power that we can use for production Ceph. Ping me if I don't update this with more details...

larsks commented 3 years ago

@pjd-nu I'm pinging you as requested for more details :smile:

naved001 commented 3 years ago

@pjd-nu pinging you again for more details.

We have 10 OSD servers (2U per server), and 3 monitors (1U each). That's a total of 23U.

We would also need a 10G switch with 40G uplinks, and a 1G IPMI switch.

With that, we should be able to migrate all the Ceph nodes to the new rack.
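For reference, a quick rack-unit tally based on the numbers above; the 1U-per-switch figure is an assumption, not something stated in this thread:

```python
# Rack-unit budget for the even-side rack.
osd_u     = 10 * 2  # 10 OSD servers at 2U each = 20U
monitor_u = 3 * 1   # 3 monitors at 1U each = 3U
switch_u  = 2 * 1   # 10G data switch + 1G IPMI switch, assumed 1U each = 2U

total_u = osd_u + monitor_u + switch_u
print(f"{total_u}U total")  # 25U -- slightly more than half of a 42U rack
```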

naved001 commented 3 years ago

Just noting down what we talked about:

We don't have to do any of this during the shutdown. We want to keep things as-is through the shutdown, since we already have a failing Brocade switch and more things may break during the power-off/power-on cycle.
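When the move does happen, the usual low-risk approach is one host at a time with rebalancing suppressed. A rough sketch of the per-host sequence, driving the standard ceph CLI from Python; this is not a procedure agreed on in this thread, and the physical host-move steps are manual:

```python
import subprocess
import time

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout."""
    result = subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    )
    return result.stdout

# Stop CRUSH from rebalancing data while one host is briefly offline.
ceph("osd", "set", "noout")

# ... power off a single OSD host, physically move it to the new rack,
# cable it up, and power it back on ...

# Allow recovery again, then wait for the cluster to settle before
# touching the next host.
ceph("osd", "unset", "noout")
while "HEALTH_OK" not in ceph("health"):
    time.sleep(30)
```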

pjd-nu commented 3 years ago

I've always wanted to try using dual power supplies to move a server without turning it off :-)

msdisme commented 1 year ago

@naved001 can you work with @hakasapl and team to come up with a proposal for an MOC Ceph cluster that we want to maintain post-kaizen shutdown?

msdisme commented 1 year ago

@pjd-nu finally we are ready to explore moving Ceph to the even side. Is it just a single rack, and if so, can you confirm the location? Do we need to get keys added to the MOC keyring?

joachimweyl commented 1 year ago

Not currently an option; pushing to icebox.