SovereignCloudStack / issues

This repository is used for issues that are cross-repository or not bound to a specific repository.
https://github.com/orgs/SovereignCloudStack/projects/6

New deployment and day-2-ops tooling for software-defined storage (Ceph) - ADR #515

Open brueggemann opened 8 months ago

brueggemann commented 8 months ago

As an SCS Operator, I want a well-considered and justified decision on a reliable method to deploy and operate Ceph that replaces ceph-ansible.

Criteria:

Tasks (see decision tracking document for detailed status):

Definition of Done:

Decision tracking document

yeoldegrove commented 8 months ago

For the task "Gather information about reference setups from cloud providers (criteria, migration path)":

Questions to cloud providers and/or customers:

We want to get a better understanding of which Ceph setups, deployed by OSISM, you are currently running. We hope this input helps us decide how to move forward with a possible replacement of ceph-ansible in OSISM.

horazont commented 8 months ago

We are using Rook for all our new Ceph deployments. Previously, we used Ceph-Chef.

Deployment Method: Rook is the natural choice for us because we are already running Kubernetes on bare metal for the YAOOK Operator. It integrates well with YAOOK because both of them use Kubernetes.

In addition, we have had excellent experiences with the performance, maintainability, and reliability of Rook.io clusters, in particular compared to our previous static deployment method (Ceph-Chef).
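
To illustrate that integration point, here is a minimal sketch of reading Ceph health through the Rook CephCluster custom resource with the Kubernetes Python client. The "rook-ceph" namespace and cluster name are the Rook defaults and are assumptions here, not necessarily what this or any particular deployment uses.

```python
# Sketch only: query Ceph health via the Rook CephCluster custom resource.
# Namespace and cluster name are the Rook defaults (assumed, adjust as needed).
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

cluster = api.get_namespaced_custom_object(
    group="ceph.rook.io", version="v1",
    namespace="rook-ceph", plural="cephclusters", name="rook-ceph",
)

status = cluster.get("status", {})
print("phase:", status.get("phase"))                         # e.g. "Ready"
print("ceph health:", status.get("ceph", {}).get("health"))  # e.g. "HEALTH_OK"
```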

All methods have their downsides, and so does the Rook method. In particular:

Version: With Rook, we are running 16.x with the plan to upgrade to 17.x soon-ish, though we are blocked there for non-Ceph and non-Rook reasons.

Hardware: Varying and historically grown, I'd have to look that up. Hit me up via email if you need that information: mailto:jonas.schaefer@cloudandheat.com.

Features: We use RBD exclusively with Rook so far (see above); we intend to enable S3 and Swift frontends once we have implemented support for that (currently, these needs are served by our old Ceph-Chef cluster). We use CephFS in non-bare-metal cases, too.

berendt commented 8 months ago

We use the Quincy release (17.2.6) provided by OSISM 6.0.2 everywhere

We have a single small hyperconverged cluster for a specific customer workload. Otherwise we only use dedicated Ceph clusters. We currently have a single cluster that provides HDD and NVMe SSD as RBDs for Cinder/Nova. In addition, we have a cluster that is used exclusively for RGW and is offered as a Swift and S3 endpoint (integrated with Keystone and Glance, and in the future also with Cinder for backups).
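
For context, the Keystone integration of RGW mentioned here is driven by the standard rgw_keystone_* options. The sketch below only illustrates those options; the endpoint value and the "client.rgw" config target are placeholders, and in an OSISM/ceph-ansible deployment they would normally be set through the deployment tooling rather than ad-hoc commands.

```python
# Sketch only: the standard RGW/Keystone options behind a Swift+S3 endpoint.
# Values and the "client.rgw" config target are placeholders; real deployments
# usually set these via their deployment tooling (ceph-ansible/OSISM).
import subprocess

rgw_keystone_opts = {
    "rgw_keystone_url": "https://keystone.example.com:5000",  # placeholder endpoint
    "rgw_keystone_api_version": "3",
    "rgw_keystone_accepted_roles": "member,admin",
    "rgw_swift_account_in_url": "true",
}

for opt, value in rgw_keystone_opts.items():
    subprocess.run(["ceph", "config", "set", "client.rgw", opt, value], check=True)
```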

At the moment we deploy the control plane on the Ceph OSD nodes and do not have any dedicated nodes for it. We also do not split the data plane and control plane on the network side. The Ceph nodes currently have 2x 100G; the compute nodes have 2x 25G (to become 2x 100G in the future as well). Latencies between the nodes are approx. 0.05 ms (ICMP).

We have a separate pool for each OpenStack service (images, vms, volumes). We have several pools for Cinder so that we can partially separate customers.
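
A small sketch of how such a per-service pool layout can be inspected on a running cluster (assuming admin credentials on the node; the field names follow the JSON output of ceph osd pool ls detail):

```python
# Sketch only: show each pool and its enabled application (rbd, rgw, ...),
# which makes a per-OpenStack-service layout (images, vms, volumes) visible.
import json
import subprocess

out = subprocess.run(
    ["ceph", "osd", "pool", "ls", "detail", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout

for pool in json.loads(out):
    apps = ", ".join(pool.get("application_metadata", {})) or "none"
    print(f"{pool.get('pool_name')}: size={pool.get('size')} applications={apps}")
```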

We use the following services: osd, mon, mgr, rgw, crash. We would also like to take a look at mds in the future in order to be able to offer CephFS via Manila if necessary.

We can share details about hardware and the configuration in full if required.

We do not optimize the systems directly with the Ceph-Ansible part of OSISM, but use the tuned, sysctl and network roles from OSISM for this.

We are satisfied with what we can currently do with OSISM. We would only need more functionality for day-2 operations in the future.

We have also recently added the option of deploying Kubernetes directly on all nodes in OSISM. We are open to both Rook and cephadm. We are currently tending towards Rook, as we believe it is the more consistent step.

fkr commented 7 months ago

@flyersa Can you give feedback as well? I think it would be helpful.

Nils98Ar commented 7 months ago

Maybe we will switch to a hyper-converged setup (compute/storage/maybe network) and 25G interfaces in the future.

frosty-geek commented 7 months ago

Which ceph release are you running?

What is the size of your ceph cluster?

Are Ceph workloads sharing the hardware with other workloads (hyperconverged)? If yes, why?

Are you running multiple pools or even multiple clusters? If yes, why?

Which ceph features/daemons are you using and how are they integrated into OpenStack and/or other services?

Wishlist:

Which hardware are you using (either sizing or specs)? CPU/RAM

HDDs/SSDs/NVMEs(/Controllers)

Are you splitting "OSD setup" and "BlueStore WAL+DB"? → 3x controller nodes running Ceph management components (MONs, MGRs, RGWs, ...), hypervisors running only OSDs (see the sketch at the end of this question list)

NICs/speed/latency

Which Ceph config is deployed by OSISM? Do you mind sharing the actual config/yaml? → Default config shipped with the reference implementation

Which Ceph config is deployed "unknown to" or "on top of" OSISM? e.g. special crush maps, special configs

Would it be nice to have more Ceph features deployable via OSISM?

What is your justified opinion on a new deployment method for Ceph (instead of ceph-ansible)? What about Cephadm? Are you maybe already using it in your current cluster (deployed by hand)? What about Rook? Are you maybe already using it on top of a k8s deployed in OpenStack?
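
As background for the WAL+DB question above, a minimal sketch of what such a split looks like at the ceph-volume level; the device paths are placeholders, and cephadm or Rook would express the same thing declaratively via OSD/drive group specs rather than direct calls.

```python
# Sketch only: BlueStore data on a slow capacity device, DB (and WAL) on a
# fast NVMe device. Device paths are placeholders.
import subprocess

DATA_DEV = "/dev/sdb"       # placeholder: HDD / capacity device
DB_DEV = "/dev/nvme0n1p1"   # placeholder: NVMe partition for BlueStore DB+WAL

subprocess.run(
    ["ceph-volume", "lvm", "prepare", "--bluestore",
     "--data", DATA_DEV, "--block.db", DB_DEV],
    check=True,
)
```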

flyersa commented 7 months ago

Which ceph release are you running?

What is the size of your ceph cluster?

Are Ceph workloads sharing the hardware with other workloads (hyperconverged)? If yes, why?

Are you running multiple pools or even multiple clusters? If yes, why?

Which ceph features/daemons are you using and how are they integrated into OpenStack and/or other services?

Which hardware are you using (either sizing or specs)?

Mainly HPE such as Apollo 4200 or similar

Are you splitting "OSD setup" and "BlueStore WAL+DB"

of course

NICs/speed/latency

2x 40G or 4x 10G, depending on the scenario and expected throughput

Are you splitting Dataplane and Controlplane?

No, monitors and MGRs usually go on the storage nodes

Which Ceph config is deployed by OSISM?

None; we never deploy Ceph with OSISM and use cephadm instead. We have had customer faults in the past where user error damaged Ceph clusters, so we focus on a strong separation between storage and OpenStack.

Would it be nice to have more Ceph features deployable via OSISM?

For others, maybe. As I said, for various operational reasons, storage does not belong in the same system I use to manage my compute resources.

What is your justified opinion on a new deployment method for Ceph (instead of ceph-ansible)?

We should use what is used upstream; for Ceph, that tool is now cephadm, so of course we should use it (see the bootstrap sketch at the end of this comment).

What about Rook?

While Rook adds a lot in terms of fault tolerance and so on, it adds complexity too. I am not a huge fan of Rook: in a CSP environment you usually have dedicated servers (if not HCI) for storage, so there is no need to add a k8s cluster on top of it...
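
For reference on the cephadm path mentioned above, a minimal bootstrap sketch; the monitor IP and the hostname are placeholders, and further hosts and OSDs are then managed through the orchestrator.

```python
# Sketch only: bootstrap a new cluster with cephadm and let the orchestrator
# manage daemons afterwards. Monitor IP and hostname are placeholders.
import subprocess

subprocess.run(["cephadm", "bootstrap", "--mon-ip", "192.0.2.10"], check=True)

# Additional hosts and OSDs are then added via the orchestrator, e.g.:
subprocess.run(["ceph", "orch", "host", "add", "storage-node-01"], check=True)
subprocess.run(["ceph", "orch", "apply", "osd", "--all-available-devices"], check=True)
```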

yeoldegrove commented 7 months ago

Our current decision tracking is done here: https://input.scs.community/3aZ-xdnRS-y11lZkrtAvxw

flyersa commented 7 months ago

Btw, another reason to finally get rid of ceph-ansible... Ever done an upgrade? In the time this crap takes just to upgrade a single monitor, I upgrade complete datacenters to a new Ceph version with cephadm...
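
For comparison, the cephadm-managed upgrade flow referred to here is a single orchestrator command that rolls the whole cluster; the version string below is only an example.

```python
# Sketch only: cephadm/orchestrator upgrade of the whole cluster to a target
# release. The version is an example.
import subprocess

subprocess.run(
    ["ceph", "orch", "upgrade", "start", "--ceph-version", "17.2.6"],
    check=True,
)
subprocess.run(["ceph", "orch", "upgrade", "status"], check=True)  # follow progress
```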