TheJJ / ceph-cheatsheet

All™ you ever wanted to know about operating a Ceph cluster!

Add read balancer info #4

Open JoshSalomon opened 1 week ago

JoshSalomon commented 1 week ago

Hi JJ - great page! I believe it is worth adding information about the read balancer, especially since the Squid version will support OSDs of different sizes. Would you like to work with @ljflores and me on it?

TheJJ commented 1 week ago

sure, what do you have in mind?

JoshSalomon commented 1 week ago

Wondering where to start: Have you heard anything about the read balancer (available since Reef)?

TheJJ commented 6 days ago

yes, i saw the initial presentation slides and wondered how it compares to my balancer. i haven't prioritized remapping just the primaries so far, but i think this can be added, too. i haven't used it in a production cluster yet.

thinking about balancers, it seems the whole crush approach may not be ideal after all: an efficient pg->osd mapping lookup table would probably be suitable for nearly all clusters. then we wouldn't have to fight crush with one hack after another to nudge it toward the desired mapping, and could instead just (re)map pgs directly. a rough sketch of what i mean is below.
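
As a rough illustration of that idea only (purely hypothetical; `PgMap` and its methods are made up for this sketch and are not Ceph's OSDMap structures):

```python
# Hypothetical sketch of a direct pg -> osd lookup table (not Ceph's actual
# OSDMap). The first OSD in each list is treated as the acting primary.
from dataclasses import dataclass, field

@dataclass
class PgMap:
    # pgid (e.g. "1.2f") -> ordered list of OSD ids, index 0 = primary
    table: dict[str, list[int]] = field(default_factory=dict)

    def set_mapping(self, pgid: str, osds: list[int]) -> None:
        self.table[pgid] = list(osds)

    def remap_primary(self, pgid: str, new_primary: int) -> None:
        # purely a metadata change: reorder the list, no data movement needed
        osds = self.table[pgid]
        osds.remove(new_primary)
        osds.insert(0, new_primary)

    def lookup(self, pgid: str) -> list[int]:
        return self.table[pgid]

pgmap = PgMap()
pgmap.set_mapping("1.2f", [4, 7, 2])   # osd.4 is primary
pgmap.remap_primary("1.2f", 7)         # reads would now go to osd.7
print(pgmap.lookup("1.2f"))            # [7, 4, 2]
```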

JoshSalomon commented 4 days ago

Ceph improved read balancer.pdf

If I understand correctly, your balancer is a capacity balancer, not a read balancer - but I only heard your presentation a while ago and did not dive into the code. The read balancer is a pure metadata operation: it does not move data, so it is a completely different approach and is cheap enough to execute continuously (more on this later).

The first version (in Reef) just makes sure that each OSD gets its fair share of primaries (the read balancer works only on replicated pools, so on each OSD we try to place pg_num/replica_num primaries). Obviously we check this against the CRUSH constraints.

In Squid, we added functionality that improves cluster performance when the devices are not all the same size. We added a pool parameter for the read ratio of the IOs to the pool (70 means that 70% of the IOs to the pool are reads and 30% are writes). With this information, we can move more reads to the smaller devices and let the larger devices handle fewer reads, so we try to balance the IOPS per OSD (assuming the devices have the same performance profile).

In the future, we may calculate the read ratio automatically based on metrics and make it an adaptive system for optimal performance (I am not sure this is needed, but it would be easy to implement).

Attached is the presentation explaining the model behind this balancer, with some examples. If you think this is worth mentioning, Laura and I can open a PR with an explanation for this Ceph guide.
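
To make the size-aware idea concrete, here is a simplified, hypothetical sketch of the kind of calculation described above. The load model, function name, and numbers are assumptions made for illustration; this is not the actual read balancer algorithm from the presentation, and it ignores the CRUSH constraint checks mentioned above.

```python
# Illustrative sketch only: balancing primaries by read ratio across OSDs of
# different sizes. NOT the actual Ceph read balancer; the load model is an
# assumption: a read hits only the primary, a write hits every replica, so
#   load_i = read_ratio * primaries_i + (1 - read_ratio) * copies_i
# (in units of ops per PG). We pick primaries_i so load_i is equal on every
# OSD while the primaries still sum to pg_num.

def target_primaries(pg_copies_per_osd: dict[int, int],
                     pg_num: int,
                     read_ratio: float) -> dict[int, float]:
    """Return an ideal (fractional) primary count per OSD under this model."""
    assert 0.0 < read_ratio <= 1.0
    osds = list(pg_copies_per_osd)
    total_copies = sum(pg_copies_per_osd.values())   # = pg_num * replica_num
    # equal per-OSD load K, derived from the constraint sum(primaries_i) == pg_num
    k = (read_ratio * pg_num + (1 - read_ratio) * total_copies) / len(osds)
    targets = {}
    for osd in osds:
        copies = pg_copies_per_osd[osd]
        p = (k - (1 - read_ratio) * copies) / read_ratio
        # clamp to what the OSD actually holds; a real balancer would then
        # redistribute the remainder and round to whole PGs
        targets[osd] = min(max(p, 0.0), copies)
    return targets

# example: osd.2 is half the size of the others and holds fewer PG copies;
# with a read-heavy pool (70% reads) it gets a larger share of primaries
# per PG held, so its lower write load is compensated by more read load
print(target_primaries({0: 48, 1: 48, 2: 24}, pg_num=40, read_ratio=0.7))
```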