SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0

use librbd and librados python bindings #665

Open theanalyst opened 7 years ago

theanalyst commented 7 years ago

We're using the CLI for certain ops; these should be changed to use the Python bindings.

swiftgist commented 7 years ago

That's fine, but I would like to keep the regular commands at least in the comments so that admins have the equivalents when debugging. I have seen some inconsistencies between some ceph commands and their mon_command namesakes.
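For illustration only, a minimal sketch of that pattern with the librados Python bindings, keeping the CLI command as a comment for debugging; the pool-listing command here is just an example, not DeepSea code:

import json
import rados

# CLI equivalent for debugging: ceph osd pool ls --format json
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    cmd = json.dumps({"prefix": "osd pool ls", "format": "json"})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    pools = json.loads(outbuf) if ret == 0 else []
finally:
    cluster.shutdown()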

jschmid1 commented 7 years ago

I'm currently seeing a deadlock in the librbd rbd.RBD.list(ioctx) call..

investigating -> not switching to librbd/librados just yet.

EDIT: overhauled by https://github.com/SUSE/DeepSea/issues/665#issuecomment-335450506
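For reference, the call in question looks roughly like this with the Python bindings (the pool name is a placeholder):

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')   # placeholder pool name
try:
    images = rbd.RBD().list(ioctx)  # the call observed to hang
finally:
    ioctx.close()
    cluster.shutdown()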

jschmid1 commented 7 years ago

The problem is in the underlying librbd, so either calling /usr/bin/rbd or calling librbd (Python bindings) directly will result in the following behavior under certain circumstances.

If your cluster has issues of some sort, like:

Reduced data availability: 512 pgs inactive
Degraded data redundancy: 512 pgs unclean, 512 pgs degraded, 512 pgs undersized

This prevents you from querying any object in the cluster, which is correct and the expected behavior. Being in this state during the deployment phase is not uncommon, I'd say. The issue starts at this point: if you now try to execute stage.0, it internally calls salt-minion.restart (or mine.update with patch #727), which calls cephdisks.list, which calls /usr/bin/rbd -p ls, which will try to make a read() on a metadata object in this pool.

librbd is implemented without any sort of timeout; instead it gets EAGAIN and retries the read over and over until it succeeds.

from strace -p <pid> -f

[..]
[pid 31400] futex(0x5585625c3774, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 567, {1507636431, 391111275}, ffffffff <unfinished ...>
[pid 31397] read(8, "c", 256) = 1
[pid 31397] read(8, 0x7f5a32674320, 256) = -1 EAGAIN (Resource temporarily unavailable)
[pid 31397] sendmsg(27, {msg_name(0)=NULL, msg_iov(1)=[{"\7\32\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\0\177\0\1\0\0\0\0\0\0\0\0\0\0"..., 75}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 75
[pid 31397] epoll_wait(5, <unfinished ...>
[pid 31399] <... epoll_wait resumed> {}, 5000, 30000) = 0
[pid 31399] epoll_wait(22,
[..]

This is overall not critical behavior imo, but it comes with two side effects:

1) Potentially creating lots of long-running, zombie-like processes

2) Stage.0 and mine.update will hang if the cluster is unhealthy, which can create a catch-22

@rjfd might have the answer to this. I totally understand the necessity of having writes that never time out, but why can't we have timeouts for reads?
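One possible client-side mitigation, assuming the rados_osd_op_timeout and client_mount_timeout options actually cover this read path (which would need to be verified), is to set them on the connection before opening the ioctx:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
# Assumption: these options make blocked monitor/OSD operations return an
# error after the given number of seconds instead of retrying forever;
# whether they apply to the rbd ls metadata read needs to be confirmed.
cluster.conf_set('client_mount_timeout', '30')
cluster.conf_set('rados_osd_op_timeout', '30')
cluster.connect()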

rjfd commented 7 years ago

@jschmid1 @swiftgist is there a strong reason for the list of RBD images being served by a Salt mine?

Regarding the read timeout problem, we can deal with this in two ways:

Both approaches are orthogonal to the use of CLI or RBD python bindings.

jschmid1 commented 7 years ago

Check the cluster status before issuing the RBD command; if not HEALTH_OK, no RBD command is run

@rjfd That would also include things like clock skews etc. and could lead to a strange user experience. Is there a general rule of thumb for when a cluster becomes inaccessible? I'm looking at unclean/stale/inactive PGs, but only allowing RBD images to be queried when there are absolutely no issues seems a bit strict.
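For what such a health gate could look like, here is a rough sketch via mon_command; the JSON field names ('status' vs. 'overall_status') vary between Ceph releases, so treat the parsing as an assumption to verify:

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    # CLI equivalent: ceph health --format json
    ret, outbuf, outs = cluster.mon_command(
        json.dumps({"prefix": "health", "format": "json"}), b'')
    health = json.loads(outbuf)
    # Only issue rbd commands when this is True.
    ok = health.get('status', health.get('overall_status')) == 'HEALTH_OK'
finally:
    cluster.shutdown()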

swiftgist commented 7 years ago

@rjfd time. The two slowest operations with real hardware seem to be hwinfo --disk and multiple 'rbd ls' commands. While the latter can be done in Python, I do not think the performance will change dramatically. No user wants to sit and stare at a GUI waiting for some remote job to complete. That was one of the purposes of the Salt mine: to cache results so that subsequent queries return quickly.

However, if the Salt mine is causing its own nightmares, then we can look at handling this differently. I think the Salt mine has helped in most cases for cephdisks. However, when the mines misbehave, it is one of the harder things to diagnose quickly.
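For context, the cached results can be pulled from the Salt master with something like the following; it returns whatever was stored at the last mine.update, so it is fast but only as fresh as the last refresh:

import salt.client

# Query the Salt mine for the cached cephdisks.list data of all minions
# (runs on the Salt master).
local = salt.client.LocalClient()
cached = local.cmd('*', 'mine.get', ['*', 'cephdisks.list'])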

rjfd commented 7 years ago

@swiftgist how frequently are the mines updated?