AcalephStorage / ceph-docker

Docker files and images to run Ceph in containers
Apache License 2.0

OSD device formatting fails randomly #6

Closed hunter closed 8 years ago

hunter commented 8 years ago

When launching an OSD container, ceph-disk prepare fails (seemingly at random). Recovering requires stopping the OSD, zapping the disk, and then preparing it outside of the container.

darkcrux commented 8 years ago

what do you think of a script that repeatedly runs zap and ceph-disk prepare on the devices until they are properly prepared? we do this before running the ceph-osd container. Or would it make sense to add it to the container's script, so that it zaps and retries if prepare fails?

hunter commented 8 years ago

The latter feels a bit like a nasty hack to work around a bug.

Could try a small script that runs and tests the prep though. Perhaps it could be run as an init-container before the OSD starts? (that way we could get access to the config?)

hunter commented 8 years ago

Would be interested to see if 10.2.3 fixes this...

darkcrux commented 8 years ago

This is peculiar behaviour, really. On Ubuntu it never fails, but once ceph-disk is run in a container (on CoreOS), that's when we get the issue. Yeah, maybe 10.2.3 has a fix.

hunter commented 8 years ago

An interesting experiment... instead of using the latest "Ubuntu 14.04" image, we could try the Fedora or Ubuntu 16.04 images (ideally we'd use the same one across all containers... just to avoid any weirdness)

hunter commented 8 years ago

@darkcrux if you get a chance, can you post the logs from when a format fails?

darkcrux commented 8 years ago

oh I lost the logs. but from what I've read, sometimes one of the partitions (the data partition) doesn't get created.
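
For reference, a missing data partition like the one described above could be detected with a small check before the OSD starts. This is only a sketch: the function names, the `LSBLK_CMD` override, and the default of two expected partitions (data plus a journal on the same device) are assumptions for illustration, not something taken from this repository.

```shell
#!/usr/bin/env bash
# Hypothetical check: does the device have the partitions we expect
# after "ceph-disk prepare"? Names and defaults here are assumptions.

# lsblk is kept in a variable so it can be replaced by a stub in tests.
LSBLK_CMD="${LSBLK_CMD:-lsblk}"

# Print the number of partitions lsblk reports under the given device.
count_partitions() {
  local device="$1"
  $LSBLK_CMD -ln -o TYPE "$device" | awk '$1 == "part" { n++ } END { print n + 0 }'
}

# Succeed only if the device has at least the expected number of partitions.
# The default of 2 assumes a data partition plus a journal on the same device.
has_expected_partitions() {
  local device="$1"
  local expected="${2:-2}"
  local found
  found="$(count_partitions "$device")"
  [ "$found" -ge "$expected" ]
}
```

Something like `has_expected_partitions /dev/sdb || echo "prepare looks incomplete"` would then flag a half-prepared disk (the device path is a placeholder). Counting partitions is a coarse signal; it does not confirm that the partitions were actually formatted.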

darkcrux commented 8 years ago

Will get logs again when I get the chance.

hunter commented 8 years ago

@dexter, shall we try turning the formatting into an init-container, so that the rest of the pod won't launch until init is complete?

darkcrux commented 8 years ago

how do we do it? my initial thought was having the formatting done by a k8s job.

hunter commented 8 years ago

The problem with a k8s job is that it won't stop the OSD container from launching while the drive is only half formatted. Adding an init-container (running a small bash script) to the OSD pod, which loops through formatting and zaps on any error until the drive is formatted, should handle launch order better.
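
A minimal sketch of the kind of retry loop described above, assuming the init-container runs `ceph-disk prepare` and `ceph-disk zap` directly. The function name, the `*_CMD` overrides, and the attempt and delay defaults are hypothetical and not taken from this repository.

```shell
#!/usr/bin/env bash
# Hypothetical init-container helper: run "ceph-disk prepare" on a device,
# zapping and retrying on failure until it succeeds or attempts run out.

# Commands are held in variables so they can be replaced by stubs in tests.
ZAP_CMD="${ZAP_CMD:-ceph-disk zap}"
PREPARE_CMD="${PREPARE_CMD:-ceph-disk prepare}"
MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"   # assumed default
RETRY_DELAY="${RETRY_DELAY:-2}"     # seconds between attempts, assumed default

prepare_with_retry() {
  local device="$1"
  local attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    echo "attempt ${attempt}/${MAX_ATTEMPTS}: preparing ${device}"
    if $PREPARE_CMD "$device"; then
      echo "prepared ${device}"
      return 0
    fi
    # Prepare failed: wipe the partial result so the next attempt starts clean.
    echo "prepare failed on ${device}, zapping before retry" >&2
    $ZAP_CMD "$device" || echo "zap failed on ${device}" >&2
    attempt=$((attempt + 1))
    sleep "$RETRY_DELAY"
  done
  echo "giving up on ${device} after ${MAX_ATTEMPTS} attempts" >&2
  return 1
}
```

The init-container's entrypoint could then call `prepare_with_retry /dev/sdb` (placeholder device) and exit non-zero on failure, which keeps the OSD container from starting. Note that this sketch zaps on any prepare failure, so a real script would first need to skip devices that already hold a prepared OSD to avoid wiping data.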

darkcrux commented 8 years ago

yeah. my initial thought was to format everything with a job. just read about init-containers. are they in 1.4?

hunter commented 8 years ago

They're actually part of 1.3 :)

I think they graduated from alpha to beta between 1.3 and 1.4

darkcrux commented 8 years ago

oh. cool. init-container it is then. :)

darkcrux commented 8 years ago

This is ready. just needs one final test (the ceph-daemon container one)