SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0

Stage.3 Hang in Disks.Deploy #1806

Closed JaderGiacon closed 4 years ago

JaderGiacon commented 4 years ago

Description of Issue/Question

ceph.stage.3 hangs in the disks.deploy task.

Setup

1 admin, 3 MON, 6 OSD servers x 61 disks = 366 disks

Steps to Reproduce Issue

Run the deployment with many disks. I have 61 disks in each OSD server (6 OSD servers). The deployment hangs at the end of the 11th disk or the beginning of the 12th.

Versions Report

- deepsea 0.9.23+git.0.6a24f24a0
- SUSE Linux Enterprise 15 SP1, kernel 4.12.14-197.26-default
- SES6
- salt 2019.2.0 (Fluorine), Release: 6.14.1

Details of my tests

I suspected many things (LockPersonality, tuned profiles, etc.). I focused on these because I saw errors in the message logs or found related issues posted here.

In every deployment the system stopped, and the disks.deploy step took more than 10 hours to finish with a wait.process error. On all servers the last log messages were the following (with different OSD numbers, of course):

> [2019-11-27 16:05:32,428][ceph_volume.process][INFO ] Running command: /bin/systemctl start ceph-osd@57
> [2019-11-27 16:05:32,464][ceph_volume.process][INFO ] Running command: /usr/sbin/lvcreate --yes -l 9313 -n osd-data-fc7000dd-c944-4ac6-a7f1-49b1305d41b7 ceph-1de7f9df-aa55-4baf-be07-675c53d4b386

The suspicion was that some limit was hanging the process. At this step, the process creating the disks was (example from one OSD server):

> root 34361 33264 0 16:03 ? 00:00:03 /usr/bin/python3 /usr/sbin/ceph-volume lvm batch --no-auto /dev/nvme0n1 /dev/sda /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak /dev/sdal /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas /dev/sdat /dev/sdau /dev/sdav /dev/sdaw /dev/sdax /dev/sday /dev/sdaz /dev/sdb /dev/sdba /dev/sdbb /dev/sdbc /dev/sdbd /dev/sdbe /dev/sdbf /dev/sdbg /dev/sdbh /dev/sdbi /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz --yes

My checks:

Double check the bug:

I hope this helps everyone, mainly the developers working on a fix.

I do not know what resources are available for testing new versions, but it would be good to test in big environments, if possible.

Thanks! Jader

jschmid1 commented 4 years ago

Hey,

thanks for your report.

If I understood your report correctly, the ceph-volume lvm batch .. will fail due to a default pipe size of 64K and will succeed if the pipe size is increased 'manually' for the process.
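As a quick way to confirm that 64K default, a small Python snippet (just an illustration, not part of DeepSea or ceph-volume) can query the capacity of a freshly created pipe with F_GETPIPE_SZ:

> import fcntl, os
>
> F_GETPIPE_SZ = 1032   # from linux/fcntl.h; older Python versions lack fcntl.F_GETPIPE_SZ
> r, w = os.pipe()
> print("default pipe capacity:", fcntl.fcntl(r, F_GETPIPE_SZ))   # typically 65536 (64K) on Linux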

Have you tried to increase the pipe size globally on your system?

JaderGiacon commented 4 years ago

Hi,

Yes, you understood correctly. I will test the global pipe size today and post the results here.

Thanks

JaderGiacon commented 4 years ago

Hi,

I did not find any way to change the pipe size globally. I do not think it is possible.

In my tests I increased the pipe size with a Perl script as soon as the ceph-volume lvm batch ... process was started:

> # Adapted from http://unix.stackexchange.com/a/353761/119298
> # Usage: ./pipe <path-to-pipe>   (e.g. /proc/<PID>/fd/2)
> use strict;
> use Fcntl;
> my $fifo = shift @ARGV or die "usage: $0 <path-to-pipe>";
> my $size = 1048576;                  # new pipe capacity: 1 MiB
> open(FD, $fifo) or die "cannot open $fifo";
> printf "old size %d\n", fcntl(\*FD, Fcntl::F_GETPIPE_SZ, 0);
> my $new = fcntl(\*FD, Fcntl::F_SETPIPE_SZ, int($size));
> die "failed" if $new < $size;
> printf "new size %d\n", $new;
> 

# During the execution of disks.deploy I did the following
# (34361 is the PID of the ceph-volume lvm batch .. process):
./pipe /proc/34361/fd/2

Thanks

JaderGiacon commented 4 years ago

Hi everyone,

Just an update to help everyone: I have opened a support ticket with SUSE specifically for this problem and passed along all the information they asked for, including this post with its detailed explanation of the problem's source.

They have very likely found the cause in the way DeepSea runs subprocesses. They are discussing how to solve it.
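To illustrate what such a subprocess deadlock typically looks like, here is a minimal sketch (my own illustration, not DeepSea's actual code; the chatty shell command is a hypothetical stand-in for ceph-volume lvm batch on a 61-disk node). The parent redirects the child's stderr into a pipe but only waits on it without reading; once the child has written more than the pipe capacity (64K by default on Linux), its write blocks and both sides hang. Do not run this unattended, it hangs by design:

> import subprocess
>
> # Hypothetical stand-in for a command that writes far more than 64K to stderr,
> # like "ceph-volume lvm batch ..." does on a node with many disks.
> cmd = ["sh", "-c", "i=0; while [ $i -lt 100000 ]; do echo chatter >&2; i=$((i+1)); done"]
>
> proc = subprocess.Popen(cmd, stderr=subprocess.PIPE)
> proc.wait()   # deadlocks: the stderr pipe fills up and the child blocks on write()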

I will post here as soon as I receive information about the fix, or whoever receives it first can post it.

Thanks!

jschmid1 commented 4 years ago

Thanks for the update @JaderGiacon

JaderGiacon commented 4 years ago

Hi everyone,

SUSE was able to identify the deadlock in DeepSea's stderr handling. They have produced a fix and reproduced the issue to test it.

The fix will be included in a DeepSea patch. The changelog should contain the bug number bsc#1158184 once the patch is in.
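For anyone hitting this before the patch lands, the usual remedy for this kind of deadlock is to drain the child's stderr while waiting instead of calling wait() on an unread pipe. Below is a minimal sketch of that general pattern (only an illustration, not the actual DeepSea patch tracked by bsc#1158184), reusing the same hypothetical chatty command as the sketch above:

> import subprocess
>
> # Same hypothetical stand-in for a command that is very chatty on stderr.
> cmd = ["sh", "-c", "i=0; while [ $i -lt 100000 ]; do echo chatter >&2; i=$((i+1)); done"]
>
> proc = subprocess.Popen(cmd, stderr=subprocess.PIPE)
> _, err = proc.communicate()   # reads stderr while waiting, so the pipe never fills
> print("exit code:", proc.returncode, "stderr bytes:", len(err))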

I am closing the thread.

Thanks, everyone, for the help. Jader