Closed JaderGiacon closed 4 years ago
Hey,
thanks for your report.
If I understood your report correctly, the ceph-volume lvm batch ..
fails because of the default pipe size of 64K, and succeeds if the pipe size is increased 'manually' for the process.
Have you tried to increase the pipe size globally on your system?
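As an aside, the 64K default capacity is easy to observe directly; here is a minimal Python sketch (nothing ceph-specific) that fills a fresh pipe with non-blocking writes until the kernel refuses more data:

```python
# Sketch: measure a Linux pipe's default capacity by filling it until
# the kernel refuses further writes.
import os

def pipe_capacity():
    r, w = os.pipe()
    os.set_blocking(w, False)          # a full pipe now raises instead of hanging
    written = 0
    try:
        while True:
            written += os.write(w, b"x" * 4096)
    except BlockingIOError:            # pipe buffer is full
        pass
    finally:
        os.close(r)
        os.close(w)
    return written

print(pipe_capacity())  # typically 65536 (64K) on stock Linux
```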
Hi,
Yes, you understood correctly. I will test the global pipe size today and post the results here.
Thanks
Hi,
I did not find any way to change the default pipe size globally; I do not think it is possible (only the maximum a process may request, /proc/sys/fs/pipe-max-size, is tunable).
In my tests I did it with a Perl script, run as soon as the ceph-volume lvm batch...
process was created:
> # http://unix.stackexchange.com/a/353761/119298
> # Grow the pipe behind the given fifo/proc-fd path to 1 MiB.
> use strict;
> use Fcntl;
> my $fifo = shift @ARGV or die "usage: pipe <path-to-fifo>";
> my $size = 1048576;
> open(FD, '<', $fifo) or die "cannot open $fifo";
> printf "old size %d\n", fcntl(\*FD, Fcntl::F_GETPIPE_SZ, 0);
> my $new = fcntl(\*FD, Fcntl::F_SETPIPE_SZ, int($size));
> die "failed to grow pipe" if $new < $size;
> printf "new size %d\n", $new;
>
# During the execution of disks.deploy I did the following:
# got the PID of the ceph-volume lvm batch .. process, then grew its stderr pipe:
./pipe /proc/34361/fd/2
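The same trick can be sketched in Python. Note that fcntl.F_SETPIPE_SZ and F_GETPIPE_SZ are only exported by the fcntl module from Python 3.10 on, so the Linux-specific numeric values are used as fallbacks:

```python
# Python 3 sketch of the pipe-resize trick: grows the pipe behind a
# /proc/<pid>/fd/<n> path (or a fifo) to 1 MiB.
import fcntl
import os
import sys

# Linux-specific fcntl commands; exported by fcntl only in Python >= 3.10.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

def grow_pipe(path, new_size=1048576):
    fd = os.open(path, os.O_RDONLY)
    try:
        old = fcntl.fcntl(fd, F_GETPIPE_SZ)
        new = fcntl.fcntl(fd, F_SETPIPE_SZ, new_size)
        print(f"old size {old}\nnew size {new}")
        return new
    finally:
        os.close(fd)

if __name__ == "__main__":
    grow_pipe(sys.argv[1])
```

Unprivileged processes can grow a pipe up to /proc/sys/fs/pipe-max-size (1 MiB by default), which is why 1048576 works without root.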
Thanks
Hi everyone,
Just to update and help everyone: I have opened a new support ticket with SUSE specifically for this problem and passed along all the information they asked for, including this post with its detailed explanation of the problem's source.
They very likely found the cause in how DeepSea runs subprocesses, and they are discussing how to solve it.
I will post here as soon as I receive the fix information, or whoever receives it first can.
Thanks!
Thanks for the update @JaderGiacon
Hi everyone,
SUSE was able to identify the deadlock in DeepSea's stderr handling. They have produced a fix and reproduced the issue to verify it.
The fix will ship in a DeepSea patch.
If the patch is in, the changelog should contain the bug number bsc#1158184.
I am closing the thread.
Thanks everyone for help. Jader
Description of Issue/Question
Ceph.stage.3 hangs in the disks.deploy task
Setup
1 admin, 3 MON, 6 OSD servers x 61 disks = 366 disks
Steps to Reproduce Issue
Run the deployment with many disks. I have 61 disks in each OSD server (6 OSD servers). The deployment hangs at the end of the 11th disk or the beginning of the 12th.
Versions Report
deepsea 0.9.23+git.0.6a24f24a0
SUSE Enterprise 15 SP1 - kernel 4.12.14-197.26-default
SES6
salt 2019.2.0 (Fluorine) - Release: 6.14.1
Details of my tests
I suspected many things (LockPersonality, tuned profiles, etc.), because I had error messages in the logs or had found related issues posted here.
In every deployment the system stopped, and the disks.deploy step took more than 10 hours to finish, ending with a wait.process error. On all servers the last log messages were (with different OSD numbers, etc.):
[2019-11-27 16:05:32,428][ceph_volume.process][INFO ] Running command: /bin/systemctl start ceph-osd@57
[2019-11-27 16:05:32,464][ceph_volume.process][INFO ] Running command: /usr/sbin/lvcreate --yes -l 9313 -n osd-data-fc7000dd-c944-4ac6-a7f1-49b1305d41b7 ceph-1de7f9df-aa55-4baf-be07-675c53d4b386
My suspicion was that some limit was hanging the process. At this step, the process creating the disks was (example from one OSD server):
root 34361 33264 0 16:03 ? 00:00:03 /usr/bin/python3 /usr/sbin/ceph-volume lvm batch --no-auto /dev/nvme0n1 /dev/sda /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak /dev/sdal /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas /dev/sdat /dev/sdau /dev/sdav /dev/sdaw /dev/sdax /dev/sday /dev/sdaz /dev/sdb /dev/sdba /dev/sdbb /dev/sdbc /dev/sdbd /dev/sdbe /dev/sdbf /dev/sdbg /dev/sdbh /dev/sdbi /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz --yes
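As a sketch of one way to confirm this kind of hang (assuming Linux's FIONREAD ioctl, which reports the unread bytes queued in a pipe), you can check whether the hung process's stderr pipe (fd 2) is sitting full:

```python
# Sketch: report how many unread bytes are queued in the pipe behind an fd
# of another process, e.g. /proc/34361/fd/2 for a hung ceph-volume's stderr.
import fcntl
import os
import struct
import sys
import termios

def pending_pipe_bytes(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        # FIONREAD fills the buffer with the number of bytes waiting to be read.
        raw = fcntl.ioctl(fd, termios.FIONREAD, struct.pack("i", 0))
        return struct.unpack("i", raw)[0]
    finally:
        os.close(fd)

if __name__ == "__main__":
    print(pending_pipe_bytes(sys.argv[1]))
```

If the reported value stays pinned at the pipe capacity (65536 by default), the child is blocked writing to a full pipe.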
My checks:
Double check the bug:
I hope this helps everyone, especially the developers working on a fix.
I do not know what resources are available for testing new versions, but it would be good to test in big environments, if possible.
Thanks! Jader