SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.

Device discovery not working in KVM environment #1819

Closed rbeldin closed 4 years ago

rbeldin commented 4 years ago

Description of Issue/Question

In order to learn Ceph (SES 6), I set up a cluster of VMs, all running SLES 15 SP1 and registered to SUSE. I can make it through most of the setup, but device discovery seems to be failing. I have 3 OSD nodes, each with 3 drives.

Salt says I don't have drives for data and db.

```
# salt-run disks.report
Found DriveGroup
Calling dg.report on compound target I@roles:storage
No valid json in ceph osd tree. Probably no cluster deployed yet.

ses6-osd1:

  no_data_devices:

      You didn't specify data_devices. No actions will be taken.

ses6-osd2:

  no_data_devices:

      You didn't specify data_devices. No actions will be taken.

ses6-osd3:

  no_data_devices:

      You didn't specify data_devices. No actions will be taken.
```
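
A quick sanity check would be to confirm that the compound target from the report actually resolves to the storage minions. This is just standard Salt compound targeting, shown here as a sketch:

```
# Confirm that I@roles:storage matches the three OSD minions;
# if nothing answers, the storage role is missing from the pillar.
salt -C 'I@roles:storage' test.ping
```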

Executing ceph-volume on the OSD nodes works and shows three available drives on each node:

```
ses6-admin:~ # salt 'ses6-osd*' cmd.run 'ceph-volume inventory'
ses6-osd1:

Device Path               Size         rotates available Model name
/dev/vdb                  1024.00 MB   True    True
/dev/vdc                  1024.00 MB   True    True
/dev/vdd                  1024.00 MB   True    True
/dev/vda                  24.00 GB     True    False

ses6-osd3:

Device Path               Size         rotates available Model name
/dev/vdb                  1024.00 MB   True    True
/dev/vdc                  1024.00 MB   True    True
/dev/vdd                  1024.00 MB   True    True
/dev/vda                  24.00 GB     True    False

ses6-osd2:

Device Path               Size         rotates available Model name
/dev/vdb                  1024.00 MB   True    True
/dev/vdc                  1024.00 MB   True    True
/dev/vdd                  1024.00 MB   True    True
/dev/vda                  24.00 GB     True    False
```
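
If more detail is needed, ceph-volume can also report on a single device in JSON, which includes any rejection reasons. A sketch (the device path is just one of those listed above):

```
# Full per-device report, including whether and why a device is rejected.
salt 'ses6-osd1*' cmd.run 'ceph-volume inventory /dev/vdb --format json'
```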

One strange thing is that I have no profile-* directory anywhere under /srv. Some commands warn about this, but I am not sure how to correct it:

```
# salt-run push.proposal
[WARNING ] profile-default/cluster/*.sls matched no files
[WARNING ] profile-defualt/stack/default/ceph/minions/*yml matched no files
[WARNING ] role-mon/stack/default/ceph/minions/ses6-mon*.yml matched no files
True
```
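
For reference, the proposal directories that policy.cfg refers to live under /srv/pillar/ceph/proposals, so a plain listing shows whether any profile-* directories exist at all (a sketch):

```
# policy.cfg and the role-*/profile-* proposal directories live here.
ls -l /srv/pillar/ceph/proposals
```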

I've started over multiple times by doing:

```
salt-run disengage.safety
salt-run state.orch ceph.purge
```

I end up at the same spot every time - even after recreating the proposal.

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)

My policy.cfg is:

```
cluster-ceph/cluster/*.sls
profile-default/cluster/*.sls
profile-defualt/stack/default/ceph/minions/*yml

config/stack/default/global.yml
config/stack/default/ceph/cluster.yml

# master and admin
role-master/cluster/ses6-admin.sls
role-admin/cluster/ses6-admin.sls

role-mds/cluster/ses6-mds*.sls

role-mon/stack/default/ceph/minions/ses6-mon*.yml
role-mon/cluster/ses6-mon*.sls
role-mgr/cluster/ses6-mon*.sls

role-storage/cluster/ses6-osd*.sls
```

And my global.yml is:

```
master_minion: ses6-admin
subvolume_init: disabled
```

My drive_groups.yml looks like this:

```
ses6-admin:~ # cat /srv/salt/ceph/configuration/files/drive_groups.yml
default:
  target: 'I@roles:storage'
  data_devices:
    all: true
  db_devices:
    all: true
```
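
For comparison, DriveGroup specs can also filter candidate devices instead of taking everything. A hypothetical variant (the size/rotational filter keys are standard DriveGroup filters, but the values here are only illustrative):

```
# Hypothetical drive_groups.yml variant: only consider disks of at least
# 20 GB as data devices, rather than matching everything with `all: true`.
default:
  target: 'I@roles:storage'
  data_devices:
    size: '20GB:'     # lower bound only; 1 GB test disks would not match
    rotational: 1
```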

Steps to Reproduce Issue

(Include debug logs if possible and relevant.)

Versions Report

Maybe this is the problem?

```
ses6-admin:~ # salt-run deepsea-version
'deepsea-version' is not available.
```

```
ses6-admin:~ # rpm -qa | grep deepsea
deepsea-0.9.27+git.0.93a84d2ea-3.9.1.noarch
deepsea-cli-0.9.27+git.0.93a84d2ea-3.9.1.noarch
```

But these are the only two DeepSea RPMs from the repo:

```
ses6-admin:~ # zypper se deepsea*
Refreshing service 'Basesystem_Module_15_SP1_x86_64'.
Refreshing service 'SUSE_Enterprise_Storage_6_x86_64'.
Refreshing service 'SUSE_Linux_Enterprise_Server_15_SP1_x86_64'.
Refreshing service 'Server_Applications_Module_15_SP1_x86_64'.
Loading repository data...
Reading installed packages...

S  | Name        | Summary                                       | Type
---+-------------+-----------------------------------------------+-----------
i+ | deepsea     | Salt solution for deploying and managing Ceph | package
   | deepsea     | Salt solution for deploying and managing Ceph | srcpackage
i+ | deepsea-cli | DeepSea command line                          | package
```

```
ses6-admin:~ # rpm -qi salt-minion
Name        : salt-minion
Version     : 2019.2.0
Release     : 6.21.1
Architecture: x86_64
Install Date: Mon 20 Jan 2020 12:08:20 PM PST
Group       : System/Management
Size        : 41019
License     : Apache-2.0
Signature   : RSA/SHA256, Wed 04 Dec 2019 12:02:15 PM PST, Key ID 70af9e8139db7c82
Source RPM  : salt-2019.2.0-6.21.1.src.rpm
Build Date  : Wed 04 Dec 2019 11:57:57 AM PST
Build Host  : sheep70
Relocations : (not relocatable)
Packager    : https://www.suse.com/
Vendor      : SUSE LLC <https://www.suse.com/>
URL         : http://saltstack.org/
Summary     : The client component for Saltstack
Description :
Salt minion is queried and controlled from the master.
Listens to the salt master and execute the commands.
Distribution: SUSE Linux Enterprise 15
```

```
ses6-admin:~ # rpm -qi salt-master
Name        : salt-master
Version     : 2019.2.0
Release     : 6.21.1
Architecture: x86_64
Install Date: Mon 20 Jan 2020 12:08:21 PM PST
Group       : System/Management
Size        : 2936818
License     : Apache-2.0
Signature   : RSA/SHA256, Wed 04 Dec 2019 12:02:15 PM PST, Key ID 70af9e8139db7c82
Source RPM  : salt-2019.2.0-6.21.1.src.rpm
Build Date  : Wed 04 Dec 2019 11:57:57 AM PST
Build Host  : sheep70
Relocations : (not relocatable)
Packager    : https://www.suse.com/
Vendor      : SUSE LLC <https://www.suse.com/>
URL         : http://saltstack.org/
Summary     : The management component of Saltstack with zmq protocol supported
Description :
The Salt master is the central server to which all minions connect.
Enabled commands to remote systems to be called in parallel rather than serially.
Distribution: SUSE Linux Enterprise 15
```

rbeldin commented 4 years ago

Apologies for the weird HTML. I'm not always sure what will trigger automatic HTML formatting. :(

rbeldin commented 4 years ago

I didn't really start over, but I did start adapting my commands, running

`deepsea stage run ceph.stage.N`

instead of

`salt-run state.orch ceph.stage.N`

It isn't clear to me what the difference between the two is; I thought they would be equivalent. I prefer the deepsea commands because they give a hint of what is going on.
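
As far as I can tell, the deepsea-cli package also provides a monitor subcommand that can be attached from a second terminal to watch a stage started via salt-run (this is my understanding of the CLI, not something I have dug into):

```
# In a second terminal, watch the progress of a running stage.
deepsea monitor
```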

After altering policy.cfg to match the example, and running through this again, it now gets further but complains at stage 3:

```
fqdn : ['fqdn ses6-mon1.ceph does not match minion id ses6-mon1',
        'fqdn ses6-osd3.ceph does not match minion id ses6-osd3',
        'fqdn ses6-mds.ceph does not match minion id ses6-mds',
        'fqdn ses6-osd2.ceph does not match minion id ses6-osd2',
        'fqdn ses6-mon3.ceph does not match minion id ses6-mon3',
        'fqdn ses6-mon2.ceph does not match minion id ses6-mon2',
        'fqdn ses6-osd1.ceph does not match minion id ses6-osd1',
        'fqdn ses6-admin.ceph does not match minion id ses6-admin']
```

... Stage execution failed:

And there are still no disks found...

I can't see what is going wrong here, as name resolution seems to be working fine:

```
ses6-admin:~ # salt '*' cmd.run 'host `hostname`'
ses6-admin:
    ses6-admin has address 192.168.122.10
ses6-mon2:
    ses6-mon2 has address 192.168.122.21
ses6-mon1:
    ses6-mon1 has address 192.168.122.20
ses6-mds:
    ses6-mds has address 192.168.122.15
ses6-osd3:
    ses6-osd3 has address 192.168.122.32
ses6-mon3:
    ses6-mon3 has address 192.168.122.22
ses6-osd2:
    ses6-osd2 has address 192.168.122.31
ses6-osd1:
    ses6-osd1 has address 192.168.122.30

ses6-admin:~ # salt '*' cmd.run 'host `hostname`.ceph'
ses6-admin:
    ses6-admin.ceph has address 192.168.122.10
ses6-mon1:
    ses6-mon1.ceph has address 192.168.122.20
ses6-osd2:
    ses6-osd2.ceph has address 192.168.122.31
ses6-mon3:
    ses6-mon3.ceph has address 192.168.122.22
ses6-mds:
    ses6-mds.ceph has address 192.168.122.15
ses6-mon2:
    ses6-mon2.ceph has address 192.168.122.21
ses6-osd3:
    ses6-osd3.ceph has address 192.168.122.32
ses6-osd1:
    ses6-osd1.ceph has address 192.168.122.30
```

This is quite odd, because name resolution works both forwards and backwards: I can resolve all hostnames, and all IPs resolve back to hostnames, for every member of the cluster.
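
One way to see exactly what the stage 3 validation is comparing would be to dump the relevant grains next to the minion id, since the error text pairs the fqdn grain with the minion id (a sketch using the standard grains.item call):

```
# Show the fqdn grain and the minion id side by side on every node.
salt '*' grains.item fqdn id
```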

There are no partition tables on any of the disks. Because this is a test setup, I made them small (1 GB). Is this a problem?

rbeldin commented 4 years ago

RTFM: https://documentation.suse.com/ses/5.5/html/ses-all/storage-bp-hwreq.html#deployment-osd-recommendation

2.1.2 Minimum Disk Size

There are two types of disk space needed to run on OSD: the space for the disk journal (for FileStore) or WAL/DB device (for BlueStore), and the primary space for the stored data. The minimum (and default) value for the journal/WAL/DB is 6 GB. The minimum space for data is 5 GB, as partitions smaller than 5 GB are automatically assigned the weight of 0.

So although the minimum disk space for an OSD is 11 GB, we do not recommend a disk smaller than 20 GB, even for testing purposes.
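
So for this KVM setup the fix is presumably just to grow the virtual data disks before redeploying; roughly something like the following on the KVM host (the image paths are only illustrative for my layout, adjust to the actual storage pool):

```
# With the guest shut down, grow the backing images for the OSD data
# disks to a supported size (>= 20 GB recommended even for testing).
qemu-img resize /var/lib/libvirt/images/ses6-osd1-vdb.qcow2 20G
qemu-img resize /var/lib/libvirt/images/ses6-osd1-vdc.qcow2 20G
qemu-img resize /var/lib/libvirt/images/ses6-osd1-vdd.qcow2 20G
```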

PS:

It would be nice if deployment gave a better message about disk size. I can see that the size check is baked into the code in the osd_list function...