@jschmid1 @tserong Any help appreciated. Saw this for the first time today, which makes me wonder what was merged recently that might have caused it. It might be that I changed the policy.cfg/profile generation code in deepsea.py, but I don't remember doing so.
The engulf functest checks a couple of things.
One is whether the imported storage profiles match the existing ones, which they don't here -- note how the diff of default and imported storage profiles shows the imported ones having an empty list of OSDs? I assume this is because we're running against Nautilus, which doesn't have ceph-disk, but the engulf function (via cephinspector.py) still uses ceph-disk to figure out what disks are configured.
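To illustrate what that failure mode presumably looks like (a hypothetical rendering based on the pillar dump posted later in this thread, not actual test output): the existing profile maps each OSD device to a format, while the imported one comes back with nothing under `osds`:

```yaml
# Hypothetical illustration, not actual functest output.
# Existing (expected) storage profile:
ceph:
  storage:
    osds:
      /dev/vdb:
        format: bluestore
      /dev/vdc:
        format: bluestore
---
# Imported profile as generated during engulf -- no OSDs detected:
ceph:
  storage:
    osds: {}
```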
The other thing the engulf functest does is compare the pillar data across all nodes before and after the engulf, to see if they match (modulo a couple of allowed discrepancies). Unhelpfully, it doesn't actually tell us what didn't match, although it may just be a flow-on effect of the storage profile failure. Is there any way to see the contents of /tmp/pillar-pre-engulf.yml and /tmp/pillar-post-engulf.yml, or have those since evaporated?
So, no, this is not Nautilus - it's Mimic, so ceph-disk is present. I will obtain and post the contents of those files. Thanks, @tserong
Thanks @smithfarm (I should probably make the "unexpected pillar mismatch after engulf" thingy print out what actually doesn't match...)
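A minimal sketch of how such a reporter could work (names hypothetical, not the actual functest code). It assumes the dumps have the `local:` top-level key seen in the pillar files below, and that `configuration_init` is one of the allowed discrepancies (an assumption on my part):

```python
# Sketch: print which top-level pillar keys differ between the
# pre-engulf and post-engulf dumps, instead of a bare pass/fail.
import yaml

ALLOWED_DISCREPANCIES = {"configuration_init"}  # assumed allow-list

def report_pillar_mismatches(pre_path, post_path):
    with open(pre_path) as f:
        pre = yaml.safe_load(f)["local"]
    with open(post_path) as f:
        post = yaml.safe_load(f)["local"]
    for key in sorted(set(pre) | set(post)):
        if key in ALLOWED_DISCREPANCIES:
            continue
        if pre.get(key) != post.get(key):
            print("pillar mismatch on {!r}: {!r} != {!r}".format(
                key, pre.get(key), post.get(key)))

report_pillar_mismatches("/tmp/pillar-pre-engulf.yml",
                         "/tmp/pillar-post-engulf.yml")
```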
This appears to be a regression introduced in 0.9.8. Given DeepSea 0.9.7 and Ceph 13.2.2, the functests pass.
I updated the issue description to include info about Ceph version and DeepSea version.
@tserong There are multiple failures in ceph.functests.1node with DeepSea master and Nautilus. Since the Nautilus RPMs are already in D:S:6.0 and the SES6 product, I'll leave it to the DeepSea developers to reproduce and fix.
@tserong OK, I got the files you asked for:
```
target192168000068:/home/ubuntu # cat /tmp/pillar-pre-engulf.yml
local:
    available_roles:
    - storage
    - admin
    - mon
    - mds
    - mgr
    - igw
    - openattic
    - rgw
    - ganesha
    - client-cephfs
    - client-radosgw
    - client-iscsi
    - client-nfs
    - benchmark-rbd
    - benchmark-blockdev
    - benchmark-fs
    - master
    benchmark:
        default-collection: simple.yml
        extra_mount_opts: nocrc
        job-file-directory: /run/ceph_bench_jobs
        log-file-directory: /var/log/ceph_bench_logs
        work-directory: /run/ceph_bench
    ceph:
        storage:
            osds:
                /dev/vdb:
                    format: bluestore
                /dev/vdc:
                    format: bluestore
                /dev/vdd:
                    format: bluestore
                /dev/vde:
                    format: bluestore
    cluster: ceph
    cluster_network: 192.168.0.0/24
    deepsea_minions: '*'
    fsid: edaed396-bf02-4241-afb7-216c565ca015
    public_network: 192.168.0.0/24
    roles:
    - master
    - admin
    - mon
    - mgr
    - mds
    - rgw
    - storage
    stage_prep_master: default-no-update-no-reboot
    stage_prep_minion: default-no-update-no-reboot
    time_server: target192168000068.teuthology
```
```
target192168000068:/home/ubuntu # cat /tmp/pillar-post-engulf.yml
local:
    available_roles:
    - storage
    - admin
    - mon
    - mds
    - mgr
    - igw
    - openattic
    - rgw
    - ganesha
    - client-cephfs
    - client-radosgw
    - client-iscsi
    - client-nfs
    - benchmark-rbd
    - benchmark-blockdev
    - benchmark-fs
    - master
    benchmark:
        default-collection: simple.yml
        extra_mount_opts: nocrc
        job-file-directory: /run/ceph_bench_jobs
        log-file-directory: /var/log/ceph_bench_logs
        work-directory: /run/ceph_bench
    cluster: ceph
    cluster_network: 192.168.0.0/24
    configuration_init: default-import
    deepsea_minions: '*'
    fsid: edaed396-bf02-4241-afb7-216c565ca015
    public_network: 192.168.0.0/24
    roles:
    - storage
    - master
    - mds
    - mgr
    - mon
    - rgw
    stage_prep_master: default-no-update-no-reboot
    stage_prep_minion: default-no-update-no-reboot
    time_server: target192168000068.teuthology
```
Just to reiterate, this is Ceph 13.2.2 and DeepSea 0.9.8. With DeepSea 0.9.7 this issue does not happen and the functests orchestration passes.
With Ceph 14.0.0 (and/or 14.0.1), the same failure appears, along with some other failures.
DeepSea 0.9.7 doesn't include the engulf functest at all (it was introduced by #1387, although the functests were passing in that PR).
From the diff of those two tmp files, the only thing it should be whining about is the missing OSD information, although I still don't know why it didn't detect the OSDs / write them correctly to the profile-import directory.
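For reference, condensing the two dumps above into a rough diff (hand-rendered from the pasted output, not actual tooling output): the `ceph:storage:osds` tree disappears, `configuration_init` is added, and the `roles` list is reordered and loses `admin`:

```diff
--- /tmp/pillar-pre-engulf.yml
+++ /tmp/pillar-post-engulf.yml
-    ceph:
-        storage:
-            osds:
-                /dev/vdb:
-                    format: bluestore
-                /dev/vdc:
-                    format: bluestore
-                /dev/vdd:
-                    format: bluestore
-                /dev/vde:
-                    format: bluestore
+    configuration_init: default-import
     roles:
-    - master
-    - admin
-    - mon
-    - mgr
-    - mds
-    - rgw
     - storage
+    - master
+    - mds
+    - mgr
+    - mon
+    - rgw
```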
The problem is caused by DeepSea now deploying LVM OSDs.
The engulf function calls cephinspector.get_ceph_disks_yml(), which in turn runs ceph-disk list, then iterates through the output looking for OSD data partitions in order to generate the imported profiles. For LVM volumes, ceph-disk list reports the partition type as "other", which cephinspector.get_ceph_disks_yml() skips over (it's only looking for part_dict["type"] == "data").
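A rough sketch of the filtering logic just described (not the actual cephinspector code; it assumes `ceph-disk list --format=json` returns a list of devices, each with a "partitions" list whose entries carry a "type" field such as "data", "journal", or "other"):

```python
# Sketch of the data-partition filter that drops LVM-backed OSDs.
import json
import subprocess

def get_data_partitions():
    out = subprocess.check_output(
        ["ceph-disk", "list", "--format=json"])
    partitions = []
    for device in json.loads(out):
        for part_dict in device.get("partitions", []):
            # LVM-backed OSDs are reported with type "other", so they
            # fail this check and never make it into the imported
            # storage profile -- hence the empty osds map above.
            if part_dict.get("type") == "data":
                partitions.append(part_dict)
    return partitions
```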
The integration tests have been redone recently, which included completely reworking how policy.cfg and the storage profiles get generated. The result is a failure in ceph.functests.1node.

DeepSea version is 0.9.8, installed from RPM. Ceph is 13.2.2.

The ceph.functests.1node orchestration fails like so:

I'd appreciate help figuring out what's wrong - i.e. is it a bug in my policy.cfg generation code, or is it a bug in the functests?
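For anyone reproducing: assuming the usual DeepSea workflow of running orchestrations via salt-run on the master (an assumption, since the exact invocation isn't shown above), the failing orchestration would be kicked off with:

```sh
# Assumed reproduction step: run the functests orchestration on the
# Salt master, the standard way DeepSea orchestrations are invoked.
salt-run state.orch ceph.functests.1node
```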
Here is a full teuthology log showing the failure. Between Stages 1 and 2 the full policy.cfg/storage profile generation process is shown.
http://10.86.0.135/ubuntu-2018-11-06_20:07:05-suse-wip-qa-storage-roles---basic-openstack/262/teuthology.log