@jschmid1 @tserong Any help appreciated. Saw this for the first time today, which makes me wonder what was merged recently that might have caused it. It might be that I changed the policy.cfg/profile generation code in deepsea.py, but I don't remember doing so.
The engulf functest checks a couple of things.
One is whether the imported storage profiles match the existing ones, which they don't here -- note how the diff of default and imported storage profiles shows the imported ones having an empty list of OSDs? I assume this is because we're running against Nautilus, which doesn't have ceph-disk, but the engulf function (via cephinspector.py) still uses ceph-disk to figure out what disks are configured.
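To illustrate what that failure mode presumably looks like (a hypothetical rendering based on the pillar dump posted later in this thread, not actual test output): the existing profile maps each OSD device to a format, while the imported one comes back with nothing under `osds`:

```yaml
# Hypothetical illustration, not actual functest output.
# Existing (expected) storage profile:
ceph:
  storage:
    osds:
      /dev/vdb:
        format: bluestore
      /dev/vdc:
        format: bluestore
---
# Imported profile as generated during engulf -- no OSDs detected:
ceph:
  storage:
    osds: {}
```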
The other thing the engulf functest does is compare the pillar data across all nodes before and after the engulf, to see if they match (modulo a couple of allowed discrepancies). Unhelpfully, it doesn't actually tell us what didn't match, although it may just be a flow-on effect of the storage profile failure. Is there any way to see the contents of /tmp/pillar-pre-engulf.yml and /tmp/pillar-post-engulf.yml, or have those since evaporated?
So, no, this is not Nautilus - it's Mimic, so ceph-disk is present. I will obtain and post the contents of those files. Thanks, @tserong
Thanks @smithfarm (I should probably make the "unexpected pillar mismatch after engulf" thingy print out what actually doesn't match...)
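A minimal sketch of how such a reporter could work (names hypothetical, not the actual functest code). It assumes the dumps have the `local:` top-level key seen in the pillar files below, and that `configuration_init` is one of the allowed discrepancies (an assumption on my part):

```python
# Sketch: print which top-level pillar keys differ between the
# pre-engulf and post-engulf dumps, instead of a bare pass/fail.
import yaml

ALLOWED_DISCREPANCIES = {"configuration_init"}  # assumed allow-list

def report_pillar_mismatches(pre_path, post_path):
    with open(pre_path) as f:
        pre = yaml.safe_load(f)["local"]
    with open(post_path) as f:
        post = yaml.safe_load(f)["local"]
    for key in sorted(set(pre) | set(post)):
        if key in ALLOWED_DISCREPANCIES:
            continue
        if pre.get(key) != post.get(key):
            print("pillar mismatch on {!r}: {!r} != {!r}".format(
                key, pre.get(key), post.get(key)))

report_pillar_mismatches("/tmp/pillar-pre-engulf.yml",
                         "/tmp/pillar-post-engulf.yml")
```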
This appears to be a regression introduced in 0.9.8. Given DeepSea 0.9.7 and Ceph 13.2.2, the functests pass.
I updated the issue description to include info about Ceph version and DeepSea version.
@tserong There are multiple failures in ceph.functests.1node with DeepSea master and Nautilus. Since the Nautilus RPMs are already in D:S:6.0 and the SES6 product, I'll leave it to the DeepSea developers to reproduce and fix.
@tserong OK, I got the files you asked for:
```
target192168000068:/home/ubuntu # cat /tmp/pillar-pre-engulf.yml
local:
    available_roles:
    - storage
    - admin
    - mon
    - mds
    - mgr
    - igw
    - openattic
    - rgw
    - ganesha
    - client-cephfs
    - client-radosgw
    - client-iscsi
    - client-nfs
    - benchmark-rbd
    - benchmark-blockdev
    - benchmark-fs
    - master
    benchmark:
        default-collection: simple.yml
        extra_mount_opts: nocrc
        job-file-directory: /run/ceph_bench_jobs
        log-file-directory: /var/log/ceph_bench_logs
        work-directory: /run/ceph_bench
    ceph:
        storage:
            osds:
                /dev/vdb:
                    format: bluestore
                /dev/vdc:
                    format: bluestore
                /dev/vdd:
                    format: bluestore
                /dev/vde:
                    format: bluestore
    cluster: ceph
    cluster_network: 192.168.0.0/24
    deepsea_minions: '*'
    fsid: edaed396-bf02-4241-afb7-216c565ca015
    public_network: 192.168.0.0/24
    roles:
    - master
    - admin
    - mon
    - mgr
    - mds
    - rgw
    - storage
    stage_prep_master: default-no-update-no-reboot
    stage_prep_minion: default-no-update-no-reboot
    time_server: target192168000068.teuthology
```
```
target192168000068:/home/ubuntu # cat /tmp/pillar-post-engulf.yml
local:
    available_roles:
    - storage
    - admin
    - mon
    - mds
    - mgr
    - igw
    - openattic
    - rgw
    - ganesha
    - client-cephfs
    - client-radosgw
    - client-iscsi
    - client-nfs
    - benchmark-rbd
    - benchmark-blockdev
    - benchmark-fs
    - master
    benchmark:
        default-collection: simple.yml
        extra_mount_opts: nocrc
        job-file-directory: /run/ceph_bench_jobs
        log-file-directory: /var/log/ceph_bench_logs
        work-directory: /run/ceph_bench
    cluster: ceph
    cluster_network: 192.168.0.0/24
    configuration_init: default-import
    deepsea_minions: '*'
    fsid: edaed396-bf02-4241-afb7-216c565ca015
    public_network: 192.168.0.0/24
    roles:
    - storage
    - master
    - mds
    - mgr
    - mon
    - rgw
    stage_prep_master: default-no-update-no-reboot
    stage_prep_minion: default-no-update-no-reboot
    time_server: target192168000068.teuthology
```
Just to reiterate, this is Ceph 13.2.2 and DeepSea 0.9.8. With DeepSea 0.9.7 this issue does not happen and the functests orchestration passes.
With Ceph 14.0.0 (and/or 14.0.1), the same failure appears, along with some other failures.
DeepSea 0.9.7 doesn't include the engulf functest at all (it was introduced by #1387, although the functests were passing in that PR).
From the diff of those two tmp files, the only thing it should be whining about is the missing OSD information, although I still don't know why it didn't detect the OSDs / write them correctly to the profile-import directory.
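For reference, condensing the two dumps above into a rough diff (hand-rendered from the pasted output, not actual tooling output): the `ceph:storage:osds` tree disappears, `configuration_init` is added, and the `roles` list is reordered and loses `admin`:

```diff
--- /tmp/pillar-pre-engulf.yml
+++ /tmp/pillar-post-engulf.yml
-    ceph:
-        storage:
-            osds:
-                /dev/vdb:
-                    format: bluestore
-                /dev/vdc:
-                    format: bluestore
-                /dev/vdd:
-                    format: bluestore
-                /dev/vde:
-                    format: bluestore
+    configuration_init: default-import
     roles:
-    - master
-    - admin
-    - mon
-    - mgr
-    - mds
-    - rgw
     - storage
+    - master
+    - mds
+    - mgr
+    - mon
+    - rgw
```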
The problem is caused by DeepSea now deploying LVM OSDs.
The engulf function calls cephinspector.get_ceph_disks_yml(), which in turn runs ceph-disk list, then iterates through the output looking for OSD data partitions in order to generate the imported profiles. For LVM volumes, ceph-disk list reports the partition type as "other", which cephinspector.get_ceph_disks_yml() skips over (it's only looking for part_dict["type"] == "data").
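A rough sketch of the filtering logic just described (not the actual cephinspector code; it assumes `ceph-disk list --format=json` returns a list of devices, each with a "partitions" list whose entries carry a "type" field such as "data", "journal", or "other"):

```python
# Sketch of the data-partition filter that drops LVM-backed OSDs.
import json
import subprocess

def get_data_partitions():
    out = subprocess.check_output(
        ["ceph-disk", "list", "--format=json"])
    partitions = []
    for device in json.loads(out):
        for part_dict in device.get("partitions", []):
            # LVM-backed OSDs are reported with type "other", so they
            # fail this check and never make it into the imported
            # storage profile -- hence the empty osds map above.
            if part_dict.get("type") == "data":
                partitions.append(part_dict)
    return partitions
```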
The integration tests have been redone recently, which included completely reworking how policy.cfg and the storage profiles get generated. The result is a failure in ceph.functests.1node.

DeepSea version is 0.9.8, installed from RPM. Ceph is 13.2.2.

The ceph.functests.1node orchestration fails like so:

I'd appreciate help figuring out what's wrong - i.e. is it a bug in my policy.cfg generation code, or is it a bug in the functests?
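For anyone reproducing: assuming the usual DeepSea workflow of running orchestrations via salt-run on the master (an assumption, since the exact invocation isn't shown above), the failing orchestration would be kicked off with:

```sh
# Assumed reproduction step: run the functests orchestration on the
# Salt master, the standard way DeepSea orchestrations are invoked.
salt-run state.orch ceph.functests.1node
```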
Here is a full teuthology log showing the failure. Between Stages 1 and 2 the full policy.cfg/storage profile generation process is shown.
http://10.86.0.135/ubuntu-2018-11-06_20:07:05-suse-wip-qa-storage-roles---basic-openstack/262/teuthology.log