ceph / ceph-ansible

Ansible playbooks to deploy Ceph, the distributed filesystem.
Apache License 2.0

Partial installation of cluster, only mons, osds and mgrs installed #5114

Closed gramallo closed 4 years ago

gramallo commented 4 years ago

Bug Report

What happened: Each time the installation playbook is executed, the only groups whose tasks actually run are mons, osds, and mgrs. The other groups are skipped, and nfss shows as failed in the recap (message shown under the reproduction steps below).

What you expected to happen:

Roles to be installed on these groups, as defined in the inventory:

[rgws]
cnode04
cnode05

[rgwloadbalancers]
cnode04
cnode05

[grafana-server]
cnode04

[nfss]
cnode04
cnode05

[iscsigws]
cnode04
cnode05

How to reproduce it (minimal and precise):

ansible-playbook site.yml -vvv

Each time the installation playbook is executed, the only groups whose tasks run are mons, osds, and mgrs. The other groups are skipped, and nfss shows as failed in the recap with this message:

Install Ceph NFS               : In Progress (0:01:38)
    This phase can be restarted by running: roles/ceph-nfs/tasks/main.yml
[mons]
cnode01
cnode02
cnode03

[mgrs]
cnode01
cnode02
cnode03

[osds]
cnode01
cnode02
cnode03

[rgws]
cnode04
cnode05

[rgwloadbalancers]
cnode04
cnode05

[grafana-server]
cnode04

[nfss]
cnode04
cnode05

[iscsigws]
cnode04
cnode05

[clients]
cnode05

Share your group_vars files and inventory. Attached: all.yml, ceph-fetch-keys.yml, factss.yml, iscsigws.yml, mgrs.yml, mons.yml, nfss.yml, osds.yml, rbdmirrors.yml, rgwloadbalancers.yml, rgws.yml

Environment:

dsavineau commented 4 years ago

Could you share the full ansible logs and the variables defined in your {group,host}_vars/*.yml files?

gramallo commented 4 years ago

Hi Dimitry, thanks for jumping into this. The requested files are attached; I don't have any host_vars files.

Recent changes: updated the two servers below to Linux kernel 5.5.7-1 so they are supported for iSCSI, as per the Ceph docs:

cnode04
cnode05

ansible.log

yml-files.zip

dsavineau commented 4 years ago

Recent changes: updated the two servers below to Linux kernel 5.5.7-1 so they are supported for iSCSI, as per the Ceph docs:

cnode04
cnode05

Could you share the ceph docs you were looking at?

AFAIK the ceph iscsi nodes have supported CentOS 7 for a while (at least since 7.2 or 7.3), and the fact that the CentOS 7 kernel doesn't match the upstream version requirement doesn't mean the features aren't present (there are a lot of backports in the CentOS/RHEL kernel), so the 3.10 version number doesn't mean anything on its own. BTW we already test whether the required features are present in https://github.com/ceph/ceph-ansible/blob/stable-4.0/roles/ceph-validate/tasks/check_iscsi.yml#L35-L44

About the issue on the nfss nodes:

  stderr: 'error opening pool cephfs_data: (2) No such file or directory'

The cephfs_data pool is created when deploying mds nodes, but you're not using any mds nodes.

Why are you using ceph_nfs_rados_backend: true if you're not using any mds?
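The failure mode can be illustrated with a short sketch: the NFS (ganesha) RADOS backend expects the cephfs_data pool, which only exists once mds nodes are deployed. The pool list below is hypothetical sample data; on a live cluster you would feed it `ceph osd pool ls` instead.

```shell
# Check whether the pool the NFS rados backend needs actually exists.
# Hypothetical pool list standing in for: pools=$(ceph osd pool ls)
pools="iscsi_pool
.rgw.root
default.rgw.control"
if echo "$pools" | grep -qx 'cephfs_data'; then
  echo "cephfs_data present"
else
  echo "cephfs_data missing: disable ceph_nfs_rados_backend or deploy mds nodes"
fi
```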

gramallo commented 4 years ago

I was looking at this doc: https://docs.ceph.com/docs/nautilus/rbd/iscsi-targets/

My OS and kernel on cnode04 and cnode05: CentOS Linux release 7.7.1908 (Core) with kernel 5.5.7-1.el7.elrepo.x86_64

I assume I don't need to install the required iSCSI packages before running Ansible in order to get iSCSI installed, as per this doc?

https://docs.ceph.com/docs/nautilus/rbd/iscsi-target-ansible/

I checked /boot/config-5.5.7-1.el7.elrepo.x86_64 and can see the following defined:

CONFIG_TCM_USER2=m
CONFIG_TARGET_CORE=m
CONFIG_ISCSI_TARGET=m
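That manual check can be scripted; a minimal sketch, simulated here with the three values quoted above (on a live host you would point `cfg` at /boot/config-$(uname -r) instead of a temp file):

```shell
# Count how many of the iSCSI target kernel options are built (=m or =y).
# Simulated kernel config; on a real host use: cfg=/boot/config-$(uname -r)
cfg=$(mktemp)
printf 'CONFIG_TCM_USER2=m\nCONFIG_TARGET_CORE=m\nCONFIG_ISCSI_TARGET=m\n' > "$cfg"
found=$(grep -cE '^(CONFIG_TCM_USER2|CONFIG_TARGET_CORE|CONFIG_ISCSI_TARGET)=(m|y)' "$cfg")
echo "iSCSI target options enabled: $found of 3"
rm -f "$cfg"
```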

Regarding the ceph_nfs_rados_backend: true question: I'm unsure. Since I am using nfs_obj_gw: true, I thought it was part of that, but I cannot find any documentation to read on this.

I will disable it and re-run.

gramallo commented 4 years ago

I reviewed all my YML files and changed ceph_nfs_rados_backend to the recommended value. The NFS tasks now complete, but I am hitting this error in the iSCSI role:

2020-03-09 15:52:09,655 p=9931 u=owolfi |  <cnode04> (1, '\n{"msg": "Could not find the requested service tcmu-runner: host", "failed": true, "invocation": {"module_args": {"no_block": false, "force": null, "name": "tcmu-runner", "daemon_reexec": false, "enabled": true, "daemon_reload": false, "state": "started", "user": null, "scope": null, "masked": false}}}\n', '')
2020-03-09 15:52:09,656 p=9931 u=owolfi |  <cnode04> Failed to connect to the host via ssh: 
2020-03-09 15:52:09,668 p=9931 u=owolfi |  fatal: [cnode04]: FAILED! => changed=false 
  invocation:
    module_args:
      daemon_reexec: false
      daemon_reload: false
      enabled: true
      force: null
      masked: false
      name: tcmu-runner
      no_block: false
      scope: null
      state: started
      user: null
  msg: 'Could not find the requested service tcmu-runner: host'
2020-03-09 15:52:09,723 p=9931 u=owolfi |  <cnode05> (1, '\n{"msg": "Could not find the requested service tcmu-runner: host", "failed": true, "invocation": {"module_args": {"no_block": false, "force": null, "name": "tcmu-runner", "daemon_reexec": false, "enabled": true, "daemon_reload": false, "state": "started", "user": null, "scope": null, "masked": false}}}\n', '')
2020-03-09 15:52:09,723 p=9931 u=owolfi |  <cnode05> Failed to connect to the host via ssh: 
2020-03-09 15:52:09,734 p=9931 u=owolfi |  fatal: [cnode05]: FAILED! => changed=false 
  invocation:
    module_args:
      daemon_reexec: false
      daemon_reload: false
      enabled: true
      force: null
      masked: false
      name: tcmu-runner
      no_block: false
      scope: null
      state: started
      user: null
  msg: 'Could not find the requested service tcmu-runner: host'
2020-03-09 15:52:09,735 p=9931 u=owolfi |  NO MORE HOSTS LEFT ****************************************************************************************************************************************************************
2020-03-09 15:52:09,737 p=9931 u=owolfi |  PLAY RECAP ************************************************************************************************************************************************************************
2020-03-09 15:52:09,737 p=9931 u=owolfi |  cnode01                    : ok=267  changed=47   unreachable=0    failed=0    skipped=374  rescued=0    ignored=0   
2020-03-09 15:52:09,738 p=9931 u=owolfi |  cnode02                    : ok=236  changed=40   unreachable=0    failed=0    skipped=356  rescued=0    ignored=0   
2020-03-09 15:52:09,738 p=9931 u=owolfi |  cnode03                    : ok=245  changed=42   unreachable=0    failed=0    skipped=350  rescued=0    ignored=0   
2020-03-09 15:52:09,738 p=9931 u=owolfi |  cnode04                    : ok=260  changed=52   unreachable=0    failed=1    skipped=366  rescued=0    ignored=0   
2020-03-09 15:52:09,738 p=9931 u=owolfi |  cnode05                    : ok=279  changed=44   unreachable=0    failed=1    skipped=417  rescued=0    ignored=0   
2020-03-09 15:52:09,739 p=9931 u=owolfi |  INSTALLER STATUS ******************************************************************************************************************************************************************
2020-03-09 15:52:09,743 p=9931 u=owolfi |  Install Ceph Monitor           : Complete (0:01:16)
2020-03-09 15:52:09,744 p=9931 u=owolfi |  Install Ceph Manager           : Complete (0:01:52)
2020-03-09 15:52:09,744 p=9931 u=owolfi |  Install Ceph OSD               : Complete (0:02:06)
2020-03-09 15:52:09,744 p=9931 u=owolfi |  Install Ceph RGW               : Complete (0:00:48)
2020-03-09 15:52:09,744 p=9931 u=owolfi |  Install Ceph NFS               : Complete (0:01:47)
2020-03-09 15:52:09,745 p=9931 u=owolfi |  Install Ceph Client            : Complete (0:00:23)
2020-03-09 15:52:09,745 p=9931 u=owolfi |  Install Ceph iSCSI Gateway     : In Progress (0:01:26)
2020-03-09 15:52:09,745 p=9931 u=owolfi |   This phase can be restarted by running: roles/ceph-iscsi-gw/tasks/main.yml
2020-03-09 15:52:09,746 p=9931 u=owolfi |  Monday 09 March 2020  15:52:09 +1300 (0:00:00.476)       0:14:07.538 ********** 
2020-03-09 15:52:09,746 p=9931 u=owolfi |  =============================================================================== 

This seems to be the same as the issue below. If this is caused by an incorrect iSCSI configuration, how can I find in the logs which configuration setting is returning the error?

https://github.com/ceph/ceph-ansible/issues/4857
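One observation that may help: "Could not find the requested service tcmu-runner" from the systemd module usually means the unit file does not exist on the host, i.e. the tcmu-runner/ceph-iscsi packages were never installed on the gateway nodes. On a live node you would check with `systemctl list-unit-files 'tcmu-runner*'` and `rpm -q tcmu-runner ceph-iscsi`; the sketch below simulates that check against a hypothetical unit-file listing.

```shell
# Diagnose the systemd error: look for the tcmu-runner unit in a
# unit-file listing. Hypothetical listing standing in for:
#   units=$(systemctl list-unit-files 'tcmu-runner*' --no-legend)
units="sshd.service
chronyd.service"
if echo "$units" | grep -q '^tcmu-runner'; then
  echo "tcmu-runner unit present"
else
  echo "tcmu-runner unit missing: the ceph-iscsi packages likely failed to install"
fi
```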

This is my iscsigws.yml content (IPs removed):

---
# Variables here are applicable to all host groups NOT roles

# This sample file generated by generate_group_vars_sample.sh

# Dummy variable to avoid error because ansible does not recognize the
# file as a good configuration file when no variable in it.
dummy:

# You can override vars by using host or group vars

###########
# GENERAL #
###########
# Whether or not to generate secure certificate to iSCSI gateway nodes
#generate_crt: False

#iscsi_conf_overrides: {}
iscsi_pool_name: iscsi_pool
iscsi_pool_size: "{{ osd_pool_default_size }}"
deploy_settings: True
perform_system_checks: True
#copy_admin_key: True
gateway_iqn: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
gateway_ip_list: 
##################
# RBD-TARGET-API #
##################
# Optional settings related to the CLI/API service
api_user: admin
api_password: admin
api_port: 5000
api_secure: false
#loop_delay: 1
trusted_ip_list: 

##########
# DOCKER #
##########

# Resource limitation
# For the whole list of limits you can apply see: docs.docker.com/engine/admin/resource_constraints
# Default values are based from: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/red_hat_ceph_storage_hardware_guide/minimum_recommendations
# These options can be passed using the 'ceph_mds_docker_extra_env' variable.

# TCMU_RUNNER resource limitation
#ceph_tcmu_runner_docker_memory_limit: "{{ ansible_memtotal_mb }}m"
#ceph_tcmu_runner_docker_cpu_limit: 1

# RBD_TARGET_GW resource limitation
#ceph_rbd_target_gw_docker_memory_limit: "{{ ansible_memtotal_mb }}m"
#ceph_rbd_target_gw_docker_cpu_limit: 1

# RBD_TARGET_API resource limitation
#ceph_rbd_target_api_docker_memory_limit: "{{ ansible_memtotal_mb }}m"
#ceph_rbd_target_api_docker_cpu_limit: 1

Error when trying to run the phase:

ansible-playbook roles/ceph-iscsi-gw/tasks/main.yml -vvv

ansible-playbook 2.8.8
  config file = /usr/share/ceph-ansible/ansible.cfg
  configured module search path = [u'/usr/share/ceph-ansible/library']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible-playbook
  python version = 2.7.5 (default, Aug  7 2019, 00:51:29) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
Using /usr/share/ceph-ansible/ansible.cfg as config file
host_list declined parsing /etc/ansible/hosts as it did not pass it's verify_file() method
script declined parsing /etc/ansible/hosts as it did not pass it's verify_file() method
auto declined parsing /etc/ansible/hosts as it did not pass it's verify_file() method
[DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to allow bad characters in group names by default, this will change, but still be user configurable on deprecation. This feature will be removed in version 2.10. Deprecation warnings can be
disabled by setting deprecation_warnings=False in ansible.cfg.
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

Parsed /etc/ansible/hosts inventory source with ini plugin
ERROR! 'include_tasks' is not a valid attribute for a Play

The error appears to be in '/usr/share/ceph-ansible/roles/ceph-iscsi-gw/tasks/main.yml': line 2, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

---
- name: include common.yml
  ^ here
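For what it's worth, that last error is expected: roles/ceph-iscsi-gw/tasks/main.yml is a role task file, not a playbook, so ansible-playbook cannot run it directly (hence "'include_tasks' is not a valid attribute for a Play"). A sketch of the usual way to rerun just this phase, assuming the inventory group name used in this thread:

```shell
# Rerun the whole site playbook, limited to the iSCSI gateway hosts;
# ansible-playbook only accepts playbooks, not role task files.
ansible-playbook site.yml --limit iscsigws
```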