DataDog / integrations-core

Core integrations of the Datadog Agent
BSD 3-Clause "New" or "Revised" License

[ceph] luminous support for service check? #889

Closed: Darwiner closed this issue 6 years ago

Darwiner commented 6 years ago

Output of the info page:

====================
Collector (v 5.19.0)
====================

  Status date: 2017-11-20 08:45:47 (8s ago)
  Pid: 2227
  Platform: Linux-4.13.4-1-pve-x86_64-with-debian-9.1
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/collector.log, syslog:/dev/log

  Clocks
  ======

    NTP offset: -0.0013 s
    System UTC time: 2017-11-20 16:45:56.106534

  Paths
  =====

    conf.d: /etc/dd-agent/conf.d
    checks.d: /opt/datadog-agent/agent/checks.d

  Hostnames
  =========

    socket-hostname: van-cl01-proxmox01
    hostname: van-cl01-proxmox01.example.net
    socket-fqdn: van-cl01-proxmox01.example.net

  Checks
  ======

    ceph (5.19.0)
    -------------
      - instance #0 [OK]
      - Collected 24 metrics, 0 events & 1 service check

    ntp (5.19.0)
    ------------
      - Collected 0 metrics, 0 events & 0 service checks

    disk (5.19.0)
    -------------
      - instance #0 [OK]
      - Collected 32 metrics, 0 events & 0 service checks

    network (5.19.0)
    ----------------
      - instance #0 [OK]
      - Collected 37 metrics, 0 events & 0 service checks

  Emitters
  ========

    - http_emitter [OK]

====================
Dogstatsd (v 5.19.0)
====================

  Status date: 2017-11-20 08:45:51 (5s ago)
  Pid: 2224
  Platform: Linux-4.13.4-1-pve-x86_64-with-debian-9.1
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/dogstatsd.log, syslog:/dev/log

  Flush count: 85612
  Packet Count: 1056572
  Packets per second: 1.9
  Metric count: 11
  Event count: 0
  Service check count: 0

====================
Forwarder (v 5.19.0)
====================

  Status date: 2017-11-20 08:45:55 (1s ago)
  Pid: 2223
  Platform: Linux-4.13.4-1-pve-x86_64-with-debian-9.1
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/forwarder.log, syslog:/dev/log

  Queue Size: 414 bytes
  Queue Length: 1
  Flush Count: 279788
  Transactions received: 208768
  Transactions flushed: 208767
  Transactions rejected: 0
  API Key Status: API Key is valid

======================
Trace Agent (v 5.19.0)
======================

  Pid: 2222
  Uptime: 856688 seconds
  Mem alloc: 1545992 bytes

  Hostname: van-cl01-proxmox01.example.net
  Receiver: localhost:8126
  API Endpoint: https://trace.agent.datadoghq.com

  Bytes sent (1 min): 0
  Traces sent (1 min): 0
  Stats sent (1 min): 0

Additional environment details (Operating System, Cloud provider, etc):

It would seem the current ceph service check might not support ceph 12.2.1 (luminous) yet?

Steps to reproduce the issue:

  1. ceph -s gives this output:

    # ceph -s
      cluster:
        id:     13d6ef40-32e6-4431-b7c3-098aadfff6e4
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum van-cl01-proxmox02,van-cl01-proxmox03,van-cl01-proxmox01
        mgr: van-cl01-proxmox02(active), standbys: van-cl01-proxmox03, van-cl01-proxmox01
        osd: 3 osds: 3 up, 3 in

      data:
        pools:   1 pools, 128 pgs
        objects: 0 objects, 0 bytes
        usage:   3201 MB used, 4721 GB / 4725 GB avail
        pgs:     128 active+clean

Describe the results you received:

Yet in the UI and on the monitoring pages, these hosts report ceph as being in a warning state for the ceph.overall_status service check...

Since the output of ceph -s has changed from previous versions, I suspect the check needs some changes to pick up the correct status from this new version of ceph.
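
To see what the check would actually be reading on luminous, the raw health fields can be dumped from the JSON status output. A minimal sketch (not the Datadog check itself), assuming ceph status --format json is available on the host:

    # Minimal sketch: print the health fields a check could read from
    # "ceph status --format json". On luminous, "status" is the new field
    # and "overall_status" is the deprecated one mentioned in this issue.
    import json
    import subprocess

    raw = subprocess.check_output(["ceph", "status", "--format", "json"])
    health = json.loads(raw).get("health", {})

    print("status:         %s" % health.get("status"))          # expected HEALTH_OK here
    print("overall_status: %s" % health.get("overall_status"))  # may report HEALTH_WARN on luminous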

Describe the results you expected:

Monitoring should report that the cluster state is OK.

Additional information you deem important (e.g. issue happens only occasionally):

jeremy-lq commented 6 years ago

@Darwiner, I can confirm that the check currently doesn't fully support ceph 12.2.1 (luminous). We have a task on the backlog for this and will update this issue once a PR is merged.

Darwiner commented 6 years ago

@jeremy-lq Actually, FWIW, I just stumbled onto this...

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021011.html

It would seem that in luminous, the appropriate field to use for obtaining the cluster status is now "status" instead of "overall_status", and that having "overall_status" return HEALTH_WARN is their way of getting people to stop using that field.

At least, that's what I make of the answer...
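
If that reading is right, a check could prefer the new field and fall back to the old one on pre-luminous clusters. A hedged sketch (not the merged Datadog fix; the function name and the AgentCheck-style status constants are illustrative assumptions), given the parsed JSON status as a dict:

    # Hedged sketch: prefer luminous's health["status"] and fall back to the
    # deprecated health["overall_status"] for older ceph releases, then map
    # the result to an AgentCheck-style service check status.
    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # assumed status constants

    def cluster_service_check_status(status_json):
        health = status_json.get("health", {})
        # Luminous populates "status"; pre-luminous output only has "overall_status".
        overall = health.get("status") or health.get("overall_status")
        return {
            "HEALTH_OK": OK,
            "HEALTH_WARN": WARNING,
            "HEALTH_ERR": CRITICAL,
        }.get(overall, UNKNOWN)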