DataDog / integrations-core

Core integrations of the Datadog Agent
BSD 3-Clause "New" or "Revised" License
933 stars 1.4k forks source link

Oracle Integration is missing System Metrics #16574

Open Shigawire opened 10 months ago

Shigawire commented 10 months ago

Issue observed After installing the Oracle Integration, we do receive data from the instance via the Agent and the reported metrics are shown in the Datadog-provided "DBM Oracle Database Overview"-Dashboard:

image

However, we're missing data in these reports, as the dashboards show that some metrics are missing. In fact, the integration does query DBMS statistics, but does not report all of them back to the agent: https://github.com/DataDog/integrations-core/blob/master/oracle/datadog_checks/oracle/queries.py#L19

For example, the very important metric User Transaction Per Sec is not used by the integration.

Our Oracle database can report the following metrics via GV$SYSMETRIC:

Execute Without Parse Ratio
Soft Parse Ratio
User Calls Ratio
Host CPU Utilization (%)
Network Traffic Volume Per Sec
Enqueue Timeouts Per Sec
Enqueue Timeouts Per Txn
Enqueue Waits Per Sec
Enqueue Waits Per Txn
Enqueue Deadlocks Per Sec
Enqueue Deadlocks Per Txn
Enqueue Requests Per Sec
Enqueue Requests Per Txn
DB Block Gets Per Sec
DB Block Gets Per Txn
Consistent Read Gets Per Sec
Consistent Read Gets Per Txn
DB Block Changes Per Sec
DB Block Changes Per Txn
Consistent Read Changes Per Sec
Consistent Read Changes Per Txn
CPU Usage Per Sec
CPU Usage Per Txn
CR Blocks Created Per Sec
CR Blocks Created Per Txn
CR Undo Records Applied Per Sec
CR Undo Records Applied Per Txn
User Rollback UndoRec Applied Per Sec
User Rollback Undo Records Applied Per Txn
Leaf Node Splits Per Sec
Leaf Node Splits Per Txn
Branch Node Splits Per Sec
Branch Node Splits Per Txn
PX downgraded 1 to 25% Per Sec
PX downgraded 25 to 50% Per Sec
PX downgraded 50 to 75% Per Sec
PX downgraded 75 to 99% Per Sec
PX downgraded to serial Per Sec
Physical Read Total IO Requests Per Sec
Physical Read Total Bytes Per Sec
GC CR Block Received Per Second
GC CR Block Received Per Txn
GC Current Block Received Per Second
GC Current Block Received Per Txn
Global Cache Average CR Get Time
Global Cache Average Current Get Time
Physical Write Total IO Requests Per Sec
Global Cache Blocks Corrupted
Global Cache Blocks Lost
Current Logons Count
Current Open Cursors Count
User Limit %
SQL Service Response Time
Database Wait Time Ratio
Database CPU Time Ratio
Response Time Per Txn
Row Cache Hit Ratio
Row Cache Miss Ratio
Library Cache Hit Ratio
Library Cache Miss Ratio
Shared Pool Free %
PGA Cache Hit %
Process Limit %
Session Limit %
Executions Per Txn
Executions Per Sec
Txns Per Logon
Database Time Per Sec
Physical Write Total Bytes Per Sec
Physical Read IO Requests Per Sec
Physical Read Bytes Per Sec
Physical Write IO Requests Per Sec
Physical Write Bytes Per Sec
DB Block Changes Per User Call
DB Block Gets Per User Call
Executions Per User Call
Logical Reads Per User Call
Total Sorts Per User Call
Total Table Scans Per User Call
Current OS Load
Streams Pool Usage Percentage
PQ QC Session Count
PQ Slave Session Count
Queries parallelized Per Sec
DML statements parallelized Per Sec
DDL statements parallelized Per Sec
PX operations not downgraded Per Sec
Session Count
Average Synchronous Single-Block Read Latency
I/O Megabytes per Second
I/O Requests per Second
Average Active Sessions
Active Serial Sessions
Active Parallel Sessions
Captured user calls
Replayed user calls
Workload Capture and Replay status
Background CPU Usage Per Sec
Background Time Per Sec
Host CPU Usage Per Sec
Cell Physical IO Interconnect Bytes
Temp Space Used
Total PGA Allocated
Total PGA Used by SQL Workareas
Run Queue Per Sec
VM in bytes Per Sec
VM out bytes Per Sec
Buffer Cache Hit Ratio
Total PGA Used by SQL Workareas
User Transaction Per Sec
Physical Reads Per Sec
Physical Reads Per Txn
Physical Writes Per Sec
Physical Writes Per Txn
Physical Reads Direct Per Sec
Physical Reads Direct Per Txn
Redo Generated Per Sec
Redo Generated Per Txn
Logons Per Sec
Logons Per Txn
User Calls Per Sec
User Calls Per Txn
Logical Reads Per Sec
Logical Reads Per Txn
Redo Writes Per Sec
Redo Writes Per Txn
Total Table Scans Per Sec
Total Table Scans Per Txn
Full Index Scans Per Sec
Full Index Scans Per Txn
Execute Without Parse Ratio
Soft Parse Ratio
Host CPU Utilization (%)
DB Block Gets Per Sec
DB Block Gets Per Txn
Consistent Read Gets Per Sec
Consistent Read Gets Per Txn
DB Block Changes Per Sec
DB Block Changes Per Txn
Consistent Read Changes Per Sec
Consistent Read Changes Per Txn
Database CPU Time Ratio
Library Cache Hit Ratio
Shared Pool Free %
Executions Per Txn
Executions Per Sec
Txns Per Logon
Database Time Per Sec
Average Active Sessions
Host CPU Usage Per Sec
Cell Physical IO Interconnect Bytes
Temp Space Used
Total PGA Allocated
Memory Sorts Ratio

It's probably easy to incorporate the additional metrics into the reporting.

Output of the info page Note: Our Agent in Kubernetes can't execute s6-svstat:

s6-svstat: command not found
$ agent status

2024-01-09 09:19:36 UTC | CLUSTER | WARN | (pkg/util/log/log.go:666 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2024-01-09 09:19:36 UTC | CLUSTER | INFO | (pkg/util/log/log.go:626 in func1) | 2 Features detected from environment: kubernetes,orchestratorexplorer
Getting the status from the agent.

===============================
Datadog Cluster Agent (v7.50.2)
===============================

  Status date: 2024-01-09 09:19:36.358 UTC (1704791976358)
  Agent start: 2024-01-08 19:59:50.748 UTC (1704743990748)
  Pid: 1
  Go Version: go1.20.12
  Build arch: amd64
  Agent flavor: cluster_agent
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System time: 2024-01-09 09:19:36.358 UTC (1704791976358)

  Hostnames
  =========
    host_aliases: [a573212b-6d81-4052-93aa-671bef947fca]
    hostname: aks-system-27839219-vmss000001-aks-k8s-cluster-live
    socket-fqdn: datadog-agent-cluster-agent-58cdf78bb6-724p8
    socket-hostname: datadog-agent-cluster-agent-58cdf78bb6-724p8
    hostname provider: container
    unused hostname providers:
      'hostname' configuration/environment: hostname is empty
      'hostname_file' configuration/environment: 'hostname_file' configuration is not enabled
      aws: not retrieving hostname from AWS: the host is not an ECS instance and other providers already retrieve non-default hostnames
      azure: azure_hostname_style is set to 'os'
      fargate: agent is not runnning on Fargate
      fqdn: FQDN hostname is not usable
      gce: unable to retrieve hostname from GCE: GCE metadata API error: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
      os: OS hostname is not usable

  Metadata
  ========

Leader Election
===============
  Leader Election Status:  Running
  Leader Name is: datadog-agent-cluster-agent-58cdf78bb6-724p8
  Last Acquisition of the lease: Mon, 08 Jan 2024 20:00:19 UTC
  Renewed leadership: Tue, 09 Jan 2024 09:19:35 UTC
  Number of leader transitions: 622 transitions

Custom Metrics Server
=====================
  Disabled: The external metrics provider is not enabled on the Cluster Agent

Cluster Checks Dispatching
==========================
  Status: Leader, serving requests
  Active agents: 1
  Check Configurations: 3
    - Dispatched: 3
    - Unassigned: 0

Admission Controller
====================
  Disabled: The admission controller is not enabled on the Cluster Agent

=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.yaml
      Total Runs: 3,199
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 1, Total: 839
      Service Checks: Last Run: 4, Total: 12,788
      Average Execution Time : 1.968s
      Last Execution Date : 2024-01-09 09:19:22 UTC (1704791962000)
      Last Successful Execution Date : 2024-01-09 09:19:22 UTC (1704791962000)

    orchestrator
    ------------
      Instance ID: orchestrator:c640d4e943da6c1d [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/orchestrator.d/conf.yaml.default
      Total Runs: 4,799
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 10ms
      Last Execution Date : 2024-01-09 09:19:33 UTC (1704791973000)
      Last Successful Execution Date : 2024-01-09 09:19:33 UTC (1704791973000)

=========
Forwarder
=========

  Transactions
  ============
    Cluster: 4,796
    ClusterRole: 643
    ClusterRoleBinding: 297
    CronJob: 0
    CustomResource: 0
    CustomResourceDefinition: 0
    DaemonSet: 481
    Deployment: 660
    Dropped: 0
    HighPriorityQueueFull: 0
    HorizontalPodAutoscaler: 0
    Ingress: 1,653
    Job: 0
    Namespace: 4,796
    Node: 3,238
    OrchestratorManifest: 6,752
    PersistentVolume: 456
    PersistentVolumeClaim: 772
    Pod: 3
    ReplicaSet: 2,138
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Role: 389
    RoleBinding: 339
    Service: 818
    ServiceAccount: 521
    StatefulSet: 323
    VerticalPodAutoscaler: 0

  Transaction Successes
  =====================
    Total number: 35879
    Successes By Endpoint:
      check_run_v1: 3,199
      intake: 406
      orchestrator: 22,323
      orchmanifest: 6,752
      series_v2: 3,199

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

==========
Endpoints
==========
  https://app.datadoghq.eu - API Key ending with:
      - 74b2d

=============
Autodiscovery
=============
  Enabled Features
  ================
    kubernetes
    orchestratorexplorer

=====================
Orchestrator Explorer
=====================
  Collection Status: The collection is at least partially running since the cache has been populated.
  Cluster Name: aks-k8s-cluster-live
  Cluster ID: 43d27346-bae4-4e3a-998a-e8b0e6cfcb9e
  Container scrubbing: enabled
  Manifest collection: enabled

  ======================
  Orchestrator Endpoints
  ======================
    https://orchestrator.datadoghq.eu - API Key ending with: 74b2d

  ===========
  Cache Stats
  ===========
    Elements in the cache: 1114

    ClusterRoleBinding
      Last Run: (Hits: 96 Miss: 0) | Total: (Hits: 435744 Miss: 24672)

    ClusterRole
      Last Run: (Hits: 114 Miss: 0) | Total: (Hits: 517446 Miss: 29298)

    Cluster
      Last Run: (Hits: 0 Miss: 1) | Total: (Hits: 0 Miss: 4796)

    CronJob
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    DaemonSet
      Last Run: (Hits: 10 Miss: 0) | Total: (Hits: 45331 Miss: 2629)

    Deployment
      Last Run: (Hits: 36 Miss: 0) | Total: (Hits: 163362 Miss: 9294)

    Ingress
      Last Run: (Hits: 40 Miss: 0) | Total: (Hits: 181522 Miss: 10318)

    Job
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    Namespace
      Last Run: (Hits: 24 Miss: 1) | Total: (Hits: 108917 Miss: 10983)

    Node
      Last Run: (Hits: 10 Miss: 1) | Total: (Hits: 52164 Miss: 5171)

    PersistentVolumeClaim
      Last Run: (Hits: 21 Miss: 0) | Total: (Hits: 95296 Miss: 5420)

    PersistentVolume
      Last Run: (Hits: 26 Miss: 0) | Total: (Hits: 117988 Miss: 6708)

    Pod
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 31 Miss: 7)

    ReplicaSet
      Last Run: (Hits: 424 Miss: 0) | Total: (Hits: 1.928726e+06 Miss: 109576)

    RoleBinding
      Last Run: (Hits: 45 Miss: 0) | Total: (Hits: 204255 Miss: 11565)

    Role
      Last Run: (Hits: 44 Miss: 0) | Total: (Hits: 199716 Miss: 11308)

    ServiceAccount
      Last Run: (Hits: 99 Miss: 0) | Total: (Hits: 449262 Miss: 25542)

    Service
      Last Run: (Hits: 91 Miss: 1) | Total: (Hits: 417497 Miss: 23735)

    StatefulSet
      Last Run: (Hits: 11 Miss: 0) | Total: (Hits: 49928 Miss: 2828)

  =====================
  Manifest Buffer Stats
  =====================
  Buffer Flushed : 6752 times
  Last Time Flushed Manifests : 5
  ==============================
  Manifests Flushed Per Resource
  ==============================
    ClusterRole : 29298
    ClusterRoleBinding : 24672
    DaemonSet : 2629
    Deployment : 9294
    Ingress : 10318
    Namespace : 10983
    Node : 5171
    PersistentVolume : 6708
    PersistentVolumeClaim : 5420
    Pod : 7
    ReplicaSet : 109576
    Role : 11308
    RoleBinding : 11565
    Service : 23735
    ServiceAccount : 25542
    StatefulSet : 2828

Additional environment details (Operating System, Cloud provider, etc): We're running Agent (v7.50.2) in Azure Kubernetes, installed via Helm (Chart v3.50.2). We're using Oracle 19 for our RDBMS.

Steps to reproduce the issue:

  1. Have an Oracle 19
  2. Install the Agent via Kubernetes
  3. Configure the Agent to collect Oracle metrics
  4. See a half-empty dashboard

Describe the results you received: The Oracle integration reports a few metrics

Describe the results you expected: The Oracle integration reports all metrics.

Additional information you deem important (e.g. issue happens only occasionally):

Shigawire commented 10 months ago

Got an update on this one: Apparently, there is a "new" Oracle monitoring integration (oracle-dbm). This integration records way more metrics than the oracle integration written in Python. After moving to the oracle-dbm integration and enabling dbm: true in the configuration, we see all reported values.

However, the official instructions to set up a datadog user in the Oracle instance are slightly off: https://docs.datadoghq.com/database_monitoring/setup_oracle/selfhosted/?tab=noncdb#create-view

The dd_session view requires reading from the "x"-tables x$ksuse, x$kslwt and x$ksled but it's impossible to grant a select from those tables to the datadog-user. Hence, it's necessary to create views onto those tables and grant select permissions to the datadog user with the sysdba-user:

create view v_x$ksuse as select * from x$ksuse;
create view v_x$kslwt as select * from x$kslwt;
create view v_x$ksled as select * from x$ksled;

create synonym datadog.v_x$ksuse for v_x$ksuse;
create synonym datadog.v_x$kslwt for v_x$kslwt;
create synonym datadog.v_x$ksled for v_x$ksled;

and then create the dd_session-view with these new views instead:

FROM
  v_x$ksuse s,
  v_x$kslwt w,
  v_x$ksled e,
  v$sql sq,
  v$sql sq_prev,
  v$containers c,
  v$sqlcommand comm
WHERE

Then, the datadog-user can successfully report the metrics.