DataDog / integrations-core

Core integrations of the Datadog Agent
BSD 3-Clause "New" or "Revised" License

postgres - Out of Memory on Initialize #16175

Closed ashleywxwx closed 11 months ago

ashleywxwx commented 11 months ago

We're adding postgres monitoring to an existing Datadog Agent and seeing an "Out of Memory" exception when trying to connect. Specifically, we get the error when the Agent's postgres check tries to connect to our AWS RDS Postgres instance. I would appreciate any help troubleshooting this issue.

Output of the info page

===============
Agent (v7.49.0)
===============

  Status date: 2023-11-07 22:10:07.599 UTC (1699395007599)
  Agent start: 2023-11-07 20:23:23.814 UTC (1699388603814)
  Pid: 1
  Go Version: go1.20.10
  Python Version: 3.9.18
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 31µs
    System time: 2023-11-07 22:10:07.599 UTC (1699395007599)

  Host Info
  =========
    bootTime: 2023-08-23 14:48:37 UTC (1692802117000)
    hostId: ec2914bc-3c53-8c30-8208-c271a3a2d78a
    kernelArch: x86_64
    kernelVersion: 5.10.184-175.731.amzn2.x86_64
    os: linux
    platform: ubuntu
    platformFamily: debian
    platformVersion: 23.04
    procs: 151
    uptime: 1829h34m49s

  Hostnames
  =========
    cluster-name: dev-company
    ec2-hostname: ip-10-0-2-28.us-east-2.compute.internal
    host_aliases: [i-0b3db96a4153207be ip-10-0-2-28.us-east-2.compute.internal-dev-company]
    hostname: i-0b3db96a4153207be
    instance-id: i-0b3db96a4153207be
    socket-fqdn: datadog-agent-7x4bs
    socket-hostname: datadog-agent-7x4bs
    host tags:
      cluster_name:dev-company
      kube_cluster_name:dev-company
      kube_node:ip-10-0-2-28.us-east-2.compute.internal
    hostname provider: aws
    unused hostname providers:
      'hostname' configuration/environment: hostname is empty
      'hostname_file' configuration/environment: 'hostname_file' configuration is not enabled
      azure: azure_hostname_style is set to 'os'
      fargate: agent is not runnning on Fargate
      fqdn: FQDN hostname is not usable
      gce: unable to retrieve hostname from GCE: GCE metadata API error: status code 401 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
      os: OS hostname is not usable

  Metadata
  ========
    agent_version: 7.49.0
    cloud_provider: AWS
    config_apm_dd_url:
    config_dd_url:
    config_logs_dd_url:
    config_logs_socks5_proxy_address:
    config_no_proxy: [169.254.169.254 100.100.100.200]
    config_process_dd_url:
    config_proxy_http:
    config_proxy_https:
    config_site:
    feature_apm_enabled: true
    feature_cspm_enabled: false
    feature_cws_enabled: false
    feature_cws_network_enabled: true
    feature_cws_remote_config_enabled: true
    feature_cws_security_profiles_enabled: false
    feature_dynamic_instrumentation_enabled: false
    feature_fips_enabled: false
    feature_imdsv2_enabled: true
    feature_logs_enabled: true
    feature_networks_enabled: false
    feature_networks_http_enabled: false
    feature_networks_https_enabled: false
    feature_oom_kill_enabled: false
    feature_otlp_enabled: false
    feature_process_enabled: false
    feature_process_language_detection_enabled: false
    feature_processes_container_enabled: true
    feature_remote_configuration_enabled: true
    feature_tcp_queue_length_enabled: false
    feature_usm_enabled: false
    feature_usm_go_tls_enabled: false
    feature_usm_http2_enabled: false
    feature_usm_http_by_status_code_enabled: false
    feature_usm_istio_enabled: false
    feature_usm_java_tls_enabled: false
    feature_usm_kafka_enabled: false
    feature_windows_crash_detection_enabled: false
    flavor: agent
    hostname_source: aws
    install_method_installer_version: v1.2.0
    install_method_tool: datadog-operator
    install_method_tool_version: datadog-operator
    logs_transport: HTTP
    system_probe_core_enabled: true
    system_probe_gateway_lookup_enabled: true
    system_probe_kernel_headers_download_enabled: false
    system_probe_max_connections_per_message: 600
    system_probe_prebuilt_fallback_enabled: true
    system_probe_protocol_classification_enabled: true
    system_probe_root_namespace_enabled: true
    system_probe_runtime_compilation_enabled: false
    system_probe_telemetry_enabled: true
    system_probe_track_tcp_4_connections: true
    system_probe_track_tcp_6_connections: true
    system_probe_track_udp_4_connections: true
    system_probe_track_udp_6_connections: true

=========
Collector
=========

  Running Checks
  ==============

    cert_manager (3.1.1)
    --------------------
      Instance ID: cert_manager:2b1f188aabed9bba [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cert_manager.d/auto_conf.yaml
      Total Runs: 427
      Metric Samples: Last Run: 2, Total: 854
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 427
      Average Execution Time : 3ms
      Last Execution Date : 2023-11-07 22:09:56 UTC (1699394996000)
      Last Successful Execution Date : 2023-11-07 22:09:56 UTC (1699394996000)

    container
    ---------
      Instance ID: container [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/container.d/conf.yaml.default
      Total Runs: 426
      Metric Samples: Last Run: 760, Total: 323,760
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 14ms
      Last Execution Date : 2023-11-07 22:09:55 UTC (1699394995000)
      Last Successful Execution Date : 2023-11-07 22:09:55 UTC (1699394995000)

    containerd
    ----------
      Instance ID: containerd [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/containerd.d/conf.yaml.default
      Total Runs: 427
      Metric Samples: Last Run: 150, Total: 64,050
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 156ms
      Last Execution Date : 2023-11-07 22:10:02 UTC (1699395002000)
      Last Successful Execution Date : 2023-11-07 22:10:02 UTC (1699395002000)

    coredns (3.0.0)
    ---------------
      Instance ID: coredns:9089a795cd6c958f [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/coredns.d/auto_conf.yaml
      Total Runs: 427
      Metric Samples: Last Run: 161, Total: 68,747
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 427
      Average Execution Time : 15ms
      Last Execution Date : 2023-11-07 22:10:03 UTC (1699395003000)
      Last Successful Execution Date : 2023-11-07 22:10:03 UTC (1699395003000)

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 426
      Metric Samples: Last Run: 9, Total: 3,827
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-11-07 22:09:54 UTC (1699394994000)
      Last Successful Execution Date : 2023-11-07 22:09:54 UTC (1699394994000)

    cri
    ---
      Instance ID: cri [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cri.d/conf.yaml.default
      Total Runs: 427
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 13ms
      Last Execution Date : 2023-11-07 22:10:01 UTC (1699395001000)
      Last Successful Execution Date : 2023-11-07 22:10:01 UTC (1699395001000)

    disk (5.0.0)
    ------------
      Instance ID: disk:67cc0574430a16ba [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 426
      Metric Samples: Last Run: 902, Total: 384,252
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 47ms
      Last Execution Date : 2023-11-07 22:09:53 UTC (1699394993000)
      Last Successful Execution Date : 2023-11-07 22:09:53 UTC (1699394993000)

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 427
      Metric Samples: Last Run: 5, Total: 2,135
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-11-07 22:10:00 UTC (1699395000000)
      Last Successful Execution Date : 2023-11-07 22:10:00 UTC (1699395000000)

    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 427
      Metric Samples: Last Run: 41, Total: 17,480
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-11-07 22:10:07 UTC (1699395007000)
      Last Successful Execution Date : 2023-11-07 22:10:07 UTC (1699395007000)

    kubelet (7.9.2)
    ---------------
      Instance ID: kubelet:2b9bec749170d31d [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 321
      Metric Samples: Last Run: 1,083, Total: 347,607
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 5, Total: 1,604
      Average Execution Time : 287ms
      Last Execution Date : 2023-11-07 22:10:06 UTC (1699395006000)
      Last Successful Execution Date : 2023-11-07 22:10:06 UTC (1699395006000)

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 427
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-11-07 22:09:59 UTC (1699394999000)
      Last Successful Execution Date : 2023-11-07 22:09:59 UTC (1699394999000)

    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 427
      Metric Samples: Last Run: 6, Total: 2,562
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-11-07 22:10:06 UTC (1699395006000)
      Last Successful Execution Date : 2023-11-07 22:10:06 UTC (1699395006000)

    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 427
      Metric Samples: Last Run: 20, Total: 8,540
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-11-07 22:09:58 UTC (1699394998000)
      Last Successful Execution Date : 2023-11-07 22:09:58 UTC (1699394998000)

    network (3.0.0)
    ---------------
      Instance ID: network:4b0649b7e11f0772 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 427
      Metric Samples: Last Run: 214, Total: 91,378
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 3ms
      Last Execution Date : 2023-11-07 22:10:05 UTC (1699395005000)
      Last Successful Execution Date : 2023-11-07 22:10:05 UTC (1699395005000)

    ntp
    ---
      Instance ID: ntp:3c427a42a70bbf8 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 8
      Metric Samples: Last Run: 1, Total: 8
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 8
      Average Execution Time : 0s
      Last Execution Date : 2023-11-07 22:08:26 UTC (1699394906000)
      Last Successful Execution Date : 2023-11-07 22:08:26 UTC (1699394906000)

    postgres (15.1.1)
    -----------------
      Instance ID: postgres:941660ed43814218 [ERROR]
      Configuration Source: kube_services:kube_service://my-system/postgres
      Total Runs: 312
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 66ms
      Last Execution Date : 2023-11-07 22:10:04 UTC (1699395004000)
      Last Successful Execution Date : Never
      metadata:
        resolved_hostname: devdb-postgres.redacted.us-east-2.rds.amazonaws.com
      Error: out of memory

      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/base/checks/base.py", line 1210, in run
          initialization()
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/postgres/postgres.py", line 714, in _connect
          with self.db():
        File "/opt/datadog-agent/embedded/lib/python3.9/contextlib.py", line 119, in __enter__
          return next(self.gen)
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/postgres/postgres.py", line 197, in db
          self._db = self._new_connection(self._config.dbname)
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/datadog_checks/postgres/postgres.py", line 703, in _new_connection
          conn = psycopg2.connect(**args)
        File "/opt/datadog-agent/embedded/lib/python3.9/site-packages/psycopg2/__init__.py", line 127, in connect
          conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
      psycopg2.OperationalError: out of memory

    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 427
      Metric Samples: Last Run: 1, Total: 427
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-11-07 22:09:57 UTC (1699394997000)
      Last Successful Execution Date : 2023-11-07 22:09:57 UTC (1699394997000)

========
JMXFetch
========

  Information
  ==================
  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    Cluster: 0
    ClusterRole: 0
    ClusterRoleBinding: 0
    CronJob: 0
    CustomResource: 0
    CustomResourceDefinition: 0
    DaemonSet: 0
    Deployment: 0
    Dropped: 0
    HighPriorityQueueFull: 0
    HorizontalPodAutoscaler: 0
    Ingress: 0
    Job: 0
    Namespace: 0
    Node: 0
    OrchestratorManifest: 0
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 0
    ReplicaSet: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Role: 0
    RoleBinding: 0
    Service: 0
    ServiceAccount: 0
    StatefulSet: 0
    VerticalPodAutoscaler: 0

  Transaction Successes
  =====================
    Total number: 901
    Successes By Endpoint:
      check_run_v1: 426
      intake: 38
      metadata_v1: 11
      series_v2: 426

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

  API Keys status
  ===============
    API key ending with 10ac1: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 10ac1

==========
Logs Agent
==========
    Reliable: Sending compressed logs in HTTPS to agent-http-intake.logs.datadoghq.com on port 443
    BytesSent: 9.964603e+06
    EncodedBytesSent: 1.80612e+06
    LogsProcessed: 6929
    LogsSent: 6928
    CoreAgentProcessOpenFiles: 49
    OSFileLimit: 1048576

  ============
  Integrations
  ============

 ....Redacted

=============
Process Agent
=============

  Version: 7.49.0
  Status date: 2023-11-07 22:10:07.614 UTC (1699395007614)
  Process Agent Start: 2023-11-07 20:23:25.293 UTC (1699388605293)
  Pid: 1
  Go Version: go1.20.10
  Build arch: amd64
  Log Level: info
  Enabled Checks: [pod process rtprocess]
  Allocated Memory: 21,643,080 bytes
  Hostname: i-0b3db96a4153207be
  System Probe Process Module Status: Not running
  Process Language Detection Enabled: False

  =================
  Process Endpoints
  =================
    https://process.datadoghq.com - API Key ending with:
        - 10ac1

  =========
  Collector
  =========
    Last collection time: 2023-11-07 22:10:06
    Docker socket:
    Number of processes: 79
    Number of containers: 23
    Process Queue length: 0
    RTProcess Queue length: 0
    Connections Queue length: 0
    Event Queue length: 0
    Pod Queue length: 0
    Process Bytes enqueued: 0
    RTProcess Bytes enqueued: 0
    Connections Bytes enqueued: 0
    Event Bytes enqueued: 0
    Pod Bytes enqueued: 0
    Drop Check Payloads: []

  ==========
  Extractors
  ==========

    Workloadmeta
    ============
      Cache size: 0
      Stale diffs discarded: 0
      Diffs dropped: 0

=========
APM Agent
=========
  Status: Running
  Pid: 1
  Uptime: 6403 seconds
  Mem alloc: 12,086,152 bytes
  Hostname: i-0b3db96a4153207be
  Receiver: 0.0.0.0:8126
  Endpoints:
    https://trace.agent.datadoghq.com

  Receiver (previous minute)
  ==========================
    From nodejs v20.8.1 (v8), client 4.16.0
      Traces received: 128 (117,448 bytes)
      Spans received: 128

    Priority sampling rate for 'service:company-api-redis,env:dev': 100.0%

  Writer (previous minute)
  ========================
    Traces: 0 payloads, 0 traces, 0 events, 0 bytes
    Stats: 0 payloads, 0 stats buckets, 0 bytes

==========
Aggregator
==========
  Checks Metric Sample: 1,556,448
  Dogstatsd Metric Sample: 134,326
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 426
  Series Flushed: 1,263,844
  Service Check: 2,893
  Service Checks Flushed: 3,311

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 134,325
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 23,390,557
  Udp Packet Reading Errors: 0
  Udp Packets: 62,996
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 1
  Unterminated Metric Errors: 0

=====================
Datadog Cluster Agent
=====================

  - Datadog Cluster Agent endpoint detected: https://172.20.186.62:5005
  Successfully connected to the Datadog Cluster Agent.
  - Running: 7.49.0+commit.9cb8e9c

=============
Autodiscovery
=============
  Enabled Features
  ================
    containerd
    cri
    kubernetes
    orchestratorexplorer

====================
Remote Configuration
====================

    Organization enabled: False
    API Key: Not authorized, add the Remote Configuration Read permission to enable it for this agent.
    Last error: None

====
OTLP
====

  Status: Not enabled
  Collector status: Not running

Postgres check configured via Service definition

apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: my-system
  labels:
    tags.datadoghq.com/env: dev
  annotations:
    ad.datadoghq.com/service.check_names: '["postgres"]'
    ad.datadoghq.com/service.init_configs: '[{}]'
    ad.datadoghq.com/service.instances: |
      [
        {
          "dbm": true,
          "host": "devdb-postgres.redacted.us-east-2.rds.amazonaws.com",
          "port": 5432,
          "username": "datadog",
          "dbname": "api",
          "tags": [
            "dbinstanceidentifier:devdb-postgres"
          ],
          "aws": {
            "instance_endpoint": "devdb-postgres.redacted.us-east-2.rds.amazonaws.com",
            "region": "us-east-2"
          }
        }
      ]
spec:
  ports:
    - port: 5432
      protocol: TCP
      targetPort: 5432
      name: postgres

Agent Configuration

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: my-system
spec:
  features:
    apm:
      enabled: true
      hostPortConfig:
        enabled: true
    eventCollection:
      collectKubernetesEvents: true
    liveProcessCollection:
      enabled: true
    logCollection:
      enabled: true
      containerCollectAll: true
    liveContainerCollection:
      enabled: true
    orchestratorExplorer:
      enabled: true
  global:
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
      appSecret:
        secretName: datadog-secret
        keyName: app-key
    site: datadoghq.com
  override:
    clusterAgent:
      image:
        name: gcr.io/datadoghq/cluster-agent:latest
    nodeAgent:
      image:
        name: gcr.io/datadoghq/agent:latest
      env:
      - name: "DD_EC2_PREFER_IMDSV2"
        value: "true"

IAM Role Configuration

data "aws_iam_policy_document" "db_monitor" {
  statement {
    effect = "Allow"
    actions = [
      "rds-db:connect"
    ]
    resources = [
      "arn:aws:rds-db:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:dbuser:${module.db.db_instance_resource_id}/${local.db_monitor_role_name}"
    ]
  }
}

data "aws_iam_policy_document" "db_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["rds.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "db_monitor" {
  name = local.db_monitor_role_name

  assume_role_policy = data.aws_iam_policy_document.db_assume_role.json

  inline_policy {
    name   = "DatadogRDSMonitoringPolicy"
    policy = data.aws_iam_policy_document.db_monitor.json
  }
}
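
For reference, the resource string in the db_monitor policy above follows the rds-db ARN layout. The sketch below rebuilds it in Python so the pieces are easier to check by eye. The function name and the sample account and resource IDs are made up for illustration. AWS documents the last path segment as the database user name, so it may be worth confirming that it matches the username in the check instance.

```python
def rds_db_connect_arn(region: str, account_id: str, db_resource_id: str, db_user: str) -> str:
    """Build the resource ARN used by an rds-db:connect policy statement.

    Hypothetical helper: it mirrors the interpolated string in the Terraform
    policy document and makes no AWS calls.
    """
    return f"arn:aws:rds-db:{region}:{account_id}:dbuser:{db_resource_id}/{db_user}"


if __name__ == "__main__":
    # Placeholder values; substitute your own account ID and DB resource ID.
    print(rds_db_connect_arn("us-east-2", "123456789012", "db-EXAMPLERESOURCEID", "datadog"))
```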

Additional environment details (Operating System, Cloud provider, etc):

Steps to reproduce the issue:

  1. Follow the basic Grant Agent Access docs, using IAM managed authentication
    • This includes creating the new parameter groups, attaching them, and rebooting the db, as well as running the SQL commands
  2. Apply postgres service resource definition to kubernetes
  3. Allow agents to attempt to connect to database

Describe the results you received:

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

jmeunier28 commented 11 months ago

Hi @ashleywxwx, I was looking into this briefly and have some reason to believe that there may be a memory leak in the version of the boto library that is shipped in 7.49 of the agent. As a quick test, could you downgrade to 7.46, the first agent version that contained this feature, to see if you still see the issue?

We have other customers using this feature without any issues on earlier versions of the agent, so I am trying to rule out the boto API library version bump.

ashleywxwx commented 11 months ago

@jmeunier28

I have updated to gcr.io/datadoghq/agent:7.46.0-rc.2 and am still seeing the same issue. Let me know if there's a different tag I should use. I'll also go ahead and try a 7.45 version here for kicks.

jmeunier28 commented 11 months ago

Weird. The rc suffix refers to "release candidate", which isn't an official DD release. It should be this one: docker pull datadog/agent:7.46.0

ashleywxwx commented 11 months ago

Oh, my mistake, that explains why I couldn't find a tag under 7.46, I'll give 7.46.0 a try

ashleywxwx commented 11 months ago

For what it's worth, using version 7.45.1 I now see a different error, but haven't dug in much further. Configuration is the same (e.g. aws.region is provided), so maybe there is a difference there?

2023-11-10 20:17:43 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:postgres | Error running check: [{"message": "fe_sendauth: no password supplied
", "traceback": "Traceback (most recent call last):
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1135, in run
    self.check(instance)
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 725, in check
    raise e
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 693, in check
    self._connect()
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 522, in _connect
    self.db = self._new_connection(self._config.dbname)
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 506, in _new_connection
    conn = psycopg2.connect(**args)
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/psycopg2/__init__.py\", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: fe_sendauth: no password supplied

"}]
jmeunier28 commented 11 months ago

IAM authentication is only supported starting in version 7.46 of the Agent's postgres check.

ashleywxwx commented 11 months ago

Okay, I have pinned to tag 7.46.0, but am still seeing the error.

2023-11-10 20:40:29 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:postgres | Error running check: [{"message": "out of memory
", "traceback": "Traceback (most recent call last):
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1142, in run
    self.check(instance)
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 750, in check
    raise e
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 717, in check
    self._connect()
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 546, in _connect
    self.db = self._new_connection(self._config.dbname)
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 530, in _new_connection
    conn = psycopg2.connect(**args)
  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/psycopg2/__init__.py\", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: out of memory

"}]
jmeunier28 commented 11 months ago

@ashleywxwx can you tell me more about your setup? How much memory are you giving the agent container? Do you see this only when configuring the agent with IAM authentication or do you also see it when trying to connect via username/password?

OmriBenShoham commented 11 months ago

The same applies here, using datadog-agent version: Agent 7.47.1 - Commit: 24dcc70 - Serialization version: v5.0.90 - Go version: go1.20.6

jmeunier28 commented 11 months ago

@OmriBenShoham can you please let me know your exact deploy setup:

  1. How are you deploying the agent?
  2. What is the exact hardware it's running on?
  3. Is it containerized or not?
  4. How much memory is being allocated to the agent?
  5. Does it work fine without enabling IAM authentication?

jmeunier28 commented 11 months ago

@ashleywxwx and @OmriBenShoham we were able to reproduce this internally by removing the permissions that need to be granted for the IAM user. We ran REVOKE rds_iam FROM datadog; and then saw the same error on agent startup:

          conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
      psycopg2.OperationalError: out of memory

This permission is mentioned in the docs linked below and is necessary to make the feature work. To set it up, log in to your database instance as the root user and grant the rds_iam role to the new user:

GRANT rds_iam TO <YOUR_IAM_ROLE>;

Once the role is all set up and attached to your instance, you can configure your instance config like this:

instances:
  - dbm: true
    host: example-endpoint.us-east-2.rds.amazonaws.com
    port: 5432
    username: <YOUR IAM ROLE NAME>
    aws:
      region: <YOUR DB HOST'S REGION>

Can you double check that you performed all of these steps correctly in accordance with these docs https://docs.datadoghq.com/database_monitoring/guide/managed_authentication/#configure-iam-authentication?

As a side note, the out of memory error is very misleading. I would expect a clearer error to be raised here. We are investigating this more internally, but hopefully this unblocks you.
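
One way to confirm the grant before restarting the Agent is to ask Postgres directly whether the monitoring user is a member of rds_iam. The following is a minimal sketch, not part of the integration: has_rds_iam is a hypothetical helper name, and it assumes a DB-API cursor (for example one obtained from psycopg2) connected as a user allowed to inspect role membership. pg_has_role is a built-in PostgreSQL function.

```python
def has_rds_iam(cursor, username: str) -> bool:
    """Return True if `username` is a member of the rds_iam role.

    `cursor` is any DB-API cursor with %s-style parameters (psycopg2 uses
    this style). Postgres raises an error if the role does not exist.
    """
    cursor.execute("SELECT pg_has_role(%s, 'rds_iam', 'MEMBER')", (username,))
    row = cursor.fetchone()
    return bool(row and row[0])
```

In the reporter's setup the call would be has_rds_iam(cursor, "datadog"), since datadog is the username in the check instance. A False result would point at the missing GRANT described above.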

ashleywxwx commented 11 months ago

This permission is mentioned here & is necessary to make the feature work. To do this log in to your database instance as the root user, and grant the rds_iam role to the new user:

GRANT rds_iam TO <YOUR_IAM_ROLE>;

Okay, that was my mistake! The instructions were a little unclear, I believe in my case it is literally GRANT rds_iam to datadog (as opposed to an ARN or the name of the role created). Does that sound correct? Otherwise I get...

api.public> GRANT rds_iam TO iam_datadog_agent_dev
role "iam_datadog_agent_dev" is already a member of role "rds_iam"

Once I ran that command, I was able to connect. Well, I now hit a different error that is beyond the scope of this thread and that I'll look into, but I wanted to confirm that I could get past the Out of Memory exception. Thank you for your time!

Here is the current error for completeness. I'll add encryption support to address it:

datadog-agent-qtrmp agent 2023-11-21 18:16:12 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:postgres | Error running check: [{"message": "FATAL:  pg_hba.conf rejects connection for host \"10.0.3.69\", user \"datadog\", database \"api\", no encryption\n", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1142, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 750, in check\n    raise e\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 717, in check\n    self._connect()\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 546, in _connect\n    self.db = self._new_connection(self._config.dbname)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 530, in _new_connection\n    conn = psycopg2.connect(**args)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/psycopg2/__init__.py\", line 127, in connect\n    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)\npsycopg2.OperationalError: FATAL:  pg_hba.conf rejects connection for host \"10.0.3.69\", user \"datadog\", database \"api\", no encryption\n\n"}]

ashleywxwx commented 11 months ago

I was able to connect with a standard user & password, which should unblock me. If there is additional troubleshooting I can help with around this error, please let me know. Otherwise, we can close this issue.

Thank you for the help @jmeunier28
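
For anyone landing here from a search, the three connection errors that came up in this thread can be summarized as a small lookup. This is only a sketch that restates what was found above. The diagnose function and the hint wording are illustrative, not something the Agent provides, and other causes for the same messages are possible.

```python
# Substring of the psycopg2 error message -> what this thread found.
HINTS = [
    ("out of memory",
     "With IAM authentication configured, this was traced to the database "
     "user missing the rds_iam role; check the GRANT rds_iam step."),
    ("fe_sendauth: no password supplied",
     "Seen on Agent 7.45.1; IAM authentication is only supported from 7.46."),
    ("no encryption",
     "pg_hba.conf rejected an unencrypted connection; the reporter planned "
     "to enable encryption on the connection."),
]


def diagnose(message: str) -> str:
    """Return the hint recorded in this thread for a connection error message."""
    lowered = message.lower()
    for needle, hint in HINTS:
        if needle in lowered:
            return hint
    return "No hint recorded in this thread for that message."


if __name__ == "__main__":
    print(diagnose("psycopg2.OperationalError: out of memory"))
```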

jmeunier28 commented 11 months ago

@ashleywxwx FWIW we had a few other people report this as well. Our docs were not very clear initially and told people to set the region in the aws block, which is required for IAM. What was not made clear is that setting region means we will attempt IAM authentication and ignore the password set by the user. We have since updated our docs to make this distinction clearer here.

We also have some updates to make the instance configuration for IAM clearer, which will come out in a future release of the agent. Thanks for pointing out the issue to us!
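
To illustrate the behavior described in this comment, here is a rough sketch of the decision as stated for the Agent versions discussed in this thread: if aws.region is present in the instance, IAM authentication is attempted and any password is ignored. This is not the integration's actual code. auth_mode is a hypothetical name, and later Agent releases may decide differently.

```python
import json


def auth_mode(instance: dict) -> str:
    """Return "iam" if the instance sets aws.region, otherwise "password"."""
    aws = instance.get("aws") or {}
    return "iam" if aws.get("region") else "password"


if __name__ == "__main__":
    # Shape taken from the ad.datadoghq.com/service.instances annotation above.
    annotation = """
    [{"dbm": true, "host": "devdb-postgres.redacted.us-east-2.rds.amazonaws.com",
      "port": 5432, "username": "datadog", "dbname": "api",
      "aws": {"instance_endpoint": "devdb-postgres.redacted.us-east-2.rds.amazonaws.com",
              "region": "us-east-2"}}]
    """
    for instance in json.loads(annotation):
        print(instance["host"], "->", auth_mode(instance))
```

With the instance from this issue the result is iam, which is why the missing rds_iam grant mattered even though the rest of the configuration looked complete.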