Closed ashleywxwx closed 11 months ago
Hi @ashleywxwx, I was looking into this briefly and have some reason to believe that there may be a memory leak in the version of the boto library that is shipped in 7.49 of the agent. As a quick test, could you downgrade to 7.46, the first agent version that contained this feature, to see if you still see the issue?
We have other customers using this feature without any issues on earlier versions of the agent, so I am trying to rule out the boto API library version bump.
@jmeunier28
I have updated to gcr.io/datadoghq/agent:7.46.0-rc.2 and am still receiving the same issue. Let me know if there's a different tag I should use. I'll also go ahead and try a 7.45 version here for kicks.
Weird, that rc refers to "release candidate", which isn't an official DD release. It should be this one: docker pull datadog/agent:7.46.0
Oh, my mistake, that explains why I couldn't find a tag under 7.46. I'll give 7.46.0 a try.
For what it's worth, using version 7.45.1 I now see a different error, but haven't dug in much further. Configuration is the same (e.g. aws.region is provided), so maybe a difference there?
2023-11-10 20:17:43 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:postgres | Error running check: [{"message": "fe_sendauth: no password supplied
", "traceback": "Traceback (most recent call last):
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1135, in run
self.check(instance)
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 725, in check
raise e
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 693, in check
self._connect()
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 522, in _connect
self.db = self._new_connection(self._config.dbname)
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 506, in _new_connection
conn = psycopg2.connect(**args)
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/psycopg2/__init__.py\", line 127, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: fe_sendauth: no password supplied
"}]
IAM is only supported starting in version 7.46 of the postgres agent
Okay, I have pinned to tag 7.46.0, but am still seeing the error.
2023-11-10 20:40:29 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:postgres | Error running check: [{"message": "out of memory
", "traceback": "Traceback (most recent call last):
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1142, in run
self.check(instance)
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 750, in check
raise e
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 717, in check
self._connect()
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 546, in _connect
self.db = self._new_connection(self._config.dbname)
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 530, in _new_connection
conn = psycopg2.connect(**args)
File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/psycopg2/__init__.py\", line 127, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: out of memory
"}]
@ashleywxwx can you tell me more about your setup? How much memory are you giving the agent container? Do you see this only when configuring the agent with IAM authentication or do you also see it when trying to connect via username/password?
The same applies here, using datadog-agent version:
Agent 7.47.1 - Commit: 24dcc70 - Serialization version: v5.0.90 - Go version: go1.20.6
@OmriBenShoham can you please let me know your exact deploy setup:
@ashleywxwx and @OmriBenShoham we were able to reproduce this internally by removing the permissions that need to be granted for the IAM user. So, we ran REVOKE rds_iam from datadog; and then saw the same error on agent startup:
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: out of memory
This permission is mentioned here & is necessary to make the feature work. To do this log in to your database instance as the root user, and grant the rds_iam role to the new user:
GRANT rds_iam TO <YOUR_IAM_ROLE>;
Once the role is all set up and attached to your instance, you can configure your instance config like this:
instances:
  - dbm: true
    host: example-endpoint.us-east-2.rds.amazonaws.com
    port: 5432
    username: <YOUR IAM ROLE NAME>
    aws:
      region: <YOUR DB HOST'S REGION>
Can you double check that you performed all of these steps correctly in accordance with these docs https://docs.datadoghq.com/database_monitoring/guide/managed_authentication/#configure-iam-authentication?
As a side note, the out of memory error is very misleading. I would expect a more accurate error to be raised here. We are investigating this more internally, but hopefully this unblocks you.
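To check the rds_iam grant independently of the agent, something like the following may help. It is only a minimal sketch, not the agent's actual code: it assumes boto3 and psycopg2 are installed and AWS credentials are available, and the helper names (boto3_token, iam_connect_args, check_iam_login) are made up for illustration. It requests a short-lived IAM token and uses it as the password. sslmode is set to require because RDS IAM authentication needs an SSL connection.

```python
from typing import Callable, Dict


def boto3_token(host: str, port: int, user: str, region: str) -> str:
    """Request a short-lived RDS IAM auth token (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch loads without it

    client = boto3.client("rds", region_name=region)
    return client.generate_db_auth_token(
        DBHostname=host, Port=port, DBUsername=user, Region=region
    )


def iam_connect_args(
    host: str,
    port: int,
    user: str,
    region: str,
    dbname: str = "postgres",
    token_provider: Callable[[str, int, str, str], str] = boto3_token,
) -> Dict[str, object]:
    """Build psycopg2 keyword arguments that use an IAM token as the password."""
    return {
        "host": host,
        "port": port,
        "user": user,
        "dbname": dbname,
        "password": token_provider(host, port, user, region),
        "sslmode": "require",  # IAM authentication on RDS requires SSL
    }


def check_iam_login(host: str, port: int, user: str, region: str, dbname: str) -> str:
    """Open a connection with an IAM token and return the logged-in user name."""
    import psycopg2  # imported lazily; only needed for the live check

    conn = psycopg2.connect(**iam_connect_args(host, port, user, region, dbname))
    try:
        cur = conn.cursor()
        cur.execute("SELECT current_user")
        return cur.fetchone()[0]
    finally:
        conn.close()
```

Calling check_iam_login("example-endpoint.us-east-2.rds.amazonaws.com", 5432, "datadog", "us-east-2", "api") from a host with the same IAM role as the agent should show whether the database user can log in with a token before the agent is involved.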
Okay, that was my mistake! The instructions were a little unclear. I believe in my case it is literally GRANT rds_iam to datadog (as opposed to an ARN or the name of the role created). Does that sound correct? Otherwise I get...
api.public> GRANT rds_iam TO iam_datadog_agent_dev
role "iam_datadog_agent_dev" is already a member of role "rds_iam"
Once I ran that command, I was able to connect. Well, I now hit a different error that is beyond the scope of this thread and that I'll look into separately, but I wanted to confirm that I could get past the Out of Memory exception. Thank you for your time!
Current error, I'll add encryption support to address, but for completeness:
datadog-agent-qtrmp agent 2023-11-21 18:16:12 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:postgres | Error running check: [{"message": "FATAL: pg_hba.conf rejects connection for host \"10.0.3.69\", user \"datadog\", database \"api\", no encryption\n", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1142, in run\n self.check(instance)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 750, in check\n raise e\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 717, in check\n self._connect()\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 546, in _connect\n self.db = self._new_connection(self._config.dbname)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/postgres/postgres.py\", line 530, in _new_connection\n conn = psycopg2.connect(**args)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/psycopg2/__init__.py\", line 127, in connect\n conn = _connect(dsn, connection_factory=connection_factory, **kwasync)\npsycopg2.OperationalError: FATAL: pg_hba.conf rejects connection for host \"10.0.3.69\", user \"datadog\", database \"api\", no encryption\n\n"}]
I was able to connect with a standard user & password, which should unblock me. If there is additional troubleshooting I can help with around this error, please let me know. Otherwise, we can close this issue.
Thank you for the help @jmeunier28
@ashleywxwx FWIW we had a few other people report this as well. Our docs were not very clear initially and told people to set the region in the aws block, which is required for IAM. What was not made clear is that setting region means we will attempt IAM authentication and ignore the password set by the user. We have since updated our docs to make this distinction clearer here.
We also have some updates to make the instance configuration for IAM more clear, which will come out in a future release of the agent. Thanks for pointing out the issue to us!
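For anyone else landing here, the behaviour described above can be pictured roughly like this. This is a hypothetical illustration of the rule as stated in this thread, not the agent's actual implementation, and the function names select_auth and password_is_ignored are made up.

```python
def select_auth(instance: dict) -> str:
    """Return which authentication mode an instance config would end up using.

    Mirrors the rule described in this thread: if aws.region is set, IAM
    authentication is attempted and any configured password is ignored.
    """
    aws = instance.get("aws") or {}
    if aws.get("region"):
        return "iam"
    if instance.get("password"):
        return "password"
    return "none"


def password_is_ignored(instance: dict) -> bool:
    """True when a password is configured but IAM would be attempted instead."""
    return select_auth(instance) == "iam" and bool(instance.get("password"))


# Example: region and password both set, so the password is silently unused.
mixed = {
    "host": "example-endpoint.us-east-2.rds.amazonaws.com",
    "username": "datadog",
    "password": "<PASSWORD>",
    "aws": {"region": "us-east-2"},
}
print(select_auth(mixed), password_is_ignored(mixed))
```

So if you intend to use a standard username and password, leaving region out of the aws block is what keeps the password in play.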
We're adding postgres monitoring to an existing datadog agent and seeing an "Out of Memory" exception when trying to connect. We specifically get the error when the postgres agent checker tries to connect to our AWS RDS Postgres instance. I would appreciate any help troubleshooting this issue.
Output of the info page
Postgres check configured via Service definition
Agent Configuration
IAM Role Configuration
Additional environment details (Operating System, Cloud provider, etc):
Steps to reproduce the issue:
Describe the results you received:
aws.region
Describe the results you expected:
Additional information you deem important (e.g. issue happens only occasionally):
agent status showed postgres (15.1.1), but our RDS instance is on 14.7. Is this something that we can/should set? Or is this backwards compatible?