@ento , thanks for the great bug report.
The fact that telemetry doesn't properly work with our agentless mode enabled is a known issue (and usually hidden by the fact that it's a debug-level log). Fixing that is a todo item for this coming quarter.
The fact that traces don't work when agentless mode is combined with `--ddtrace-patch-all`, on the other hand, wasn't something I was aware of, and that definitely also needs a fix.
Hi @romainkomorndatadog !
Can you share a workaround to silence those debug logs? It's getting rather noisy on our test runs, to the point of deactivating ddtrace for those runs, which is not what I'd like to do for the sake of observability.
@wearebasti , are you asking specifically about the log entries stating `failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry`?
`DD_INSTRUMENTATION_TELEMETRY_ENABLED=false` should stop those (although it won't stop the similar ones about `http://localhost:8126/v0.5/traces`).
That said, those messages should only appear if you have `DD_TRACE_DEBUG=1`, and that would generate a lot of other unrelated debug messages that would be much noisier, so it feels like we're talking about different things.
@romainkomorndatadog Thanks for picking this up. I was asking about messages in the format `failed to send, dropping 17 traces to intake at http://localhost:8126/v0.5/traces after 3 retries, 10 additional messages skipped`.
Our GitHub Actions do not use `DD_TRACE_DEBUG`, and our test call is a simple `pytest --ddtrace -n auto --junitxml=test-report.xml`; explicitly setting `DD_TRACE_DEBUG=false` also does not change this behaviour.
Sorry if I'm hijacking the wrong issue here; I thought this had the same root cause -- do you have any ideas?
@wearebasti , you're in the right issue (not hijacking it).
We do end up logging an error when we can't send traces (so it would indeed show up even with `DD_TRACE_DEBUG=0`).
A few questions:

- Which versions of python, ddtrace, and pytest are you running?
- Do the lines appear in file logs, or in CLI logs (or both)?
- Do you have a `pytest.ini` config, `PYTEST_ADDOPTS` env var, or something else that might be changing the CLI or file log levels?
- Are you using the agentless mode for our plugin (ie using `DD_CIVISIBILITY_AGENTLESS_ENABLED=true`)?

I wasn't able to reproduce the logging myself until I used `--log-cli-level error` as an arg to `pytest` (which I rarely ever use in testing), and ran with neither an agent nor agentless mode enabled.
As far as making those error log entries go away, it depends on your situation.
> @wearebasti , you're in the right issue (not hijacking it).
@romainkomorndatadog A relief! :)
- Which versions of python, ddtrace, and pytest are you running?
Python: 3.12 (using `actions/setup-python@v5` in GitHub Actions)
ddtrace: 2.4.0
pytest: 7.4.0
- Do the lines appear in file logs, or in CLI logs (or both)?
They appear in the GitHub Actions output, which appears to be some kind of "stdout" or "stderr" output.
- Do you have a `pytest.ini` config, `PYTEST_ADDOPTS` env var, or something else that might be changing the CLI or file log levels?
We use `pyproject.toml` for settings, but only have the following:
```toml
[tool.pytest.ini_options]
DJANGO_SETTINGS_MODULE = "foo.settings_test"
timeout = "300"
python_files = [
    "tests.py",
    "test_*.py",
    "*_tests.py",
]
addopts = [
    "--reuse-db",
    "--condensed-warnings",
    "--strict-markers",
    "--doctest-modules",
    "--browser", "chromium",
]
markers = [
    "slow: marks tests as slow",
    # ...and a couple more
]
```
- Are you using the agentless mode for our plugin (ie using `DD_CIVISIBILITY_AGENTLESS_ENABLED=true`)?

Yes, the settings are the following:
```yaml
DD_CIVISIBILITY_AGENTLESS_ENABLED: true
DD_CIVISIBILITY_ITR_ENABLED: true
DD_ENV: ci
DD_API_KEY: ***
DD_SITE: datadoghq.eu
```
DISCLAIMER:
We've just today updated to ddtrace 2.5.1 and now explicitly set `DD_CIVISIBILITY_LOG_LEVEL: "error"`; so far the "failed to send traces" messages have not shown up. But what seems to be new is that, in between some deprecation warnings (from pydantic) printed after the pytest run, we get the following 4 lines:
{"event": "logger setup complete"}
{"event": "Datadog CI Visibility using agentless mode"}
{"event": "Intelligent Test Runner test skipping disabled by API"}
{"event": "CODEOWNERS file is not available"}
@wearebasti , we don't use GitHub Actions for the jobs that run our tracer's tests with the `ddtrace` plugin enabled. Maybe the logger setup ends up being a bit different in that environment than in my local shell (but that's still TBD).
The first two new messages you saw (logger setup and CI Visibility) come from #7955 (and #7963, which will be in 2.6.0, introduces some more logging on top of that). I thought (and still think) they're an improvement, because they actually show what our plugin is doing within `pytest`. The other two should have been around for some time. Something's a little odd if you're seeing those with the log level set to error, since they are info- or warning-level messages.
That said, the discussion about the logging changes from those PRs probably does belong in a separate issue, so that we can keep this one focused on the actual problem of certain traces failing to be sent properly in agentless mode. :)
Any update on this? Just ran into the same issue using tox and pytest.
@skuam , no updates, no. As far as I understand, unless someone specifically wants trace data in agentless mode (which we don't support at the moment), this is a cosmetic issue rather than a functional one, so I haven't given it much priority compared to other work I've got on deck.
Are you concerned with the logging, or with the lack of lower-level trace data in agentless mode?
> unless someone specifically wants trace data in agentless mode (which we don't support at the moment)
Huh, I'd assumed trace data was supposed to be available in agentless mode - one motivation for trying CI Visibility was to help find bottlenecks / areas for optimization in the test suite. (I know there's a GHA action for launching an agent; I just haven't gotten around to switching to it.)
If trace data in agentless mode won't be supported in the foreseeable future, it'd be nice to have that stated in the test visibility setup doc - unless it's there and I just missed it: https://docs.datadoghq.com/tests/setup/python?tab=cloudciprovideragentless
> this is a cosmetic issue
One actual change this prompted me to make to our test suite was to fix tests that assert certain logs were written so that they only look at logs coming from the logger the test is interested in. Otherwise, pytest's `caplog` fixture would pick up logs from ddtrace and cause flaky failures. ddtrace is not the only library that could emit logs in the background, so I think this was the right change regardless of ddtrace's behavior, but I thought I'd share.
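In case it helps anyone else, a minimal sketch of that kind of filtering (the `myapp` logger name and `do_work` function are hypothetical, just for illustration):

```python
import logging

logger = logging.getLogger("myapp")  # hypothetical application logger


def do_work():
    logger.warning("something happened")


def test_logs_warning(caplog):
    with caplog.at_level(logging.WARNING):
        do_work()
    # Only consider records emitted by the logger under test, so background
    # libraries (ddtrace or otherwise) can't make the assertion flaky.
    records = [r for r in caplog.records if r.name == "myapp"]
    assert any("something happened" in r.getMessage() for r in records)
```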
@ento I'll look at whether we can slot supporting trace data into our schedule somewhere. It likely won't happen for the next month, though. I'll add a note to our docs. Thanks for raising that.
As far as logging goes, I think it would make sense to get `ddtrace`'s logs excluded by default (via some handler magic), although I think that presents a different issue for people who are using `ddtrace` in their code and tests (rather than just the `pytest` integration).
The line between "`ddtrace` internals" and integrations / "external" code is a bit murky, but there's probably a "pretty good" default behavior to find.
+1 on supporting traces in agentless mode, we definitely need this feature to debug our slow tests and get a holistic view of bottlenecks! @romainkomorndatadog do you think we can prioritize this to be done sooner?
More or less repeating https://github.com/DataDog/dd-trace-py/issues/8710#issuecomment-2063335333 on this issue as well: any help validating the fix in #9016 by running tests using `ddtrace` installed with

```sh
pip install "ddtrace @ git+https://github.com/DataDog/dd-trace-py.git@79366300167b0a00886ee60b8db5839df23ef864"
```

would be appreciated.
Sorry it took a while to check it out.
:white_check_mark: When I tried it against a minimal test that makes an HTTP request, a corresponding span for the request showed up in CI Visibility.
test.py:

```py
import requests


def test():
    requests.get("http://example.com")
```

Command:

```sh
DD_CIVISIBILITY_AGENTLESS_ENABLED=true pytest test.py --ddtrace --ddtrace-patch-all --cov
```

I switched to using real API credentials since `DD_API_KEY=a DD_CIVISIBILITY_AGENTLESS_URL=http://localhost:8080` resulted in an authentication error and CI Visibility got disabled.

> [Datadog CI Visibility] WARNING ddtrace.internal.ci_visibility.recorder:recorder.py:595 Authentication error, disabling CI Visibility, please check Datadog API key
:x: When I updated ddtrace in my project to the revision and ran the test suite in GitHub Actions, progress stalled at the 99% mark and it eventually hit the timeout set for the workflow after 25 minutes.
```
- Installing ddtrace (2.9.0.dev85+g7936630016 7936630)
...
============================= test session starts ==============================
...
Error: The action 'Run command' has timed out after 40 minutes.
```
On a separate run with `DD_TRACE_DEBUG=1` enabled, I see a few logs that reference a local HTTP address interspersed between test execution logs:

```
2024-05-10T03:47:50.2072656Z creating new intake connection to http://localhost:8126 with timeout 2
```
Then at the end, there are lots of `ValueError`s:

```
ValueError: I/O operation on closed file.
```
This continued for about 29 minutes before hitting the workflow timeout.
```sh
$ grep Traceback gha.log | wc -l
5266
$ grep ValueError gha.log | wc -l
5266
$ grep 'operation on closed' gha.log | wc -l
5266
$ grep Traceback gha.log | head -n1
2024-05-10T03:51:54.3355738Z Traceback (most recent call last):
$ grep Traceback gha.log | tail -n1
2024-05-10T04:20:41.1918860Z Traceback (most recent call last):
```
```
2024-05-10T03:51:54.3349534Z sent 904 in 0.31657s to https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry. response: 202
2024-05-10T03:51:54.3350540Z --- Logging error ---
2024-05-10T03:51:54.3355738Z Traceback (most recent call last):
2024-05-10T03:51:54.3367295Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.3368023Z stream.write(msg + self.terminator)
2024-05-10T03:51:54.3368447Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.3368789Z Call stack:
2024-05-10T03:51:54.3369368Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.3370004Z self._bootstrap_inner()
2024-05-10T03:51:54.3370973Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.3371621Z self.run()
2024-05-10T03:51:54.3372584Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.3373436Z self._target()
2024-05-10T03:51:54.3374426Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 759, in periodic
2024-05-10T03:51:54.3375550Z self._client.send_event(telemetry_event)
2024-05-10T03:51:54.3376627Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 180, in send_event
2024-05-10T03:51:54.3377892Z log.debug("sent %d in %.5fs to %s. response: %s", len(rb_json), sw.elapsed(), self.url, resp.status)
2024-05-10T03:51:54.3379175Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.3379843Z self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.3380487Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.3381100Z self.handle(record)
2024-05-10T03:51:54.3382150Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.3383271Z super(DDLogger, self).handle(record)
2024-05-10T03:51:54.3383728Z Message: 'sent %d in %.5fs to %s. response: %s'
2024-05-10T03:51:54.3384689Z Arguments: (904, 0.3165722100000039, 'https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry', 202)
2024-05-10T03:51:54.4524441Z [PID 3558 | PPID 3501] Subscriber ASM is getting data
2024-05-10T03:51:54.4525524Z --- Logging error ---
2024-05-10T03:51:54.4526126Z Traceback (most recent call last):
2024-05-10T03:51:54.4527550Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.4528462Z stream.write(msg + self.terminator)
2024-05-10T03:51:54.4528949Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.4529380Z Call stack:
2024-05-10T03:51:54.4530634Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.4531938Z self._bootstrap_inner()
2024-05-10T03:51:54.4533191Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.4533943Z self.run()
2024-05-10T03:51:54.4535372Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.4536498Z self._target()
2024-05-10T03:51:54.4538357Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/remoteconfig/_subscribers.py", line 45, in periodic
2024-05-10T03:51:54.4539515Z log.debug("[PID %d | PPID %d] %s is getting data", os.getpid(), os.getppid(), self)
2024-05-10T03:51:54.4540365Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.4540995Z self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.4541627Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.4542232Z self.handle(record)
2024-05-10T03:51:54.4543148Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.4543995Z super(DDLogger, self).handle(record)
2024-05-10T03:51:54.4544427Z Message: '[PID %d | PPID %d] %s is getting data'
2024-05-10T03:51:54.4545468Z Arguments: (3558, 3501, RemoteConfigSubscriber(status=
```
Not sure if this is a separate issue. (Those `ValueError`s could be a side-effect of enabling `DD_TRACE_DEBUG` and not the direct cause of the stall.)
@ento , thanks for the info. Just to clarify, your simple single-file `test.py` test does not hang, only your larger project, correct?
I'm going to try and set up a GitHub Action to test this because I can't repro locally (and it doesn't repro in our CircleCI project for the various `dd-trace-py` test suites that use our plugin).
If you have time to test it, `DD_REMOTE_CONFIGURATION_ENABLED=false` should disable the remote config part that tries to talk to our agent (I guess that's something I should also disable or make work without an agent), but I don't see why that should cause a hang (eg: in my local testing without an agent, I see the messages that the agent is down or that RC is not enabled in the agent, but I don't see a hang).
I'd ask you to try comparing with not using `--ddtrace-patch-all`, but I'm not sure how useful that would be in actually narrowing down the issue. Maybe a more interesting test would be to compare against `ddtrace~=2.8.0` with `--ddtrace-patch-all`.
The `ValueError` looks similar to another issue I've got open internally, which I noticed on some of our GitLab CI jobs. It looks like something is closing stderr/stdout, and the log handler we set up when pytest starts ends up trying to write to a closed file. I do think that's a side effect rather than a cause of any hangs, though. Since you're using `DD_TRACE_DEBUG=1`, our logger takeover function should be short-circuiting and not taking over the handler, so that means that particular issue isn't limited to the CI Visibility log handler (which is what I originally thought).
Side note for curiosity: in debug mode, the `creating new intake connection to http://localhost:8126` messages come from the debug mode collector, which spins up before our `pytest` plugin has a chance to reconfigure the writer. That tells me maybe we do need to just move the agentless behavior into the default tracer.
> Just to clarify, your simple single-file `test.py` test does not hang, only your larger project, correct?
That was the case when I wrote the comment, but I've managed to make (a variation of) the single-file test case hang.
First, install a specific version of freezegun:

```sh
pip install -U freezegun==1.2.2
```
Test file (a distilled version of the test suite that I ran in GitHub Actions):

```py
import freezegun
from ddtrace import tracer


@tracer.wrap()
def hello():
    return "hello"


@freezegun.freeze_time("2024-01-01")
def test():
    assert hello() == "hello"
```
Hangs:

```sh
pytest test.py -p ddtrace
pytest test.py -p no:ddtrace
```

Doesn't hang:

```sh
DD_CIVISIBILITY_AGENTLESS_ENABLED=true pytest test.py -p ddtrace --ddtrace
DD_REMOTE_CONFIGURATION_ENABLED=false DD_CIVISIBILITY_AGENTLESS_ENABLED=true pytest test.py -p ddtrace
```
Adding the `--ddtrace` flag, which I forgot to add when I enabled CI Visibility, and/or disabling remote configuration fixed my GitHub Actions workflow as well.
Upgrading freezegun also makes the issue go away:

```sh
pip install -U freezegun==1.3.0
```
The place it was hanging was: https://github.com/DataDog/dd-trace-py/blob/79366300167b0a00886ee60b8db5839df23ef864/ddtrace/internal/remoteconfig/client.py#L217 which calls: https://github.com/DataDog/dd-trace-py/blob/79366300167b0a00886ee60b8db5839df23ef864/ddtrace/settings/config.py#L586
Because of the time freeze, `time.monotonic()` wasn't increasing and the internal queue was waiting forever here: https://github.com/python/cpython/blob/v3.11.6/Lib/multiprocessing/connection.py#L952
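A tiny sketch of the mechanism (illustrative only, not ddtrace's actual code), assuming a freezegun version that also freezes the monotonic clock, like the 1.2.2 above:

```python
import time

import freezegun

with freezegun.freeze_time("2024-01-01"):
    start = time.monotonic()
    time.sleep(0.1)                  # real time passes...
    print(time.monotonic() - start)  # ...but this prints 0.0: the monotonic clock
                                     # is frozen, so any wait that polls it against
                                     # a deadline never makes progress
```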
This issue has been automatically closed after a period of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.
Summary of problem
I'm trying out CI Visibility with a pytest test suite, and I'm seeing log messages like "failed to send, dropping 415 traces to intake at http://localhost:8126/v0.4/traces after 3 retries, 44 additional messages skipped".
Environment is GitHub Actions (hosted runner) with no datadog agent running.
Which version of dd-trace-py are you using?
Saw the error message with these versions: 1.20.14, 2.3.3, and 4577f757 (latest commit as of this writing).
Which version of pip are you using?
23.3.2
Which libraries and their versions are you using?
`pip freeze` against the virtualenv that I set up locally using the steps below:

```
$ pip freeze
attrs==23.1.0
bytecode==0.15.1
cattrs==23.2.3
coverage==7.3.4
ddsketch==2.0.4
ddtrace @ https://github.com/DataDog/dd-trace-py/archive/4577f757cc8ab750d1fdbfae1fb767dcbacf6f22.zip#sha256=9e07cc061830bce8ef9fdd8e05acdcae6652e073fe7e07cbd4ae83fb17eceaa9
Deprecated==1.2.14
envier==0.5.0
importlib-metadata==6.11.0
iniconfig==2.0.0
opentelemetry-api==1.22.0
packaging==23.2
pluggy==1.3.0
protobuf==4.25.1
pytest==7.4.3
pytest-cov==4.1.0
six==1.16.0
typing_extensions==4.9.0
wrapt==1.16.0
xmltodict==0.13.0
zipp==3.17.0
```

How can we reproduce your problem?
1. Set up a virtualenv with ddtrace and pytest installed.
2. Run a mock HTTP server to be used as CI Visibility's intake host (a minimal sketch of such a server is shown below). This isn't necessary to reproduce, but it quiets unrelated errors.
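A minimal sketch of the kind of mock intake server meant here; the port 8080 matches the `DD_CIVISIBILITY_AGENTLESS_URL=http://localhost:8080` mentioned elsewhere in this thread, and the handler just accepts everything and replies 200:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockIntake(BaseHTTPRequestHandler):
    """Accept any request and reply 200 so agentless submissions don't error out."""

    def _ok(self):
        length = int(self.headers.get("Content-Length", 0))
        if length:
            self.rfile.read(length)  # drain the request body
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b"{}")

    do_GET = do_POST = do_PUT = _ok

    def log_message(self, fmt, *args):
        pass  # keep the test output quiet


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), MockIntake).serve_forever()
```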
What is the result that you get?

With `--ddtrace` enabled:

With `--ddtrace-patch-all` and `--cov` (to generate spans) enabled in addition:

Requests get sent to the mock HTTP server that `DD_CIVISIBILITY_AGENTLESS_URL` points to, so I believe agentless mode gets enabled as intended with this setup.

What is the result that you expected?
No failure trying to send telemetry or traces.
When I view test runs in Datadog, I only see top-level spans for each test case and no spans for database access, etc., which may be because ddtrace isn't able to send traces.