DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/

CI Visibility: tries to send telemetry and trace data to local agent even when agentless mode is enabled #7989

Closed ento closed 2 weeks ago

ento commented 10 months ago

Summary of problem

I'm trying out CI Visibility with a pytest test suite, and I'm seeing log messages like "failed to send, dropping 415 traces to intake at http://localhost:8126/v0.4/traces after 3 retries, 44 additional messages skipped".

Environment is GitHub Actions (hosted runner) with no datadog agent running.

Which version of dd-trace-py are you using?

Saw the error message with these versions: 1.20.14, 2.3.3, and 4577f757 (latest commit as of this writing).

Which version of pip are you using?

23.3.2

Which libraries and their versions are you using?

`pip freeze` against the virtualenv that I set up locally using the steps below:

```
$ pip freeze
attrs==23.1.0
bytecode==0.15.1
cattrs==23.2.3
coverage==7.3.4
ddsketch==2.0.4
ddtrace @ https://github.com/DataDog/dd-trace-py/archive/4577f757cc8ab750d1fdbfae1fb767dcbacf6f22.zip#sha256=9e07cc061830bce8ef9fdd8e05acdcae6652e073fe7e07cbd4ae83fb17eceaa9
Deprecated==1.2.14
envier==0.5.0
importlib-metadata==6.11.0
iniconfig==2.0.0
opentelemetry-api==1.22.0
packaging==23.2
pluggy==1.3.0
protobuf==4.25.1
pytest==7.4.3
pytest-cov==4.1.0
six==1.16.0
typing_extensions==4.9.0
wrapt==1.16.0
xmltodict==0.13.0
zipp==3.17.0
```

How can we reproduce your problem?

Set up a virtualenv with ddtrace and pytest installed

python3 -m venv env
. env/bin/activate
SETUPTOOLS_SCM_PRETEND_VERSION=2.4.0dev pip install https://github.com/DataDog/dd-trace-py/archive/4577f757cc8ab750d1fdbfae1fb767dcbacf6f22.zip
pip install pytest pytest-cov
echo 'def test(): pass' > test.py

Run a mock HTTP server (to be used as CI Visibility's intake host). This isn't necessary to reproduce, but it quiets unrelated errors.

docker run --rm -p 8080:8080 -t mendhak/http-https-echo:31

What is the result that you get?

With --ddtrace enabled:

DD_CIVISIBILITY_AGENTLESS_ENABLED=true DD_TRACE_DEBUG=1 DD_API_KEY=a DD_CIVISIBILITY_AGENTLESS_URL=http://localhost:8080  pytest test.py --ddtrace 2>&1 | grep 8126
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.

With --ddtrace-patch-all and --cov (to generate spans) also enabled:

$ DD_CIVISIBILITY_AGENTLESS_ENABLED=true DD_TRACE_DEBUG=1 DD_API_KEY=a DD_CIVISIBILITY_AGENTLESS_URL=http://localhost:8080  pytest test.py --ddtrace --ddtrace-patch-all --cov 2>&1 | grep 8126
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
creating new intake connection to http://localhost:8126 with timeout 2
creating new intake connection to http://localhost:8126 with timeout 2
creating new intake connection to http://localhost:8126 with timeout 2
failed to send, dropping 65 traces to intake at http://localhost:8126/v0.5/traces after 3 retries
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.

Requests get sent to the mock HTTP server that DD_CIVISIBILITY_AGENTLESS_URL points to, so I believe agentless mode gets enabled as intended with this setup.

::ffff:172.17.0.1 - - [22/Dec/2023:23:09:49 +0000] "POST /api/v2/citestcycle HTTP/1.1" 200 6860 "-" "-"

What is the result that you expected?

No failure trying to send telemetry or traces.

When I view test runs in Datadog, I only see top-level spans for each test case and no spans for database access, etc., which may be because ddtrace isn't able to send traces.

romainkomorndatadog commented 9 months ago

@ento , thanks for the great bug report.

The fact that telemetry doesn't properly work with our agentless mode enabled is a known issue (and usually hidden by the fact that it's a debug-level log). Fixing that is a todo item for this coming quarter.

The fact that traces don't work when agentless mode is combined with --ddtrace-patch-all, on the other hand, wasn't something I was aware of, and that definitely also needs a fix.

wearebasti commented 9 months ago

Hi @romainkomorndatadog !

Can you share a workaround to silence those debug logs? It's getting noisy enough on our test runs that we're close to deactivating ddtrace for them, which is not what I'd like to do for the sake of observability.

romainkomorndatadog commented 9 months ago

@wearebasti , are you asking specifically about the log entries stating

failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.

If so, DD_INSTRUMENTATION_TELEMETRY_ENABLED=false should stop those (although it won't stop the similar ones about http://localhost:8126/v0.5/traces).

That said, those messages should only appear if you have DD_TRACE_DEBUG=1, and that would generate a lot of other unrelated debug messages that would be a lot more noisy, so it feels like we're talking about different things.

wearebasti commented 9 months ago

@romainkomorndatadog Thanks for picking this up, I was asking about the messages in the format

failed to send, dropping 17 traces to intake at http://localhost:8126/v0.5/traces after 3 retries, 10 additional messages skipped

Our GitHub Actions do not use DD_TRACE_DEBUG, and our test call is a simple pytest --ddtrace -n auto --junitxml=test-report.xml. Explicitly setting DD_TRACE_DEBUG=false does not change this behaviour either.

Sorry if I'm hijacking the wrong issue here -- I thought this was the same root cause. Do you have any ideas?

romainkomorndatadog commented 9 months ago

@wearebasti , you're in the right issue (not hijacking it).

We do end up logging an error when we can't send traces (so it would indeed show up even with DD_TRACE_DEBUG=0).

A few questions:

I wasn't able to reproduce the logging myself until I used --log-cli-level error as an arg to pytest (which I rarely ever use in testing) and ran with neither an agent nor agentless mode enabled.

As far as making the error log entries go away, depending on your situation:

wearebasti commented 9 months ago

@wearebasti , you're in the right issue (not hijacking it).

@romainkomorndatadog A relief! :)

  • Which versions of python, ddtrace, and pytest are you running?

Python: 3.12 (using actions/setup-python@v5 in GitHub Actions)
ddtrace: 2.4.0
pytest: 7.4.0

  • Do the lines appear in file logs, or in CLI logs (or both)?

They appear in the GitHub Actions output, which appears to be some kind of "stdout" or "stderr" output.

  • Do you have a pytest.ini config, PYTEST_ADDOPTS env var, or something else that might be changing the CLI or file log levels?

We use pyproject.toml for settings, but only have the following

[tool.pytest.ini_options]
DJANGO_SETTINGS_MODULE = "foo.settings_test"
timeout = "300"
python_files = [
    "tests.py",
    "test_*.py",
    "*_tests.py",
]
addopts = [
    "--reuse-db",
    "--condensed-warnings",
    "--strict-markers",
    "--doctest-modules",
    "--browser", "chromium",
]

markers = [
    "slow: marks tests as slow",
    ....and a couple more
]
  • Are you using the agentless mode for our plugin (i.e. using DD_CIVISIBILITY_AGENTLESS_ENABLED=true)? Yes, the settings are the following:
    DD_CIVISIBILITY_AGENTLESS_ENABLED: true
    DD_CIVISIBILITY_ITR_ENABLED: true
    DD_ENV: ci
    DD_API_KEY: ***
    DD_SITE: datadoghq.eu

DISCLAIMER: We just updated to ddtrace 2.5.1 today and now explicitly set DD_CIVISIBILITY_LOG_LEVEL: "error"; so far the "failed to send traces" messages have not shown up. What does seem to be new is that, in between some deprecation warnings (from pydantic) printed after the pytest run, we get the following 4 lines:

{"event": "logger setup complete"}
{"event": "Datadog CI Visibility using agentless mode"}
{"event": "Intelligent Test Runner test skipping disabled by API"}
{"event": "CODEOWNERS file is not available"}
romainkomorndatadog commented 9 months ago

@wearebasti , we don't use GitHub actions for the jobs that run our tracer's tests with the ddtrace plugin enabled. Maybe the logger setup ends up being a bit different in that environment than in my local shell (but that's still TBD).

The first two new messages you saw (logger setup and CI Visibility agentless mode) come from #7955 (and #7963, which will be in 2.6.0, introduces some more logging on top of this). I thought (and still think) they're an improvement, because they actually show what our plugin's doing within pytest. The other two should have been around for some time. Something's a little odd if you're seeing those when you set the log level to error, since those messages are info- or warning-level messages.

That said, the discussion about the logging changes from those PRs probably does belong in a separate issue, so that we can leave this one to the actual problem of certain traces failing to be properly sent in agentless mode. :)

skuam commented 8 months ago

Any update on this? Just ran into the same issue using tox and pytest.

romainkomorndatadog commented 8 months ago

@skuam , no updates, no. As far as I understand, unless someone specifically wants trace data in agentless mode (which we don't support at the moment), this is a cosmetic issue rather than a functional one, so I haven't given it much priority compared to other work I've got on deck.

Are you concerned with the logging, or with the lack of lower-level trace data in agentless mode?

ento commented 8 months ago

unless someone specifically wants trace data in agentless mode (which we don't support at the moment)

Huh, I'd assumed trace data was supposed to be available in agentless mode - one motivation for trying CI Visibility was to help find bottlenecks / areas for optimization in the test suite. (I know there's a GHA action for launching an agent; I haven't gotten around to switching to it.)

If trace data in agentless mode won't be supported in the foreseeable future, it'd be nice to have that stated in the test visibility setup doc - unless it's there and I just missed it: https://docs.datadoghq.com/tests/setup/python?tab=cloudciprovideragentless

this is a cosmetic issue

One actual change this prompted me to make to our test suite was to fix tests that assert certain logs were written so that they only look at logs from the logger the test is interested in. Otherwise, pytest's caplog fixture would pick up logs from ddtrace and cause flaky failures. ddtrace is not the only library that could emit logs in the background, so I think this was the right change regardless of ddtrace's behavior, but thought I'd share.
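
For reference, the change looked roughly like this (a simplified sketch, not our exact test code; the logger name `myapp.payments` is just a placeholder):

```python
import logging

log = logging.getLogger("myapp.payments")

def process_payment():
    log.info("payment processed")

def test_emits_payment_log(caplog):
    with caplog.at_level(logging.INFO):
        process_payment()

    # Only look at records from the logger this test cares about, so background
    # logs from ddtrace (or any other library) can't flip the assertion.
    records = [r for r in caplog.records if r.name == "myapp.payments"]
    assert any("payment processed" in r.getMessage() for r in records)
```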

romainkomorndatadog commented 8 months ago

@ento I'll look at whether we can slot supporting trace data into our schedule somewhere. It likely won't happen for the next month, though. I'll add a note to our docs. Thanks for raising that.

As far as logging goes, I think it would make sense to get ddtrace's logs excluded by default (via some handler magic), although I think that presents a different issue for people who are using ddtrace in their code and tests (rather than just the pytest integration).

The line between "ddtrace internals" and integrations / "external" code is a bit murky, but there's probably a "pretty good" default behavior to find.
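
As a rough illustration of what that "handler magic" might look like (hypothetical sketch, not an existing ddtrace feature), a stdlib logging filter keyed on the logger name could hide ddtrace-internal records from a user's handlers:

```python
import logging

class ExcludeDdtraceLogs(logging.Filter):
    """Drop records emitted by ddtrace's own logger hierarchy."""

    def filter(self, record: logging.LogRecord) -> bool:
        return not record.name.startswith("ddtrace")

# Attach to whichever handlers should stay quiet about ddtrace internals.
for handler in logging.getLogger().handlers:
    handler.addFilter(ExcludeDdtraceLogs())
```

The hard part is exactly where the name-based cut between "internals" and integration logs users actually want should sit.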

hdadkhah1 commented 7 months ago

+1 on supporting traces in agentless mode; we definitely need this feature to debug our slow tests and get a holistic view of bottlenecks! @romainkomorndatadog do you think we can prioritize this to be done sooner?

romainkomorndatadog commented 6 months ago

More or less repeating https://github.com/DataDog/dd-trace-py/issues/8710#issuecomment-2063335333 on this issue as well: any help validating the fix in #9016 by running tests using ddtrace installed with

pip install "ddtrace @ git+https://github.com/DataDog/dd-trace-py.git@79366300167b0a00886ee60b8db5839df23ef864"

would be appreciated.

ento commented 5 months ago

Sorry it took a while to check it out.

:white_check_mark: When I tried it against a minimal test that makes an HTTP request, a corresponding span for the request showed up in CI Visibility.

Test setup

test.py:

```py
import requests

def test():
    requests.get("http://example.com")
```

Command:

```sh
DD_CIVISIBILITY_AGENTLESS_ENABLED=true pytest test.py --ddtrace --ddtrace-patch-all --cov
```

I switched to using real API credentials since `DD_API_KEY=a DD_CIVISIBILITY_AGENTLESS_URL=http://localhost:8080` resulted in an authentication error and CI Visibility got disabled.

> [Datadog CI Visibility] WARNING ddtrace.internal.ci_visibility.recorder:recorder.py:595 Authentication error, disabling CI Visibility, please check Datadog API key

:x: When I updated ddtrace in my project to the revision and ran the test suite in GitHub Actions, progress stalled at the 99% mark and it eventually hit the timeout set for the workflow after 25 minutes.

  - Installing ddtrace (2.9.0.dev85+g7936630016 7936630)
...
============================= test session starts ==============================
...
Error: The action 'Run command' has timed out after 40 minutes.

On a separate run with DD_TRACE_DEBUG=1 enabled, I see a few logs that reference a local HTTP address interspersed between test execution logs:

2024-05-10T03:47:50.2072656Z creating new intake connection to http://localhost:8126 with timeout 2

Then at the end, there are lots of ValueErrors:

ValueError: I/O operation on closed file.

This continued for about 29 minutes before hitting the workflow timeout.

$ grep Traceback gha.log  | wc -l
5266
$ grep ValueError gha.log  | wc -l
5266
$ grep 'operation on closed' gha.log  | wc -l
5266
$ grep Traceback gha.log  | head -n1
2024-05-10T03:51:54.3355738Z Traceback (most recent call last):
$ grep Traceback gha.log  | tail -n1
2024-05-10T04:20:41.1918860Z Traceback (most recent call last):
The first few tracebacks

```
2024-05-10T03:51:54.3349534Z sent 904 in 0.31657s to https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry. response: 202
2024-05-10T03:51:54.3350540Z --- Logging error ---
2024-05-10T03:51:54.3355738Z Traceback (most recent call last):
2024-05-10T03:51:54.3367295Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.3368023Z stream.write(msg + self.terminator)
2024-05-10T03:51:54.3368447Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.3368789Z Call stack:
2024-05-10T03:51:54.3369368Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.3370004Z self._bootstrap_inner()
2024-05-10T03:51:54.3370973Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.3371621Z self.run()
2024-05-10T03:51:54.3372584Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.3373436Z self._target()
2024-05-10T03:51:54.3374426Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 759, in periodic
2024-05-10T03:51:54.3375550Z self._client.send_event(telemetry_event)
2024-05-10T03:51:54.3376627Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 180, in send_event
2024-05-10T03:51:54.3377892Z log.debug("sent %d in %.5fs to %s. response: %s", len(rb_json), sw.elapsed(), self.url, resp.status)
2024-05-10T03:51:54.3379175Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.3379843Z self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.3380487Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.3381100Z self.handle(record)
2024-05-10T03:51:54.3382150Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.3383271Z super(DDLogger, self).handle(record)
2024-05-10T03:51:54.3383728Z Message: 'sent %d in %.5fs to %s. response: %s'
2024-05-10T03:51:54.3384689Z Arguments: (904, 0.3165722100000039, 'https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry', 202)
2024-05-10T03:51:54.4524441Z [PID 3558 | PPID 3501] Subscriber ASM is getting data
2024-05-10T03:51:54.4525524Z --- Logging error ---
2024-05-10T03:51:54.4526126Z Traceback (most recent call last):
2024-05-10T03:51:54.4527550Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.4528462Z stream.write(msg + self.terminator)
2024-05-10T03:51:54.4528949Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.4529380Z Call stack:
2024-05-10T03:51:54.4530634Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.4531938Z self._bootstrap_inner()
2024-05-10T03:51:54.4533191Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.4533943Z self.run()
2024-05-10T03:51:54.4535372Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.4536498Z self._target()
2024-05-10T03:51:54.4538357Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/remoteconfig/_subscribers.py", line 45, in periodic
2024-05-10T03:51:54.4539515Z log.debug("[PID %d | PPID %d] %s is getting data", os.getpid(), os.getppid(), self)
2024-05-10T03:51:54.4540365Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.4540995Z self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.4541627Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.4542232Z self.handle(record)
2024-05-10T03:51:54.4543148Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.4543995Z super(DDLogger, self).handle(record)
2024-05-10T03:51:54.4544427Z Message: '[PID %d | PPID %d] %s is getting data'
2024-05-10T03:51:54.4545468Z Arguments: (3558, 3501, RemoteConfigSubscriber(status=, _interval=2.5))
2024-05-10T03:51:54.4546157Z [PID 3558 | PPID 3501] Subscriber ASM got data
2024-05-10T03:51:54.4546550Z --- Logging error ---
2024-05-10T03:51:54.4546830Z Traceback (most recent call last):
2024-05-10T03:51:54.4547469Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.4548097Z stream.write(msg + self.terminator)
2024-05-10T03:51:54.4548459Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.4548794Z Call stack:
2024-05-10T03:51:54.4549507Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.4550110Z self._bootstrap_inner()
2024-05-10T03:51:54.4550716Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.4551326Z self.run()
2024-05-10T03:51:54.4552206Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.4553029Z self._target()
2024-05-10T03:51:54.4554002Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/remoteconfig/_subscribers.py", line 47, in periodic
2024-05-10T03:51:54.4555093Z log.debug("[PID %d | PPID %d] %s got data", os.getpid(), os.getppid(), self)
2024-05-10T03:51:54.4555877Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.4556514Z self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.4557143Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.4557739Z self.handle(record)
2024-05-10T03:51:54.4558636Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.4559478Z super(DDLogger, self).handle(record)
2024-05-10T03:51:54.4559879Z Message: '[PID %d | PPID %d] %s got data'
2024-05-10T03:51:54.4560793Z Arguments: (3558, 3501, RemoteConfigSubscriber(status=, _interval=2.5))
2024-05-10T03:51:54.6798280Z sent 727 in 0.34234s to https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry. response: 202
2024-05-10T03:51:54.6799658Z --- Logging error ---
2024-05-10T03:51:54.6800151Z Traceback (most recent call last):
2024-05-10T03:51:54.6801580Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.6802777Z stream.write(msg + self.terminator)
2024-05-10T03:51:54.6803444Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.6804041Z Call stack:
2024-05-10T03:51:54.6805031Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.6806140Z self._bootstrap_inner()
2024-05-10T03:51:54.6807269Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.6808422Z self.run()
2024-05-10T03:51:54.6810057Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.6811555Z self._target()
2024-05-10T03:51:54.6813471Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 759, in periodic
2024-05-10T03:51:54.6815199Z self._client.send_event(telemetry_event)
2024-05-10T03:51:54.6817143Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 180, in send_event
2024-05-10T03:51:54.6819226Z log.debug("sent %d in %.5fs to %s. response: %s", len(rb_json), sw.elapsed(), self.url, resp.status)
2024-05-10T03:51:54.6821258Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.6822429Z self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.6823573Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.6824688Z self.handle(record)
2024-05-10T03:51:54.6826343Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.6828199Z super(DDLogger, self).handle(record)
2024-05-10T03:51:54.6828980Z Message: 'sent %d in %.5fs to %s. response: %s'
2024-05-10T03:51:54.6830454Z Arguments: (727, 0.342344212999933, 'https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry', 202)
2024-05-10T03:51:55.6792971Z Agent is down or Remote Config is not enabled in the Agent
2024-05-10T03:51:55.6794098Z Check your Agent version, you need an Agent running on 7.39.1 version or above.
2024-05-10T03:51:55.6795225Z Check Your Remote Config environment variables on your Agent:
2024-05-10T03:51:55.6796021Z DD_REMOTE_CONFIGURATION_ENABLED=true
2024-05-10T03:51:55.6796903Z See: https://docs.datadoghq.com/agent/guide/how_remote_config_works/
2024-05-10T03:51:55.6797997Z --- Logging error ---
2024-05-10T03:51:55.6798525Z Traceback (most recent call last):
2024-05-10T03:51:55.6799722Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:55.6800923Z stream.write(msg + self.terminator)
2024-05-10T03:51:55.6801338Z ValueError: I/O operation on closed file.
2024-05-10T03:51:55.6801681Z Call stack:
2024-05-10T03:51:55.6802251Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:55.6802887Z self._bootstrap_inner()
2024-05-10T03:51:55.6803552Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:55.6804187Z self.run()
2024-05-10T03:51:55.6805146Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:55.6806087Z self._target()
2024-05-10T03:51:55.6807072Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/remoteconfig/worker.py", line 74, in periodic
2024-05-10T03:51:55.6807988Z return self._state()
2024-05-10T03:51:55.6809000Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/remoteconfig/worker.py", line 53, in _agent_check
2024-05-10T03:51:55.6809919Z log.debug(
2024-05-10T03:51:55.6810471Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:55.6811102Z self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:55.6811757Z File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:55.6812377Z self.handle(record)
2024-05-10T03:51:55.6813274Z File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:55.6814129Z super(DDLogger, self).handle(record)
2024-05-10T03:51:55.6815843Z Message: 'Agent is down or Remote Config is not enabled in the Agent\nCheck your Agent version, you need an Agent running on 7.39.1 version or above.\nCheck Your Remote Config environment variables on your Agent:\nDD_REMOTE_CONFIGURATION_ENABLED=true\nSee: https://docs.datadoghq.com/agent/guide/how_remote_config_works/'
2024-05-10T03:51:55.6817489Z Arguments: ()
```

Not sure if this is a separate issue. (Those ValueErrors could be a side-effect of enabling DD_TRACE_DEBUG and not the direct cause of the stall.)

romainkomorndatadog commented 5 months ago

@ento , thanks for the info. Just to clarify, your simple single-file test.py does not hang, only your larger project, correct?

I'm going to try and set up a GitHub Action to test this because I can't repro locally (and it doesn't repro in our CircleCI project for the various dd-trace-py test suites that use our plugin).

If you have time to test it, DD_REMOTE_CONFIGURATION_ENABLED=false should disable the remote config part that tries to talk to our agent (I guess that's something I should also disable or make work without an agent), but I don't see why that should cause a hang (e.g., in my local testing without an agent, I see the messages that the agent is down or RC is not enabled in the agent, but I don't see the hang behavior).

I'd ask you to try comparing with not using --ddtrace-patch-all, but I'm not sure how useful that would be in actually narrowing down the issue. Maybe a more interesting test would be to compare against ddtrace~=2.8.0 with --ddtrace-patch-all.

The ValueError looks similar to another issue I've got open internally which I noticed on some of our GitlabCI jobs. It looks like something is closing stderr/stdout and the log handler we set up when pytest starts ends up trying to write to a closed file. I do think that's a side effect rather than a cause of any hangs, though. Since you're using DD_TRACE_DEBUG=1, our logger takeover function should be short-circuiting and not taking over the handler, so that means that particular issue isn't limited to the CI Visibility log handler (which is what I originally thought).
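
For what it's worth, the generic failure mode is easy to reproduce with stdlib logging alone (a minimal sketch, unrelated to ddtrace's actual handler setup):

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
logger = logging.getLogger("demo")
logger.addHandler(handler)

# Something closes the stream out from under the handler...
stream.close()

# ...and the next emit() hits "ValueError: I/O operation on closed file",
# which logging reports as a "--- Logging error ---" block on stderr.
logger.warning("hello")
```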

Side note for curiosity: in debug mode, the creating new intake connection to http://localhost:8126 messages come from the debug mode collector, which spins up before our pytest plugin has a chance to reconfigure the writer. That tells me maybe we do need to just move the agentless behavior into the default tracer.

ento commented 5 months ago

Just to clarify, your simple single-file test.py does not hang, only your larger project, correct?

That was the case when I wrote the comment, but I've managed to make (a variation of) the single-file test case hang.

First, install a specific version of freezegun:

pip install -U freezegun==1.2.2

Test file (a distilled version of the test suite that I ran in GitHub Actions)

import freezegun
from ddtrace import tracer

@tracer.wrap()
def hello():
    return "hello"

@freezegun.freeze_time("2024-01-01")
def test():
    assert hello() == "hello"

Hangs:

pytest test.py -p ddtrace
pytest test.py -p no:ddtrace

Doesn't hang:

DD_CIVISIBILITY_AGENTLESS_ENABLED=true pytest test.py -p ddtrace --ddtrace
DD_REMOTE_CONFIGURATION_ENABLED=false DD_CIVISIBILITY_AGENTLESS_ENABLED=true pytest test.py -p ddtrace

Adding the --ddtrace flag, which I forgot to add when I enabled CI Visibility, and/or disabling remote configuration fixed my GitHub Actions workflow as well.

Upgrading freezegun also makes the issue go away

pip install -U freezegun==1.3.0

The place it was hanging was: https://github.com/DataDog/dd-trace-py/blob/79366300167b0a00886ee60b8db5839df23ef864/ddtrace/internal/remoteconfig/client.py#L217 which calls: https://github.com/DataDog/dd-trace-py/blob/79366300167b0a00886ee60b8db5839df23ef864/ddtrace/settings/config.py#L586

Because of the time freeze, time.monotonic() wasn't increasing and the internal queue was waiting forever here: https://github.com/python/cpython/blob/v3.11.6/Lib/multiprocessing/connection.py#L952
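
To illustrate the mechanism (a toy sketch of the deadline-loop pattern, not the actual CPython or ddtrace code): a wait that computes its remaining time from time.monotonic() never makes progress if the clock is frozen.

```python
# "clock" stands in for time.monotonic(), which freezegun 1.2.2 kept frozen.
def wait_with_timeout(clock, timeout, ready, max_iterations=5):
    deadline = clock() + timeout
    iterations = 0
    while clock() < deadline:   # with a frozen clock this stays True forever
        if ready():
            return True
        iterations += 1
        if iterations >= max_iterations:
            break               # cut the demo short instead of spinning forever
    return False

frozen_now = 1000.0
print(wait_with_timeout(lambda: frozen_now, timeout=2.0, ready=lambda: False))
# -> False only because of the demo's iteration cap; the real wait blocks indefinitely.
```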

github-actions[bot] commented 2 weeks ago

This issue has been automatically closed after a period of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.