DataDog / dd-trace-rb

Datadog Tracing Ruby Client
https://docs.datadoghq.com/tracing/

Intermittent deployment failures on ddtrace v1.12.0 - exit status 155 #2955

Closed: arich closed this issue 11 months ago

arich commented 1 year ago

Current behaviour

Hello! Since upgrading from ddtrace v1.10.1 to 1.11.x and 1.12.0, we've been having intermittent deployment failures due to our Rails web worker exiting with status 155.

We use Aptible for hosting. Here's the specific error from a failed deploy:

INFO -- : WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
ERROR -- : 1 of 6 HTTP health checks failed
ERROR -- : Container for web (running ["bundle", "exec", "ddtracerb", "exec", "rails", "server", "-b", "0.0.0.0"]) failed to start (exited with status 155 after 16 seconds).

Expected behaviour

Deploys should consistently succeed without intermittent issues. We've had to downgrade back to ddtrace 1.10.1 to stabilize our deploys.

Steps to reproduce

I'm not totally sure how to reproduce this reliably. We do not use the mysql2 or rugged gems. I can provide our full Gemfile.lock, but I'll start here and see if we can get somewhere without it. Have you heard similar reports from anyone else?

Our Aptible support led us to believe this was related to the ddtrace upgrade:

  • Datadog, in the form of a SIGPROF signal, is signaling your process to profile it
  • the process being signaled seems to have crashed or took too long to be profiled (aka something happened to it)
  • The error you're hitting is Profiling timer expired, which appears to come from the process exiting
  • That error also returns the 155 exit status which, based on this, corresponds to signal 155 - 128 = 27, i.e. SIGPROF.

The instances behind your successful and failed deployment attempts were all the same, so there is little on the infrastructure side to explain the behavior. It seems to be a timeout or intermittent crashing issue with the processes you're attempting to run inside the container.

There is also a GH issue about the rugged gem not properly handling UNIX signals, so maybe that's what you're hitting.
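
For reference, the exit-status arithmetic above can be double-checked from Ruby itself; a minimal sketch (illustrative, not part of the support reply):

    # Exit statuses above 128 conventionally mean "terminated by signal number (status - 128)".
    exit_status = 155
    signal_number = exit_status - 128     # => 27
    puts Signal.list.key(signal_number)   # => "PROF" on Linux, i.e. SIGPROF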

Thank you for your help!

Environment

ivoanjo commented 1 year ago

Hey Andrew :wave:! Thanks for the detailed bug report.

Given your mention of the issue showing up first on 1.11.0, which was the first release of dd-trace-rb where we started using signals by default, it definitely seems that's the cause.

  1. As it's not visible from the configuration block, let me just double-check: Can you confirm that you are using the profiler? (Because otherwise all that I'm going to write below doesn't make sense!)

  2. To support a few use cases where using signals isn't appropriate (such as the one you mentioned above with the rugged gem, plus a few others), dd-trace-rb 1.12 actually introduces a new setting that explicitly avoids using signals in the profiler.

    This mode can be toggled via an environment variable DD_PROFILING_NO_SIGNALS_WORKAROUND_ENABLED=true or via configuration block c.profiling.advanced.no_signals_workaround_enabled = true.

    (The downside of this mode is that data collection is coarser, mostly equivalent to what you were getting in v1.10.1.)

    Can you try out this mode and let me know if it unblocks your upgrade to the latest dd-trace-rb releases?
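
For illustration, a minimal sketch of the configuration-block form of point 2 (assuming an otherwise standard Datadog.configure initializer):

    Datadog.configure do |c|
      c.profiling.enabled = true
      # Avoid signal-based sampling entirely (coarser data, as noted above):
      c.profiling.advanced.no_signals_workaround_enabled = true
    end
    # ...or, without touching code, set DD_PROFILING_NO_SIGNALS_WORKAROUND_ENABLED=true
    # in the environment.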


Having said the above, your report is definitely a head-scratcher. For context: the new "CPU Profiling 2.0" feature relies on first installing a SIGPROF signal handler on your Ruby process and then periodically sending SIGPROF signals to the same process to trigger data collection.

The Profiling timer expired message and exit code 155 are the system defaults for when no SIGPROF signal handler is installed. You can see this with any app that doesn't override the default (which is most of them). Here's an example with curl, for instance (signal 27 is SIGPROF):

(screenshot of a terminal session omitted)
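
To make the mechanism concrete, here's a small standalone Ruby sketch (illustrative, not dd-trace-rb code) showing both behaviours, with and without a SIGPROF handler installed:

    # Child process: no SIGPROF handler installed, so the default action (terminate) applies;
    # a shell would report "Profiling timer expired" and exit status 128 + 27 = 155.
    child = fork { Process.kill('PROF', Process.pid); sleep }
    _, status = Process.waitpid2(child)
    puts status.termsig                   # => 27 (SIGPROF)

    # Same signal with a handler installed: the process runs the handler and keeps going.
    Signal.trap('PROF') { puts 'profiler tick' }
    Process.kill('PROF', Process.pid)
    sleep 0.1                             # give the handler a chance to run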

The profiler goes through a lot of checks when it installs the signal handler, and it turns itself off if anything fails. So it should not be possible for the profiler to send SIGPROF signals without the signal handler being correctly in place.

  1. If possible, do you mind re-running with debug logging enabled and sharing the resulting logs from the failing process?

  2. Can you share your Gemfile.lock as well? I'm curious in particular if there's any gem there that has the potential to "clobber" the dd-trace-rb signal handler after it's been installed.

arich commented 1 year ago

Hi @ivoanjo! Thanks so much for your detailed reply.

  1. Yes, we are using the profiler. We start the web server with ddtracerb.
    web: bundle exec ddtracerb exec rails server -b 0.0.0.0
    sidekiq: bundle exec ddtracerb exec sidekiq
    
    bash-5.1# echo $DD_PROFILING_ENABLED
    true
    bash-5.1# echo $DD_VERSION

    bash-5.1# echo $DD_ENV
    staging
    bash-5.1# echo $DD_SERVICE


Not all the env vars are set that are listed [here](https://docs.datadoghq.com/profiler/enabling/ruby/?tab=environmentvariables), but version and service don't seem important? This is for our web server environment. 

2. I'll try this out! Thank you. I was sure there must be some sort of flag but I couldn't find it documented anywhere. 

3. I'll get back to you on this once I've enabled debugging. I'm hopeful that we can reproduce it quickly. 

4. [Here's a gist of our Gemfile.lock](https://gist.github.com/arich/5a4858d68511195d66867e52b96e001f)

5. One other thing to note here is that we have an odd setup for datadog agent - we run it as another process running in the same docker container, both managed by s6-overlay. We do this so we can transmit sidekiq metrics over UDP to be submitted to datadog, and this seemed like the cleanest option available. I'm hoping this has nothing to do with it, but I thought I should mention that the ruby process is running within a lightweight process supervisor. 
arich commented 1 year ago
  1. After reading more about debug mode:

    We do NOT recommend use of this feature in production or other sensitive environments, as it can be very verbose under load. It’s best to use this in a controlled environment where you can control application load.

I went back to check whether this is happening on our staging environment as well. It doesn't appear to have happened on staging, even though we deploy to staging after every pull request is merged, while we only deploy to production manually, about once a day.

Given this warning, I'm not super comfortable turning on debug mode in production. But it is strange that it doesn't happen on staging, where we have more deploys occurring. I wonder if there's some config difference between the two that we should sync up.
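
(For reference, my understanding is that enabling debug logging would look roughly like the sketch below; the setting names are the standard ddtrace debug options as I understand them, so double-check them against the docs.)

    # In an initializer:
    Datadog.configure do |c|
      c.diagnostics.debug = true
    end
    # ...or via the environment: DD_TRACE_DEBUG=true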

ivoanjo commented 1 year ago

Thanks for the answers!

  1. [...] Not all the env vars are set that are listed here, but version and service don't seem important? This is for our web server environment.

Version and service are both optional yet recommended :wink:

Setting the version enables you to easily compare deployments of the service (diff my performance profile vs the previous version). Setting the service allows you to control the name of the service that shows up in profiling.
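
For example, a minimal sketch of setting these alongside the rest of the configuration (the values are illustrative):

    Datadog.configure do |c|
      c.service = 'my-web-app'        # hypothetical service name
      c.env     = ENV['DD_ENV']       # e.g. "staging" or "production"
      c.version = ENV['DD_VERSION']   # e.g. a git SHA or release tag
    end
    # (Setting DD_SERVICE / DD_ENV / DD_VERSION in the environment works as well.)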

  2. I'll try this out! Thank you. I was sure there must be some sort of flag but I couldn't find it documented anywhere.

Yeah, I do admit it's kinda hidden in the profiler troubleshooting page, but it's great you reached out since we definitely want to try to fix all of the cases that need this workaround :smile:

  3. [...] Here's a gist of our Gemfile.lock

Thanks for sharing! I went through the list and the only thing that jumped out at me is that you're using "sentry-ruby" which optionally can also do profiling, but it relies on the stackprof gem which I don't see in your Gemfile so I guess you're not using that feature?

  4. One other thing to note here is that we have an odd setup for datadog agent - we run it as another process running in the same docker container, both managed by s6-overlay. We do this so we can transmit sidekiq metrics over UDP to be submitted to datadog, and this seemed like the cleanest option available. I'm hoping this has nothing to do with it, but I thought I should mention that the ruby process is running within a lightweight process supervisor.

This should be fine, I don't think it'll impact profiling (or other features).

  5. After reading more about debug mode: [...]

I went back to check whether this is happening on our staging environment as well. It doesn't appear to have happened on staging, even though we deploy to staging after every pull request is merged, while we only deploy to production manually, about once a day.

Given this warning, I'm not super comfortable turning on debug mode in production. But it is strange that it doesn't happen on staging, where we have more deploys occurring. I wonder if there's some config difference between the two that we should sync up.

I'm definitely curious if you are able to spot any difference between your staging and production environment that may be triggering this on one but not the other.

As for the impact of the debug mode, completely understandable! Would you instead be up for trying to run with a custom branch where I can add some useful, yet non-verbose extra logging information?

arich commented 1 year ago
  1. Yeah, we considered using Sentry for APM and profiling for a bit way back, but went with Datadog instead. We still use Sentry for error tracking. We should have both Sentry tracing and profiling disabled:

    Sentry.init do |config|
      # ...
      config.traces_sample_rate = 0
      # ...
    end
  2. I'll reach out to aptible support again and see if they have a clue why it's happening for one environment but not the other. They've said there are no hardware changes between deploys that succeed and deploys that fail, but maybe there's something else. I know that our staging instance is multi-tenant, while the production instance is single tenant. Still, this shouldn't be an issue.

Yeah, we could run a custom branch with extra logging! Again, we'll have to wait for a deploy failure to get info but it will be worth it.

Thank you for all the work you're putting in on this, I really appreciate it.

ivoanjo commented 1 year ago

Yeah, we could run a custom branch with extra logging! Again, we'll have to wait for a deploy failure to get info but it will be worth it.

Thank you for all the work you're putting in on this, I really appreciate it.

Great! I'll put something together and share it here in the next days!

In the meantime, if you get the chance, it would be great to try running with the no signals workaround enabled and see if you still get the issues! I'd like to use that experiment to confirm that the issue is indeed coming from signals sent by the profiler.

arich commented 1 year ago

Sure, will do!

arich commented 1 year ago

Aptible support responded just now:

Hey Andrew, I'm wondering if this is a Docker version issue. All App instances on shared-us-west-1 (where bl-staging is) are version 20.10.20 while some of them were 20.10.7 still on your production stack. I've drained the 20.10.7 instances (which prevents future deploys from being placed on those instances), which may resolve the issue here. — Michael

For now I'm leaving the no signals workaround disabled and we'll see if we get any deploy failures. 2/2 deploys today were fine with ddtrace v1.12.1.

arich commented 1 year ago

Hi, updating you on the latest. We got some more consistent deploy failures last week so I'm going to go ahead and turn on the no signals workaround for now. I talked to Aptible support again and they're not sure why we're experiencing this on prod but not on staging (shared infrastructure).

Unfortunately nothing really stands out - I don't have any extra insight here besides the operation errors, which you also see when the operation fails. I've confirmed they're all on the latest docker version, and everything matches up with the shared Stack. Do you see any errors on the shared stack at all, or are they still just isolated to your dedicated stack?

Thanks for your help!

ivoanjo commented 1 year ago

Got it, thanks for the update!

In that case, I'll prepare a branch that adds the needed logging so we can track this down. I'll add it here once it's ready.

ivoanjo commented 1 year ago

Branch for testing is ready! You can use it by changing your Gemfile/gems.rb:

gem 'ddtrace', git: 'https://github.com/datadog/dd-trace-rb.git', branch: 'ivoanjo/missing-signal-handler'

Every ~10 seconds, it should print one of

[ddtrace] Skipping send signal, Signal handler is OK in pid=202941!
[ddtrace] Sending signal, Signal handler is OK in pid=202937!

If it detects the signal handler is missing, it will still let the Ruby process terminate immediately, but it will ask Ruby to print out a crash log. It will look something like

cpu_and_wall_time_worker.rb:70: [BUG] [ddtrace] Signal handler missing in pid=184713, handler=0x0000000000000000

ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]

-- Control frame information -----------------------------------------------
...

-- Ruby level backtrace information ----------------------------------------
...

-- C level backtrace information -------------------------------------------
...

-- Other runtime information -----------------------------------------------
...

Let me know if you're able to test it, and what output you get! :pray:

arich commented 1 year ago

Hey! Sorry for the long delay here. I have it running on prod and it shows me:

[ddtrace] Skipping send signal, Signal handler is OK in pid=69!
E, [2023-08-31T18:49:13.396304 #69] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/bundler/gems/dd-trace-rb-01251ded587d/lib/datadog/core/remote/negotiation.rb:39:in `endpoint?') agent reachable but does not report /v0.7/config
E, [2023-08-31T18:49:18.399886 #69] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/bundler/gems/dd-trace-rb-01251ded587d/lib/datadog/core/remote/negotiation.rb:39:in `endpoint?') agent reachable but does not report /v0.7/config
[ddtrace] Skipping send signal, Signal handler is OK in pid=69!

Haven't hit a faulty deploy yet with the debugging on, but we have had deploy failures in the past week.

ivoanjo commented 1 year ago

Thanks for the update! Other than the added logging, nothing else is changed, so the chances of the issue occurring should remain the same.

Also, since the branch is based on an older release of dd-trace-rb, let me know if you'd like me to rebase it atop dd-trace-rb 1.14.0.

arich commented 1 year ago

We got a deployment failure! Do you see anything interesting in here? I can post the whole deploy log in a gist if that's helpful, but I think this is the most important part.

INFO -- : COMPLETED (after 12.94s): Start metric collectors for reef-sidekiq to metric drain datadog-production-metrics
INFO -- : WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks

INFO -- : WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks

INFO -- : WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks

INFO -- : WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks

INFO -- : WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
INFO -- : WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
ERROR -- : 1 of 6 HTTP health checks failed
ERROR -- : Container for web (running ["bundle", "exec", "ddtracerb", "exec", "rails", "server", "-b", "0.0.0.0"]) failed to start (exited with status 155 after 18 seconds).
ERROR -- : s6-rc: info: service s6rc-oneshot-runner: starting
ERROR -- : s6-rc: info: service s6rc-oneshot-runner successfully started
ERROR -- : s6-rc: info: service fix-attrs: starting
ERROR -- : s6-rc: info: service fix-attrs successfully started
ERROR -- : s6-rc: info: service legacy-cont-init: starting
ERROR -- : s6-rc: info: service legacy-cont-init successfully started
ERROR -- : s6-rc: info: service legacy-services: starting
ERROR -- : services-up: info: copying legacy longrun dogstatsd (no readiness notification)
ERROR -- : s6-rc: info: service legacy-services successfully started
ERROR -- : 2023-09-06 15:23:34 UTC | DSD | WARN | (pkg/util/log/log.go:618 in func1) | Unknown environment variable: DD_PROFILING_ENABLED
ERROR -- : 2023-09-06 15:23:34 UTC | DSD | WARN | (pkg/util/log/log.go:618 in func1) | Unknown environment variable: DD_APPLICATION_KEY
ERROR -- : 2023-09-06 15:23:34 UTC | DSD | WARN | (pkg/util/log/log.go:618 in func1) | Unknown environment variable: DD_AGENT_HOST
ERROR -- : 2023-09-06 15:23:34 UTC | DSD | INFO | (pkg/util/log/log.go:590 in func1) | 0 Features detected from environment:
ERROR -- : 2023-09-06 15:23:34 UTC | DSD | INFO | (comp/forwarder/defaultforwarder/default_forwarder.go:241 in NewDefaultForwarder) | Retry queue storage on disk is disabled
ERROR -- : 2023-09-06 15:23:34 UTC | DSD | INFO | (pkg/aggregator/time_sampler.go:52 in NewTimeSampler) | Creating TimeSampler #0
ERROR -- : 2023-09-06 15:23:34 UTC | DSD | INFO | (comp/forwarder/defaultforwarder/default_forwarder.go:366 in Start) | Forwarder started, sending to 1 endpoint(s) with 1 worker(s) each: "https://7-47-0-app.agent.datadoghq.com" (1 api key(s))
ERROR -- : 2023-09-06 15:23:41 UTC | DSD | INFO | (pkg/metadata/host/host.go:141 in getNetworkMeta) | could not get network metadata: could not detect network ID
ERROR -- : 2023-09-06 15:23:41 UTC | DSD | INFO | (pkg/serializer/serializer.go:410 in sendMetadata) | Sent metadata payload, size (raw/compressed): 1079/585 bytes.
ERROR -- : 2023-09-06 15:23:41 UTC | DSD | INFO | (comp/dogstatsd/listeners/udp.go:95 in Listen) | dogstatsd-udp: starting to listen on 127.0.0.1:8125
ERROR -- : 2023-09-06 15:23:42 UTC | DSD | INFO | (comp/forwarder/defaultforwarder/transaction/transaction.go:386 in internalProcess) | Successfully posted payload to "https://7-47-0-app.agent.datadoghq.com/intake/", the agent will only log transaction success every 500 transactions
ERROR -- : [ddtrace] Skipping send signal, Signal handler is OK in pid=69!
ERROR -- : Profiling timer expired
ERROR -- : s6-rc: info: service legacy-services: stopping
ERROR -- : 2023-09-06 15:23:47 UTC | DSD | INFO | (cmd/dogstatsd/subcommands/start/command.go:252 in handleSignals) | Received signal 'terminated', shutting down...
ERROR -- : 2023-09-06 15:23:47 UTC | DSD | WARN | (cmd/dogstatsd/subcommands/start/command.go:266 in StopAgent) | Some components were unhealthy: [aggregator metadata-host metadata-inventories dogstatsd-main forwarder]
ERROR -- : 2023-09-06 15:23:47 UTC | DSD | INFO | (cmd/dogstatsd/subcommands/start/command.go:285 in StopAgent) | See ya!
ERROR -- : s6-rc: info: service legacy-services successfully stopped
ERROR -- : s6-rc: info: service legacy-cont-init: stopping
ERROR -- : s6-rc: info: service legacy-cont-init successfully stopped
ERROR -- : s6-rc: info: service fix-attrs: stopping
ERROR -- : s6-rc: info: service fix-attrs successfully stopped
ERROR -- : s6-rc: info: service s6rc-oneshot-runner: stopping
ERROR -- : s6-rc: info: service s6rc-oneshot-runner successfully stopped
ERROR -- : For more information, review: https://deploy-docs.aptible.com/docs/http-health-checks-failed/
ERROR -- : FAILED: Wait up to 300s for containers in service web to respond to HTTP health checks
WARN -- : STARTING ROLLBACK: This operation encountered an error, Aptible is now attempting to rollback your infrastructure to its initial state.
ivoanjo commented 1 year ago

Thanks for the info. I was hoping that rather than getting the "Profiling timer expired" again, we'd get the extra info I've added, but it looks like my suspicion wasn't correct. :thinking:

I'll admit I'm a bit puzzled: if the SIGPROF causing the issue is indeed sent by the profiler, how can the profiler observe that the signal handler is in place, yet, when the signal is actually sent, the process acts as if no handler is installed?

Here's a few questions that I hope will reveal some leads to investigate:

I can post the whole deploy log in a gist if that's helpful, but I think this is the most important part.

  1. Yes please! I'm curious to see the order of some of the logs.

  2. By any chance, do your full deploy logs include the process ids (pids) for each of the logged messages? Since there's multiple processes contributing to the logs, I'm hoping to double-check which messages come from which ones.

  3. From your Gemfile.lock, I spotted you're using the puma web server. Can you share its configuration? In particular, I'm curious if puma is using fork to create more worker processes or if only a single puma process is running.

  4. What Datadog agent version are you running?

  5. Since it's been a while since we initially discussed this part, did you happen to see any issues when using the no_signals_workaround_enabled = true option, or do they only show up when this option is not used?

  6. Do the errors always happen during deployment, or do they happen afterwards too? It's curious that missing signal handlers would only happen soon after the process is started, but not after that.

arich commented 1 year ago
  1. https://gist.github.com/arich/70b9f70bd07756f7c27d664e3cb91b0b Here you go!

  2. I don't think there are PIDs in there, sorry. I could ask Aptible support for additional debugging logs?

  3. Puma config:

    
    # Puma can serve each request in a thread from an internal thread pool.
    # The `threads` method setting takes two numbers: a minimum and maximum.
    # Any libraries that use thread pools should be configured to match
    # the maximum value specified for Puma. Default is set to 5 threads for minimum
    # and maximum; this matches the default thread size of Active Record.
    #
    max_threads_count = ENV.fetch("RAILS_MAX_THREADS", 5)
    min_threads_count = ENV.fetch("RAILS_MIN_THREADS") { max_threads_count }
    threads min_threads_count, max_threads_count

    # Specifies the port that Puma will listen on to receive requests; default is 3000.
    # port ENV.fetch("PORT", 3000)

    # Specifies the environment that Puma will run in.
    # environment ENV.fetch("RAILS_ENV", "development")

    # Specifies the pidfile that Puma will use.
    pidfile ENV.fetch("PIDFILE", "tmp/pids/server.pid")

    # Allow puma to be restarted by `rails restart` command.
    plugin :tmp_restart


4. We weren't up to date until last week, but now we are:

    dd-agent@f6d8dea3f136:/$ agent version
    Agent 7.47.0 - Commit: b8c31bf - Serialization version: v5.0.90 - Go version: go1.20.6


5. I actually never got around to enabling that - you got the debug branch up fast enough that I didn't end up trying it. I can revisit that!

6. We continuously get this log in a rails console process when we had the debugger branch added:

    E, [2023-09-01T15:10:09.495874 #70] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/bundler/gems/dd-trace-rb-01251ded587d/lib/datadog/core/remote/negotiation.rb:39:in `endpoint?') agent reachable but does not report /v0.7/config
    E, [2023-09-01T15:10:14.499527 #70] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/bundler/gems/dd-trace-rb-01251ded587d/lib/datadog/core/remote/negotiation.rb:39:in `endpoint?') agent reachable but does not report /v0.7/config
    [ddtrace] Skipping send signal, Signal handler is OK in pid=70!


I'm not sure if those are the errors you mean. I've since reverted to the normal ddtrace release for a quieter rails console :) 

I guess I don't know whether we're seeing containers die and get auto-replaced by Aptible; I'm not aware of that behavior.

Thanks for all your help @ivoanjo! This has been a real journey. 
ivoanjo commented 1 year ago

Hey! Thanks, Andrew, for your careful replies to all my questions.

Log analysis

I spent some time looking through the full logs, and it seems that there are multiple Ruby processes starting up and using profiling.

For instance, it looks like the first one is the rake db:migrate task:

2023-09-06 15:22:39 +0000 INFO -- : WAITING FOR: Run before_release from .aptible.yml: bundle exec rake db:migrate deploy:track

2023-09-06 15:23:27 +0000 INFO -- : I, [2023-09-06T15:23:17.480126 #69]  INFO -- ddtrace: [ddtrace] DATADOG CONFIGURATION - {"date":"2023-09-06T15:23:17+00:00","os_name":"x86_64-pc-linux-musl", ...  "service":"rake",  ... "profiling_enabled":true, ...

Notice the service: "rake" and the profiling_enabled: true (ellipses mine). At the point that message seems to be printed, this process has pid 69 (the #69 after the timestamp).

This seems to finish without issues:

2023-09-06 15:23:28 +0000 INFO -- : COMPLETED (after 60.52s): Run before_release from .aptible.yml: bundle exec rake db:migrate deploy:track

And then the web app tries to start:

2023-09-06 15:23:28 +0000 INFO -- : STARTING: Start app containers for service web

and then it fails ~88 seconds later:

2023-09-06 15:24:56 +0000 ERROR -- : Container for web (running ["bundle", "exec", "ddtracerb", "exec", "rails", "server", "-b", "0.0.0.0"]) failed to start (exited with status 155 after 18 seconds).

and then roughly at the same time these show up:

2023-09-06 15:24:56 +0000 ERROR -- : [ddtrace] Skipping send signal, Signal handler is OK in pid=69!
2023-09-06 15:24:56 +0000 ERROR -- : Profiling timer expired

This is weird because of the pid also being 69. Linux usually doesn't reuse pids until they "wrap around" the maximum, so this message would appear to be referring to the rake process. But then rake was marked as completed 90 seconds ago!

And then later it shows up again ~22 seconds later:

2023-09-06 15:25:18 +0000 ERROR -- : [ddtrace] Skipping send signal, Signal handler is OK in pid=69!
2023-09-06 15:25:18 +0000 ERROR -- : Profiling timer expired

...which is extra weird. What is this second process that failed with "profiling timer expired"? If it's the original pid 69, how can it fail twice?

Earlier on, I mentioned that your setup using s6-overlay should be fine and shouldn't impact this issue, but now I'm less sure :thinking:.

If you can ask Aptible support about the possibility of getting logs that include pids, that would be great -- it would help in understanding which Ruby processes are actually failing, and when.

Configuration

One odd thing I noticed in your configuration is that you mentioned you were running the Datadog agent together with the Ruby processes, but the Ruby processes are not configured to talk to this agent:

2023-09-06 15:23:27 +0000 INFO -- : I, [2023-09-06T15:23:17.480126 #69]  INFO -- ddtrace: [ddtrace] DATADOG CONFIGURATION - {..., "agent_url":"http://elb-brightline-35646.aptible.in:8126?timeout=30"

Should they be configured to report to localhost instead?
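
(If so, a rough sketch of what that could look like, using the usual ddtrace agent settings -- adjust to your actual setup:)

    Datadog.configure do |c|
      c.agent.host = 'localhost'   # assuming an agent listening in the same container
      c.agent.port = 8126
    end
    # ...or set DD_AGENT_HOST / DD_TRACE_AGENT_PORT in the environment.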

Other notes

We continuously get this log in a rails console process when we had the debugger branch added:

E, [2023-09-01T15:10:09.495874 #70] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/bundler/gems/dd-trace-rb-01251ded587d/lib/datadog/core/remote/negotiation.rb:39:in `endpoint?') agent reachable but does not report /v0.7/config

I'm not sure if those are the errors you mean. I've since reverted to the normal ddtrace release for a quieter rails console :)

Right! We've fixed that since in https://github.com/DataDog/dd-trace-rb/pull/3011 , but the debug branch did not include this PR yet.

Next steps

I actually never got around to enabling that - you got the debug branch up fast enough that I didn't end up trying it. I can revisit that!

Next, I'd like you to try enabling the no signals workaround with the latest ddtrace (1.14.0), if you can.

If it works with the no signals workaround enabled, we'll have confirmation that the rest of the profiler is working fine. If it doesn't, I think that will also provide valuable insights.

arich commented 1 year ago

Thank you! Very detailed, love the analysis.

Re: configuration for our agent URL: sorry, I may have misspoken (and I had to check our configuration again to verify this). This is where our setup may be especially weird: we run the Datadog agent in a separate app in the same VPC, but we run dogstatsd in the same container so we can send metrics via UDP (UDP endpoints are not an Aptible feature we can use between containers). We could try pointing at an agent running as a sidecar, but I think we're currently only running the dogstatsd executable as a sidecar, not the full Datadog agent.

Our DD_* env vars:

DD_AGENT_HOST=elb-brightline-35646.aptible.in
DD_API_KEY=[REDACTED]
DD_APPLICATION_KEY=[REDACTED]
DD_ENV=production
DD_HOSTNAME=dd-agent-production
DD_PROFILING_ENABLED=true

Our config/initializers/datadog.rb:

# frozen_string_literal: true

require "faraday"
require "net/http"
require "redis"
require "rest_client"

unless Rails.env.development? || Rails.env.test?
  Datadog.configure do |c|
    c.tracing.instrument :faraday, split_by_domain: true
    c.tracing.instrument :http, split_by_domain: true
    c.tracing.instrument :rails, log_injection: true, service_name: "rails"
    c.tracing.instrument :redis
    c.tracing.instrument :rest_client
    c.tracing.instrument :sidekiq, log_injection: true, service_name: "sidekiq"
    c.tracing.instrument :pg, comment_propagation: "full", service_name: "postgres"

    c.version = APP_VERSION
  end

  Datadog::Tracing.before_flush(
    # Do not trace health checks to data dog
    Datadog::Tracing::Pipeline::SpanFilter.new { |span| span.name == "rack.request" && span.get_tag("http.url") == "/health" }
  )
end

I will:

  1. try to get logs with PIDs,
  2. enable the no signals workaround

I'm a bit confused as to why ddtrace would be starting for rake db:migrate. I thought it only started for the actual app processes themselves; we have it set in our Procfile for Aptible:

web: bundle exec ddtracerb exec rails server -b 0.0.0.0
periodic: bundle exec ddtracerb exec sidekiq -q periodic
sidekiq: bundle exec ddtracerb exec sidekiq
sidekiq_priority: bundle exec ddtracerb exec sidekiq -q critical -q default

Maybe this isn't the recommended way to start it anymore?

I also just got a CI test failure where it tried to connect to datadog in a spec and failed. Probably unrelated but thought I'd mention it, given we're talking about ddtrace running when I'm not expecting it to be. DD_TRACE_TEST_MODE_ENABLED is not set, and should default to false.

Test output:

.............E, [2023-09-13T13:55:15.816276 #25] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/gems/ddtrace-1.14.0/lib/ddtrace/transport/http/client.rb:37:in `rescue in send_request') Internal error during Datadog::Transport::HTTP::Client request. Cause: Errno::ECONNREFUSED Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126) Location: /usr/local/lib/ruby/3.2.0/net/http.rb:1271:in `initialize'

I apologize if that turned into a ramble! Thanks for all your help.

ivoanjo commented 1 year ago

Thanks for clarifying the agent setup question. It shouldn't affect this case, so don't worry about changing it.

Also, thank you for sharing your configuration. That actually clarifies the next question:

I'm a bit confused as to why ddtrace would be starting for rake db:migrate. I thought it only started for the actual app processes themselves; we have it set in our Procfile for Aptible:

Maybe this isn't the recommended way to start it anymore?

Adding ddtracerb exec is indeed the recommended way, because it ensures the profiler starts as early as possible in the process and thus catches e.g. application load.

But if something else triggers dd-trace-rb start-up (for instance, running the Datadog.configure block), the profiler will still start in that situation as well, because it picks up DD_PROFILING_ENABLED.

Apologies for the confusing behavior in this case! My recommendation would be to prefix the rake task to disable the profiler for it, e.g. DD_PROFILING_ENABLED=false bundle exec rake db:migrate deploy:track.

I also just got a CI test failure where it tried to connect to datadog in a spec and failed. Probably unrelated but thought I'd mention it, given we're talking about ddtrace running when I'm not expecting it to be. DD_TRACE_TEST_MODE_ENABLED is not set, and should default to false.

This may be related to a change we've done recently, where the library sends some telemetry to the Datadog agent when it starts up. In the latest version of ddtrace we've disabled this in a number of non-production setups (e.g. https://github.com/DataDog/dd-trace-rb/pull/3039 ); if you do see it on 1.14.0, feel free to open a ticket and we can look into fixing our detection :)

arich commented 1 year ago

Test failures

Cool! Y'all are on top of it 💯 We are on 1.14.0 as of 9/6/2023, so it's weird that it popped up. I'll see if it pops up again and submit a ticket if it does.

Takeaways:

I will:

  1. Add the no signals workaround
  2. Disable the profiler for the db:migrate/deploy:track task that runs before release.

Thank you! Glad to hear we don't need to mess with our agent config for now 😄

arich commented 1 year ago

We hit the same error again (Internal error during Datadog::Transport::HTTP::Client request. Cause: Errno::ECONNREFUSED Failed to open TCP connection to 127.0.0.1:8126 during tests), so I'll open a new GitHub issue. That's what you meant, right? Not a Datadog Zendesk ticket?

ivoanjo commented 1 year ago

so I'll open a new GitHub issue. That's what you meant, right? Not a Datadog Zendesk ticket?

Yes, github issue! It's always OK to open a support/zendesk ticket, but in this case they'd have to escalate to us anyway ;)

arich commented 1 year ago

So far so good on deploy stability. I'm wondering what you want the next steps to be: whether you'd like to close this ticket, or whether you have more debugging steps you want to try. I know the no signals workaround is not the preferred long term solution, and I'd love to be a vanilla datadog customer without workarounds. But at the same time I think we're getting what we need and I could forget about this in a few weeks.

Thanks for all your help!

ivoanjo commented 1 year ago

Glad to know the "no signals" workaround is working!

I know the no signals workaround is not the preferred long term solution, and I'd love to be a vanilla datadog customer without workarounds. But at the same time I think we're getting what we need and I could forget about this in a few weeks.

Very fair! To be clear, the "no signals" workaround is a feature of the profiler that will remain for as long as the profiler needs to use signals to work. So our push to avoid its use is about improving the ecosystem and data gathering, not because the workaround is going away.

(I do have some ideas on how to profile without signals [in a better way than what the no signals workaround does] that I hope to prototype at some point, but no specific plans for now!)

I'm wondering what you want the next steps to be: whether you'd like to close this ticket, or whether you have more debugging steps you want to try.

Thanks for the patience so far!

I have an idea for what we could try next, but if you're happy with the current configuration and want to move on and do more interesting things than help me muck about in the profiler, it's 100% understandable, and I'll be happy to close the ticket.

(Out of curiosity, my idea was to prepare a branch that changes the profiler to use a different signal than SIGPROF. This would allow us to check if there's something specifically interfering with SIGPROF on your setup; also, some signals generate core dumps when they aren't handled, which would be useful for debugging.)

arich commented 1 year ago

I'm down to keep going! I mostly thought you'd wanna close this out.

I'm going to close out my active ticket on my side, so my response time may be a bit slower if I'm busy with something else. All that sounds super cool, and I should be able to keep testing on my side.

Thanks!

ivoanjo commented 1 year ago

Awesome!

Here's a new branch with the experiment I described above:

gem 'ddtrace', git: 'https://github.com/datadog/dd-trace-rb.git', branch: 'ivoanjo/experiment-use-sigtrap'

And here's what to expect:

This branch is built atop v1.14.0 stable plus two changes: the profiler sends SIGTRAP instead of SIGPROF, and it tries to enable core dumps when it starts.

If everything goes well, and the issue we were seeing was specific to SIGPROF, the profiler should just work. Otherwise, if the same issue shows up, instead of Profiling timer expired you should observe a Trace/breakpoint trap message (which is the default system message for SIGTRAP).

Additionally, the profiler will try to enable core dumps, so if it goes well you should see a log message such as:

[ddtrace] Enabled core dumps. Maximum size: 18446744073709551615 Output pattern: 'core'

With both together, you may see Trace/breakpoint trap (core dumped) when the Ruby process dies, and hopefully you'll get a nice core dump that can be analyzed to get to the bottom of this issue.

Also, don't forget to disable the no signals workaround, otherwise none of these changes do anything ;)

ivoanjo commented 1 year ago

@arich did you by any chance get the opportunity to try it out?

arich commented 1 year ago

I have not yet! I'm sorry, things have been busy on our side. I'll try to do it in the next two weeks. Thank you for your help!

arich commented 1 year ago

Hey @ivoanjo, we deployed your fork yesterday. Thanks for your patience with us!

We're not seeing any different output in our logs so far; is that expected? I was expecting to see an Enabled core dumps log somewhere in the Datadog logs.

I'll wait and see and let you know if we hit deploy errors! Thanks again for your help.

ivoanjo commented 1 year ago

We're not seeing any different output in our logs so far; is that expected? I was expecting to see an Enabled core dumps log somewhere in the Datadog logs.

That's indeed unexpected. The crash handler functionality is independent of the rest of ddtrace, and thus logs via Ruby's Kernel.warn rather than through the ddtrace logger. Here's the list of messages it can show, all of them prefixed with [ddtrace].

If it's not showing up, that's definitely curious; let me know if you don't get any of them, and I can tweak the logging.

I'll wait and see and let you know if we hit deploy errors! Thanks again for your help.

Fingers crossed that we'll finally get to the bottom of this! :)

arich commented 12 months ago

We are getting: [ddtrace] Core dumps already enabled, nothing to do! So it looks like it's returning early?

ivoanjo commented 12 months ago

Oh, good! That means we should be in a good state to collect a core dump if the app fails due to the SIGTRAP signal.

arich commented 11 months ago

Hi! We got a deploy failure. Deploys have been pretty reliable (few failures); I think this is the second one I've seen. I think this is the relevant part; here's a gist of the full logs if you need it.

13:57:47 INFO: WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
13:57:57 INFO: WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
13:58:07 INFO: WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
13:58:18 INFO: WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
13:58:28 INFO: WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
13:58:38 INFO: WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
13:58:48 INFO: WAITING FOR: Wait up to 300s for containers in service web to respond to HTTP health checks
13:58:52 ERROR: 1 of 6 HTTP health checks failed
13:58:52 ERROR: Container for web (running ["bundle", "exec", "ddtracerb", "exec", "rails", "server", "-b", "0.0.0.0"]) failed to start (exited with status 133 after 11 seconds).
13:58:52 ERROR: s6-rc: info: service s6rc-oneshot-runner: starting
13:58:52 ERROR: s6-rc: info: service s6rc-oneshot-runner successfully started
13:58:52 ERROR: s6-rc: info: service fix-attrs: starting
13:58:52 ERROR: s6-rc: info: service fix-attrs successfully started
13:58:52 ERROR: s6-rc: info: service legacy-cont-init: starting
13:58:52 ERROR: s6-rc: info: service legacy-cont-init successfully started
13:58:52 ERROR: s6-rc: info: service legacy-services: starting
13:58:52 ERROR: services-up: info: copying legacy longrun dogstatsd (no readiness notification)
13:58:52 ERROR: s6-rc: info: service legacy-services successfully started
13:58:52 ERROR: 2023-11-27 13:57:25 UTC | DSD | WARN | (pkg/util/log/log.go:665 in func1) | Unknown environment variable: DD_PROFILING_ENABLED
13:58:52 ERROR: 2023-11-27 13:57:25 UTC | DSD | WARN | (pkg/util/log/log.go:665 in func1) | Unknown environment variable: DD_APPLICATION_KEY
13:58:52 ERROR: 2023-11-27 13:57:25 UTC | DSD | WARN | (pkg/util/log/log.go:665 in func1) | Unknown environment variable: DD_AGENT_HOST
13:58:52 ERROR: 2023-11-27 13:57:25 UTC | DSD | INFO | (pkg/util/log/log.go:625 in func1) | 0 Features detected from environment:
13:58:52 ERROR: 2023-11-27 13:57:25 UTC | DSD | INFO | (comp/forwarder/defaultforwarder/default_forwarder.go:242 in NewDefaultForwarder) | Retry queue storage on disk is disabled
13:58:52 ERROR: 2023-11-27 13:57:25 UTC | DSD | INFO | (pkg/aggregator/time_sampler.go:52 in NewTimeSampler) | Creating TimeSampler #0
13:58:52 ERROR: 2023-11-27 13:57:25 UTC | DSD | INFO | (comp/forwarder/defaultforwarder/default_forwarder.go:367 in Start) | Forwarder started, sending to 1 endpoint(s) with 1 worker(s) each: "https://7-49-1-app.agent.datadoghq.com" (1 api key(s))
13:58:52 ERROR: [ddtrace] Core dumps already enabled, nothing to do!
13:58:52 ERROR: Trace/breakpoint trap (core dumped)
13:58:52 ERROR: s6-rc: info: service legacy-services: stopping
13:58:52 ERROR: 2023-11-27 13:57:32 UTC | DSD | INFO | (cmd/dogstatsd/subcommands/start/command.go:252 in handleSignals) | Received signal 'terminated', shutting down...
13:58:52 ERROR: 2023-11-27 13:57:32 UTC | DSD | INFO | (pkg/metadata/host/host.go:83 in getNetworkMeta) | could not get network metadata: could not detect network ID
13:58:52 ERROR: 2023-11-27 13:57:32 UTC | DSD | INFO | (pkg/serializer/serializer.go:410 in sendMetadata) | Sent metadata payload, size (raw/compressed): 1104/605 bytes.
13:58:52 ERROR: 2023-11-27 13:57:32 UTC | DSD | WARN | (cmd/dogstatsd/subcommands/start/command.go:266 in StopAgent) | Some components were unhealthy: [aggregator metadata-host metadata-inventories dogstatsd-main forwarder]
13:58:52 ERROR: 2023-11-27 13:57:32 UTC | DSD | INFO | (cmd/dogstatsd/subcommands/start/command.go:285 in StopAgent) | See ya!
13:58:52 ERROR: s6-rc: info: service legacy-services successfully stopped
13:58:52 ERROR: s6-rc: info: service legacy-cont-init: stopping
13:58:52 ERROR: s6-rc: info: service legacy-cont-init successfully stopped
13:58:52 ERROR: s6-rc: info: service fix-attrs: stopping
13:58:52 ERROR: s6-rc: info: service fix-attrs successfully stopped
13:58:52 ERROR: s6-rc: info: service s6rc-oneshot-runner: stopping
13:58:52 ERROR: s6-rc: info: service s6rc-oneshot-runner successfully stopped
13:58:52 ERROR: For more information, review: https://deploy-docs.aptible.com/docs/http-health-checks-failed/
13:58:52 ERROR: FAILED: Wait up to 300s for containers in service web to respond to HTTP health checks

Thanks for your help!

ivoanjo commented 11 months ago

Thanks for the reply!

So I guess this settles it -- the signal causing the app to fail is definitely coming from the profiler, since now we're seeing the error that an unhandled SIGTRAP causes:

13:58:52 ERROR: Trace/breakpoint trap (core dumped)

and the exit status also matches SIGTRAP (133 - 128 = 5, which is SIGTRAP's signal number)

13:58:52 ERROR: Container for web (running ["bundle", "exec", "ddtracerb", "exec", "rails", "server", "-b", "0.0.0.0"]) failed to start (exited with status 133 after 11 seconds).

Would you be willing to share with me the core dump that was generated, so I can investigate further? I need to dig deeper to understand what's happening here, where the profiler seems to be sending signals but not handling them properly. I'm at a loss as to what's going on; this issue is definitely unique :)

If so, you'll probably want to share the core dump privately -- if you drop a note to support at https://docs.datadoghq.com/help/ and tell them you need to send a large file to debug an issue, they'll guide you through the steps to get it across :)

arich commented 11 months ago

I'm not sure how to get you the core dump! We're deploying with Aptible, a service a lot like Heroku, by running a command-line script. Would the core dumps be on the local computer of the developer who deployed, or somewhere in Aptible? I'm guessing the latter. I'm pulling the logs from their website. I can spin up a support ticket with them to try to get additional logs from the deployment failure.

ivoanjo commented 11 months ago

I spent some time looking through the aptible docs as well.

There's "aptible ssh" but it can't access an existing instance, it always starts a new one, and if the crash is very rare, so it doesn't seem to be very feasible to get it there.

I'm thinking, since they seem to have enabled the capability to get core dumps at the kernel level (hence the message that core dumps were already enabled), maybe they use such a facility themselves, and have some tooling to access it.Thus, it may be worth asking support if they have anything in place for that.

arich commented 11 months ago

Right, I asked:

Hello! I'm still trying to track down this issue. Datadog has given us a branch that executes a core dump when the deploy fails. Is it possible you could get us the core dump associated with this deployment failure? https://app.aptible.com/operations/60840806

They responded:

Thanks for contacting support. Just to make sure, are you asking for a dump within that failed deployment? Is it saving locally? Because that container would have been terminated and cleared from a failed deployment. Could it be possible for Datadog to forward those dumps to an S3 bucket or some other destination?

So, it seems like no, but I could push them on whether they have anything in place to capture it.

ivoanjo commented 11 months ago

Ah, that's a shame. We'll need to do it the harder way, sorry! :(

So when a core dump gets generated, there's a setting to tell Linux where to place it and how to name it. This is the core pattern option.

You can probably use the aptible ssh feature to start a container and look at the /proc/sys/kernel/core_pattern file, and then figure out where the files are being placed.

E.g. here's an example from my machine, where I cause a crash to happen to check the core dump as well:

$ cat /proc/sys/kernel/core_pattern
core
$ bundle exec ruby -e "require 'datadog/kit/enable_core_dumps'; Process.kill('TRAP', Process.pid)"
[ddtrace] Enabled core dumps. Maximum size: 18446744073709551615 Output pattern: 'core'
Trace/breakpoint trap (core dumped)
$ ls -lah core.*
-rw------- 1 ivo.anjo ivo.anjo 87M Dec  1 14:23 core.443703
-rw------- 1 ivo.anjo ivo.anjo 73M Dec  1 14:29 core.452239

# Just as an extra example, here's how to see the stack trace of the VM when it crashed,
# given the core dump:
$ gdb ruby core.452239
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0  0x000073e954e4275b in kill () at ../sysdeps/unix/syscall-template.S:120
120 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x000073e954e4275b in kill () at ../sysdeps/unix/syscall-template.S:120
#1  0x000073e955435381 in rb_f_kill (argc=2, argv=0x73e9546ff048) at signal.c:486
#2  0x000073e9554b70a7 in vm_call_cfunc_with_frame (ec=0x5a706a9c49b0, reg_cfp=0x73e9547fef90, calling=<optimised out>)
    at /usr/share/rvm/src/ruby-3.1.4/vm_insnhelper.c:3037
#3  0x000073e9554c62b4 in vm_sendish (method_explorer=<optimised out>, block_handler=<optimised out>, cd=<optimised out>, reg_cfp=<optimised out>, 
    ec=<optimised out>) at /usr/share/rvm/src/ruby-3.1.4/vm_callinfo.h:349
#4  vm_exec_core (ec=0x5a706a9c49b0, initial=5) at /usr/share/rvm/src/ruby-3.1.4/insns.def:778
#5  0x000073e9554cba8f in rb_vm_exec (ec=0x5a706a9c49b0, mjit_enable_p=true) at vm.c:2211
#6  0x000073e9552dbf51 in rb_ec_exec_node (ec=ec@entry=0x5a706a9c49b0, n=n@entry=0x73e9511bf388) at eval.c:280
#7  0x000073e9552e244a in ruby_run_node (n=0x73e9511bf388) at eval.c:321
#8  0x00005a706a8ae17f in main (argc=<optimised out>, argv=<optimised out>) at ./main.c:47
(gdb) 

After getting the core file, it's usually helpful to compress it (since it tends to be a few tens/hundreds of megabytes and usually compresses well), and then use some tool to upload it somewhere.

If you have a way to upload it to somewhere (e.g. S3), that works great! If not, there's an internal support process for Datadog to help out, so let me know if you prefer to use that approach.

arich commented 11 months ago

Right, but will this be helpful if it's just from a random ssh container? We still have the issue that we can't get the core dump from the deployment container, because it's killed immediately after the deploy fails. Right?

ivoanjo commented 11 months ago

Right, but will this be helpful if it's just from a random ssh container? We still have the issue that we can't get the core dump from the deployment container, because it's killed immediately after the deploy fails. Right?

Yes, that's right. I mentioned using aptible ssh since the first step is figuring out where the core dump is getting (erm...) dumped, and I assumed they provide the same configuration on those app instances.

After that comes the more annoying part, which is somehow tweaking the running scripts (s6-overlay?) to automatically go to that folder, zip up the core, and upload it when the crash gets detected.
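
Purely as a sketch of that idea (the paths, bucket name, and use of the aws CLI are all assumptions, not something from this thread), such a hook could boil down to:

    # Hypothetical post-crash hook: compress any core files found and push them to S3.
    require 'shellwords'

    Dir.glob('/app/core*').each do |core|
      archive = "#{core}.gz"
      system("gzip -c #{Shellwords.escape(core)} > #{Shellwords.escape(archive)}") &&
        system('aws', 's3', 'cp', archive, 's3://my-crash-dumps-bucket/')
    end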

I realize that it's a big ask; so if you prefer to stay with the "no signals" workaround, since we've seen it works fine, that's always an option too -- we plan for that feature to stay.

arich commented 11 months ago

Gotcha. Yeah, we're going to go back to the "no signals" workaround for now and may revisit this later on. This is our busy season, so I don't see this happening for at least ~3 months, as long as we have a good workaround. I'm not super proficient at capturing this specific error in s6-overlay and uploading it somewhere, so that does sound like more complexity than I'm willing to invest in here.

Thank you so much for your help and patience on this long running issue! Closing.

ivoanjo commented 11 months ago

Thanks for all the info! This is/was definitely a tough one :bow:.