DataDog / datadog-ci-rb

Ruby library for Datadog test visibility
https://docs.datadoghq.com/continuous_integration/tests/ruby
Other
8 stars 4 forks source link

SSL Validation issues when submitting traces due to time shifting in tests #247

Open costi opened 1 week ago

costi commented 1 week ago

Is your feature request related to a problem? Please describe. I have a ruby app using libfaketime, which is loaded with LD_PRELOAD and changes the LIBC level time time functions. I use faketime because I need to have the time synchronized between the ruby app and postgres. I think most people doing ruby/rails are using Timecop, which doesn't stub the LIBC level functions. My problem I often fail SSL certificate validations with datadog-ci when the thread is submitting the test in the worker thread.

callback_traces') Error during traces flush: dropped 1 items. 
Cause: SSL_connect returned=1 errno=0 state=error: certificate verify failed 
Location: /opt/rubies/ruby-2.4.9/lib/ruby/2.4.0/net/protocol.rb:44:in connect_nonblock'

Describe the goal of the feature I would like to continue using the worker thread for performance reasons, but I need a way to temporarily reset the time to the actual system time during the HTTP trace submission to avoid SSL validation issues. After the submission, the time can revert to the test time. Ideally, I would like to wrap the HTTP "send" operation in a Thread.exclusive block where the time is restored to real-time for the duration of the network operation.

Could you provide guidance on whether this is possible or suggest an alternative approach for handling this?

Describe alternatives you've considered I found this option that makes datadog-ci send the data syncronously config.tracing.test_mode.enabled = true and datadog flushes the traces after each example. Between examples the time is correct because I change the time in before :each blocks. This works but I bet it's not the recommended way.

Additional context I've considered switching to synchronous flushing by enabling config.tracing.test_mode.enabled = true, which makes datadog-ci flush traces after each test example. This works because the time is correct between examples (I manipulate time in before :each blocks). However, this approach reduces the performance benefits of using asynchronous trace submission. I 'm also worried on the extra load placed on datadog servers with the sync flush. I also don't know how this test_mode.enabled will handle datadog being down. The async worker drops the trace and continues. The app we're using datadog is HUGE, and tests are large, running in 60 workers for about 15 minutes. From the initial test runs, I saw we have about 10k rspec "examples".

How does datadog-ci help you? We are in the process of onboarding datadog-ci to enhance our test observability and performance monitoring. Although we're early in the adoption phase, we see great potential in the tool and are eager to fully integrate it into our workflow.

anmarchenko commented 1 week ago

Hi @costi! Thanks for reporting this. Today I learned about libfaketime, I'll see if we can support it out of the box in the future.

Your suggestion with monkey patching the transport layer to return real time before HTTP call could work. Using Thread.exclusive could lead to horrible performance degradation though: as far as I understand it will make all other threads to wait for HTTP call to finish. Try it and tell me if it really works for you.

As for the best place to do it, this will depend on the library version you use because in the last year the library had a lot of development. Also, bear in mind that you will have to adapt the monkey patch with almost every library upgrade because it changed a lot.

You mention that config.tracing.test_mode.enabled makes trace submission synchronous: this is no longer true since a long time; also we dropped support for Ruby versions before 2.7 and your Ruby version is 2.4.9, so your library version is most likely a very old one.

What datadog-ci version do you use?

costi commented 1 week ago

This is my Gemfile.lock on my branch where I'm adding datadog-ci. I'm actually upgrading from ddtrace (0.46.0) to (1.23.3).

    datadog-ci (0.8.3)
      msgpack
    ddtrace (1.23.3)
      datadog-ci (~> 0.8.1)
      debase-ruby_core_source (= 3.3.1)
      libdatadog (~> 7.0.0.1.0)
      libddwaf (~> 1.14.0.0.0)
      msgpack

Due to our version of ruby this is highest ddtrace version we can use. Still, a big step up from 0.46.0. We're planning a ruby upgrade later this year and since you're moving so fast with datadog-ci-rb I don't want to do deep integrations into a moving target. Since you already removed test mode, I'm going to try now to switch to a combination of Timecop in ruby + libfaketime in postgres.

anmarchenko commented 1 week ago

The timecop would cause a lot less headaches indeed.

If you use datadog-ci earlier than 1.0, you will have to set time provider manually to have your traces traced with correct time as described here.

Starting with version 1.0 datadog-ci does it automatically, so no manual configuration is needed.