HoneyryderChuck / httpx

(Mirror) An HTTP client library for ruby
https://gitlab.com/os85/httpx
Apache License 2.0

HTTPX requests are not responding in a test suite after 0.17.0->0.18.0 update #3

Closed CrazyPit closed 2 years ago

CrazyPit commented 2 years ago

We are using rspec + WebMock. After updating from 0.17.0 to 0.18.0, HTTPX started hanging on some requests. It's hard to debug since it mostly hangs on CI, but overall it looks like in some cases an HTTPX.delete request is not finishing.

How we use HTTPX in this case:

@session = HTTPX.with(origin: origin).plugin(:persistent).with_timeout(connect_timeout: 5, total_timeout: 5)

...

  def delete_messages
    response = session.delete('/api/v1/messages')
    raise unless response.status == 200
  end

  # Disable JIM
  def disable_jim
    response = session.delete(JIM_ENDPOINT)
    raise unless [200, 404].include?(response.status)
  end

How we use WebMock:

  config.around(:each, :mailhog) do |example|
    WebMock.disable_net_connect!(allow: config.mailhog.http_url)
    MailHogHelper.mailhog.delete_messages
    MailHogHelper.mailhog.enable_jim

    example.run

    WebMock.disable_net_connect!
  end
  config.after(:suite) do
    MailHogHelper.mailhog.disable_jim
  end

I'm not sure it's specific to the delete method, but I didn't hit this issue with the other methods used during the tests (post, put, get).

HoneyryderChuck commented 2 years ago

Hi @CrazyPit ,

Thx for the report.

There was a change related to webmock in that release, but I don't think it relates to the issue you're describing.

Could you provide a contained reproduction of the case? The snippets you've given don't include all the details, as I don't see:

HoneyryderChuck commented 2 years ago

Here's a gist confirming the issue is not the delete request:

require "webmock"
require "httpx"
require "httpx/adapters/webmock"

include WebMock::API

WebMock.enable!
WebMock.disable_net_connect!

stub_request(:delete, "http://www.example.com/api/v1/messages").
    to_return(body: "abc", status: 200,
        headers: { 'Content-Length' => 3 })

session = HTTPX.with(origin: "http://www.example.com")
                .plugin(:persistent)
                .with_timeout(connect_timeout: 5, total_timeout: 5)

response = session.delete('/api/v1/messages')
puts response.to_s #=> 'abc'

I'm going to close it for now. Feel free to reopen the ticket when you have a reproduction proving that the issue came from httpx.

CrazyPit commented 2 years ago

OK, sure. It's hard to reproduce locally, but it seems to reproduce consistently in a Docker image. I just need to prepare one without the sensitive company code that I can't share.

CrazyPit commented 2 years ago

But this is definitely related to HTTPX, since the diff for the failing PR is:

diff --git a/Gemfile b/Gemfile
index 2725c04..1da6640 100644
--- a/Gemfile
+++ b/Gemfile
@@ -50,7 +50,7 @@ gem 'mail'

 # HTTP interaction
-gem 'httpx', '0.17.0'
+gem 'httpx', '0.18.0'

 # SMTP interaction
 gem 'net-smtp'
diff --git a/Gemfile.lock b/Gemfile.lock
index 9012df3..7822341 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -234,9 +234,8 @@ GEM
       googleapis-common-protos-types (~> 1.0)
     hashdiff (1.0.1)
     http-2-next (0.5.0)
-    httpx (0.17.0)
+    httpx (0.18.0)
       http-2-next (>= 0.4.1)
-      timers
     i18n (1.8.11)
       concurrent-ruby (~> 1.0)
     io-console (0.5.9)
@@ -405,7 +404,6 @@ GEM
       activerecord (>= 5)
     thor (1.1.0)
     timeout (0.2.0)
-    timers (4.3.3)
     tzinfo (2.0.4)
       concurrent-ruby (~> 1.0)
     tzinfo-data (1.2021.5)
@@ -449,7 +447,7 @@ DEPENDENCIES
   dry-schema!
   dry-validation!
   event_schemas!
-  httpx (= 0.17.0)
+  httpx (= 0.18.0)
   karafka
   karafka-testing
   mail

🤷

HoneyryderChuck commented 2 years ago

I don't doubt you, but I need an actual way to reproduce and fix the issue here. Besides a reproduction like the one I gave you, which you can use as a template, you can also run that test with HTTPX_DEBUG=2 and paste the output here, or bisect between 0.17 and 0.18 locally and pinpoint the commit that broke your test. Either way, I'll need some more data.
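
For instance (the ref below is just a placeholder, not an actual commit), Bundler's git support lets you point your Gemfile at individual commits between the two releases and re-run the failing spec until you narrow it down:

# Gemfile: try a candidate commit between v0.17.0 and v0.18.0, re-run the
# failing spec, and repeat, narrowing the range until the offending commit.
gem 'httpx', github: 'HoneyryderChuck/httpx', ref: '<candidate-sha>'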

CrazyPit commented 2 years ago

> locally and pinpoint the commit that broke your test.

Yeah, that basically was my idea.

CrazyPit commented 2 years ago

Hey. This commit is a problem: https://github.com/HoneyryderChuck/httpx/commit/b2a5371fe21c5d5310b858be39182e55d54523e7.

So basically, when I use gem 'httpx', github: 'HoneyryderChuck/httpx', ref: '8f00539d6d983ede598f2dc48bf7c60ef2a14a82', everything is fine.

When I use gem 'httpx', github: 'HoneyryderChuck/httpx', ref: 'b2a5371fe21c5d5310b858be39182e55d54523e7', the spec hangs.

HoneyryderChuck commented 2 years ago

Alright, thx for digging into this @CrazyPit .

So, the code path blocking you seems to be the one raising the total timeout exception. Looking at it, I'm not sure what caused this, as it seems a legit change, and the same pattern is used a bit everywhere in the codebase.

Do you know how to work with "pry-byebug"? If so, could you insert a breakpoint before the ex.set_backtrace(caller) line and see where that takes you? I'm assuming you can run this on your machine locally, so open your local copy of httpx by doing bundle open httpx.

Also, it would be great if you could provide a log dump when running that test with HTTPX_DEBUG=2 (after deleting any company info from it).

CrazyPit commented 2 years ago

Ok, I'll try it soon.

ojab commented 2 years ago

The first offending commit according to my testing is 81a41d889c0cb7931b7dd103c600c15624111007. Calling rb_backtrace() in gdb during the hang says:

    from /home/ojab/src/httpx/lib/httpx/chainable.rb:8:in `delete'
    from /home/ojab/src/httpx/lib/httpx/session.rb:36:in `request'
    from /home/ojab/src/httpx/lib/httpx/session.rb:181:in `send_requests'
    from /home/ojab/src/httpx/lib/httpx/session.rb:205:in `receive_requests'
    from /home/ojab/src/httpx/lib/httpx/session.rb:205:in `loop'
    from /home/ojab/src/httpx/lib/httpx/session.rb:210:in `block in receive_requests'
    from /home/ojab/src/httpx/lib/httpx/pool.rb:29:in `next_tick'
    from /home/ojab/src/httpx/lib/httpx/pool.rb:29:in `catch'
    from /home/ojab/src/httpx/lib/httpx/pool.rb:37:in `block in next_tick'
    from /home/ojab/src/httpx/lib/httpx/selector.rb:122:in `select'
    from /home/ojab/src/httpx/lib/httpx/selector.rb:75:in `select_many'
    from /home/ojab/src/httpx/lib/httpx/selector.rb:75:in `select'

Log with HTTPX_DEBUG=2: https://gist.github.com/632a80e6851b0193fb5ff6d42c1507a1

HoneyryderChuck commented 2 years ago

@ojab is this the same issue being reported by @CrazyPit? Do you have a reproduction script I could use?

If not, can you set up a breakpoint where that TotalTimeoutError exception is triggered, and find out:

Also, are these persistent connections? As per the docs, total timeout sets the lifetime for which a connection should be in use, so that would directly conflict with keeping connections around.

ojab commented 2 years ago

Yeah, it's the same codebase. Unfortunately I have no reproducer; I tried to make one but failed. It's reproducible during bundle exec rspec in one of our projects, though. Connections to the mailhog instance (172.17.0.2) are done via

session = HTTPX.with(origin: origin).plugin(:persistent).with_timeout(connect_timeout: 5, total_timeout: 5)

and connections to sendgrid are made via the (webmocked)

def session
  @session ||= HTTPX.with(origin: API_URL).plugin(:persistent)
end

so both are persistent.

[6] pry(#<#<Class:0x00000000059d79c0>>)> whereami

From: /home/ojab/src/httpx/lib/httpx/connection.rb:244 HTTPX::Connection#timeout:

    235: def timeout
    236:   if @total_timeout
    237:     return @total_timeout unless @connected_at
    238: 
    239:     elapsed_time = @total_timeout - Utils.elapsed_time(@connected_at)
    240: 
    241:     if elapsed_time.negative?
    242:       ex = TotalTimeoutError.new(@total_timeout, "Timed out after #{@total_timeout} seconds")
    243:       binding.pry
 => 244:       ex.set_backtrace(caller)
    245:       on_error(ex)
    246:       return
    247:     end
    248: 
    249:     return elapsed_time
    250:   end
    251: 
    252:   return @timeout if defined?(@timeout)
    253: 
    254:   return @options.timeout[:connect_timeout] if @state == :idle
    255: 
    256:   @options.timeout[:operation_timeout]
    257: end

[7] pry(#<#<Class:0x00000000059d79c0>>)> @origin
=> #<URI::HTTP http://172.17.0.2:8025>
[8] pry(#<#<Class:0x00000000059d79c0>>)> ex
=> #<HTTPX::TotalTimeoutError: Timed out after 5 seconds>
…
[12] pry(#<#<Class:0x00000000059d79c0>>)> pending
=> []

…and I have no idea how to check the number of outstanding requests, if it's not pending.

HoneyryderChuck commented 2 years ago

I think we can conclude a few things from here:

I believe the behaviour doesn't come from the commit you mentioned; that's just the commit you were able to reproduce it with. The behaviour actually comes from this MR, which changed the way total timeout works.

I think the fact it worked before was pure chance, a quirk of how the timers gem was integrated. The issue is that the total_timeout feature, or at least what people expect it to be, is fundamentally incompatible with the persistent plugin. I remember we had this conversation about the inconsistency of that timeout in some other thread. But fundamentally, if the connection remains open, and total timeout is the "total connection timeout", then this is going to happen all the time.

Still, it shouldn't have happened, because the persistent plugin is built on top of the retries plugin, which should have at least retried after that total timeout (so, this total timeout feature is totally useless...). So why didn't it? Because the error should have caused the outstanding request to land here. That should have been guaranteed by this error handler, which should have caught the "total timeout error", propagated it to the outstanding requests, and the retry logic should have picked it up on .fetch_response.

Could you try setting the breakpoint in the same place, follow the callstack, and see why it doesn't retry the request? There could be a bug there.

But the overall message is: don't mix the persistent plugin with total timeout. As we've already established, it's not a "total request timeout", which I think is what you actually want (and which I'm still planning to implement).
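
In the meantime, something along these lines keeps the persistent plugin but drops total_timeout in favour of per-operation timeouts (a sketch reusing the mailhog origin and endpoint from this thread):

require "httpx"

# Keep the persistent plugin, but bound each connect and each read/write
# instead of the connection's total lifetime, which a persistent
# connection is meant to exceed.
session = HTTPX.with(origin: "http://172.17.0.2:8025")
               .plugin(:persistent)
               .with_timeout(connect_timeout: 5, operation_timeout: 5)

response = session.delete("/api/v1/messages")
puts response.status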

ojab commented 2 years ago

The reproducer ended up being quite simple:

require 'httpx'

session = HTTPX.with(origin: 'http://google.com')
  .plugin(:persistent)
  .with_timeout(total_timeout: 1)

session.get('/')
sleep 2
session.get('/')

and it points me to https://github.com/HoneyryderChuck/httpx/commit/efddd72caa8fa6cdd1ea15ee8e298dc0d780e891

HoneyryderChuck commented 2 years ago

Many thx for the repro script, it made it very easy to pinpoint what went on.

You can follow this MR with the fix.

As I suspected, total timeout errors should lead to at least one request retry, and the patched bug was in the way.
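
Concretely, with the fix, ojab's reproducer above should now complete instead of hanging: the total timeout error on the expired connection is propagated to the outstanding request, which the retries logic (loaded by :persistent) then replays on a fresh connection.

require "httpx"

session = HTTPX.with(origin: "http://google.com")
               .plugin(:persistent)
               .with_timeout(total_timeout: 1)

session.get("/")
sleep 2                      # let the connection outlive its total_timeout
response = session.get("/")  # previously hung in the selector; now retried
puts response.status         # a real response status, rather than a hang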

HoneyryderChuck commented 2 years ago

v0.18.2 has been released. I'll close the issue; feel free to reopen in case it isn't fixed.
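
Upgrading is just a version bump in the Gemfile:

# Gemfile
gem 'httpx', '0.18.2'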

CrazyPit commented 2 years ago

Great, thanks a lot. The 0.18.2 update PR is green 🎉