Closed CrazyPit closed 2 years ago
Hi @CrazyPit ,
Thx for the report.
There was a webmock-related change in that release, but I don't think it relates to the issue you're describing.
Could you provide a contained reproduction of the case? The snippets you give me don't give me all the details, as I don't see:
- which origin you are passing to HTTPX;
- whether the MailHogHelper calls are hiding anything webmock-related.

Here's a gist confirming the issue is not the delete request:
require "webmock"
require "httpx"
require "httpx/adapters/webmock"
include WebMock::API
WebMock.enable!
WebMock.disable_net_connect!
stub_request(:delete, "http://www.example.com/api/v1/messages").
  to_return(body: "abc", status: 200,
            headers: { 'Content-Length' => 3 })
session = HTTPX.with(origin: "http://www.example.com")
              .plugin(:persistent)
              .with_timeout(connect_timeout: 5, total_timeout: 5)
response = session.delete('/api/v1/messages')
puts response.to_s #=> 'abc'
I'm going to close it for now. Feel free to reopen the ticket when you have a reproduction proving that the issue came from httpx.
Ok, sure. It's hard to reproduce locally, but it seems stable in a Docker image. I just need to prepare one without the sensitive company code that I can't share.
But this is definitely related to HTTPX, since the diff for the failing PR is:
diff --git a/Gemfile b/Gemfile
index 2725c04..1da6640 100644
--- a/Gemfile
+++ b/Gemfile
@@ -50,7 +50,7 @@ gem 'mail'
# HTTP interaction
-gem 'httpx', '0.17.0'
+gem 'httpx', '0.18.0'
# SMTP interaction
gem 'net-smtp'
diff --git a/Gemfile.lock b/Gemfile.lock
index 9012df3..7822341 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -234,9 +234,8 @@ GEM
googleapis-common-protos-types (~> 1.0)
hashdiff (1.0.1)
http-2-next (0.5.0)
- httpx (0.17.0)
+ httpx (0.18.0)
http-2-next (>= 0.4.1)
- timers
i18n (1.8.11)
concurrent-ruby (~> 1.0)
io-console (0.5.9)
@@ -405,7 +404,6 @@ GEM
activerecord (>= 5)
thor (1.1.0)
timeout (0.2.0)
- timers (4.3.3)
tzinfo (2.0.4)
concurrent-ruby (~> 1.0)
tzinfo-data (1.2021.5)
@@ -449,7 +447,7 @@ DEPENDENCIES
dry-schema!
dry-validation!
event_schemas!
- httpx (= 0.17.0)
+ httpx (= 0.18.0)
karafka
karafka-testing
mail
🤷
I don't doubt you, but I need an actual way to reproduce and fix the issue here. Besides a reproduction like the one I gave you that you can use as a template, you can also try to run that test with HTTPX_DEBUG=2
and paste the output here; or you can bisect between 0.17 and 0.18 locally and pinpoint the commit that broke your test. Either way, I'll need some more data.
Yeah, that basically was my idea.
Hey. This commit is a problem: https://github.com/HoneyryderChuck/httpx/commit/b2a5371fe21c5d5310b858be39182e55d54523e7.
So basically, when I use gem 'httpx', github: 'HoneyryderChuck/httpx', ref: '8f00539d6d983ede598f2dc48bf7c60ef2a14a82',
it is ok.
When I use gem 'httpx', github: 'HoneyryderChuck/httpx', ref: 'b2a5371fe21c5d5310b858be39182e55d54523e7',
the spec hangs.
Alright, thx for digging into this @CrazyPit .
So, the code path blocking you seems to be the one raising the total timeout exception. Looking at it, I'm not sure what caused this, as it seems a legit change which is used a bit everywhere in the codebase.
Do you know how to work with "pry-byebug"? If so, could you insert a breakpoint before the ex.set_backtrace(caller)
line and see where that takes you? I'm assuming you can run this on your machine locally, so open your local copy of httpx
by doing bundle open httpx.
Also, it would be great if you could provide a log dump from running that test with HTTPX_DEBUG=2
(and delete any company info from it beforehand).
Ok, I'll try it soon.
First offending commit according to my testing is 81a41d889c0cb7931b7dd103c600c15624111007
Calling rb_backtrace() in gdb during the hang says:
from /home/ojab/src/httpx/lib/httpx/chainable.rb:8:in `delete'
from /home/ojab/src/httpx/lib/httpx/session.rb:36:in `request'
from /home/ojab/src/httpx/lib/httpx/session.rb:181:in `send_requests'
from /home/ojab/src/httpx/lib/httpx/session.rb:205:in `receive_requests'
from /home/ojab/src/httpx/lib/httpx/session.rb:205:in `loop'
from /home/ojab/src/httpx/lib/httpx/session.rb:210:in `block in receive_requests'
from /home/ojab/src/httpx/lib/httpx/pool.rb:29:in `next_tick'
from /home/ojab/src/httpx/lib/httpx/pool.rb:29:in `catch'
from /home/ojab/src/httpx/lib/httpx/pool.rb:37:in `block in next_tick'
from /home/ojab/src/httpx/lib/httpx/selector.rb:122:in `select'
from /home/ojab/src/httpx/lib/httpx/selector.rb:75:in `select_many'
from /home/ojab/src/httpx/lib/httpx/selector.rb:75:in `select'
Log with HTTPX_DEBUG=2: https://gist.github.com/632a80e6851b0193fb5ff6d42c1507a1
@ojab is this the same issue being reported by @CrazyPit ? Do you have a reproduction script I could use?
If not, can you set up a breakpoint where that TotalTimeoutError
exception is triggered, and find out:
- the @origin of the connection.

Also, are these persistent connections? As per the docs, the total timeout sets the lifetime for which a connection should be in use, so that'd be directly in conflict with keeping connections around.
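To make the conflict concrete, here is a plain-Ruby sketch (not httpx code, though the names mirror @total_timeout and @connected_at from httpx's Connection#timeout): with a persistent connection, the total timeout is measured from when the connection was established, so once the connection has outlived the timeout, every subsequent request on it fails immediately.

```ruby
# Sketch of the per-connection deadline arithmetic (illustrative, not httpx internals).
total_timeout = 5.0
connected_at  = 0.0   # monotonic clock reading when the connection was opened
now           = 6.0   # the persistent connection has been alive for 6 seconds

elapsed   = now - connected_at
remaining = total_timeout - elapsed
puts remaining            # => -1.0
puts remaining.negative?  # => true: the next request on this connection
                          #    would hit TotalTimeoutError right away
```

Note that nothing here depends on how long the *request* takes; only the age of the connection matters, which is why the persistent plugin triggers it so reliably.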
Yeah, it's the same codebase. Unfortunately no reproducer; I tried to make one, but failed. It's reproducible during bundle exec rspec
in one of our projects though.
Connections to the mailhog instance (172.17.0.2) are done via session = HTTPX.with(origin: origin).plugin(:persistent).with_timeout(connect_timeout: 5, total_timeout: 5)
and connections to sendgrid are made via [webmocked] def session; @session ||= HTTPX.with(origin: API_URL).plugin(:persistent); end,
so both are persistent.
[6] pry(#<#<Class:0x00000000059d79c0>>)> whereami
From: /home/ojab/src/httpx/lib/httpx/connection.rb:244 HTTPX::Connection#timeout:
235: def timeout
236: if @total_timeout
237: return @total_timeout unless @connected_at
238:
239: elapsed_time = @total_timeout - Utils.elapsed_time(@connected_at)
240:
241: if elapsed_time.negative?
242: ex = TotalTimeoutError.new(@total_timeout, "Timed out after #{@total_timeout} seconds")
243: binding.pry
=> 244: ex.set_backtrace(caller)
245: on_error(ex)
246: return
247: end
248:
249: return elapsed_time
250: end
251:
252: return @timeout if defined?(@timeout)
253:
254: return @options.timeout[:connect_timeout] if @state == :idle
255:
256: @options.timeout[:operation_timeout]
257: end
[7] pry(#<#<Class:0x00000000059d79c0>>)> @origin
=> #<URI::HTTP http://172.17.0.2:8025>
[8] pry(#<#<Class:0x00000000059d79c0>>)> ex
=> #<HTTPX::TotalTimeoutError: Timed out after 5 seconds>
…
[12] pry(#<#<Class:0x00000000059d79c0>>)> pending
=> []
…and I have no idea how to check the number of outstanding requests if it's not pending.
I think we can conclude a few things from here:
I believe the behaviour doesn't come from the commit you mentioned; that's just the commit with which you were able to reproduce it. The behaviour actually comes from this MR, which changed the way the total timeout works.
I think the fact that it worked before was just pure chance, a quirk of how the timers
gem was integrated. The issue is that the total_timeout
feature, or at least what people expect it to be, is fundamentally incompatible with the persistent
plugin. I remember we had this conversation about the inconsistency of that timeout in some other thread. But fundamentally, if the connection remains open, and the total timeout is the "total connection timeout", then this is going to happen all the time.
Still, it shouldn't have happened, because the persistent
plugin is built on top of the retries
plugin, which should have at least retried after that total timeout (otherwise this total timeout feature would be totally useless). So why didn't it? Because the error should have caused the outstanding request to land here. That should have been guaranteed by this error handler, which should have caught the "total timeout error" and propagated it to outstanding requests, and the retry logic should have picked up on it in .fetch_response.
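The expected behaviour described above can be sketched in plain Ruby (illustrative only, not httpx's actual implementation; `fetch_with_retries` is a hypothetical name): a timeout error on the first attempt should be swallowed by one retry, and only resurface if the retry budget is exhausted.

```ruby
# Hedged sketch of what the retries plugin is expected to do when an
# outstanding request fails with a (total) timeout error.
def fetch_with_retries(max_retries: 1)
  attempts = 0
  begin
    yield
  rescue IOError => e
    attempts += 1
    retry if attempts <= max_retries  # retry once, like the retries plugin
    raise e                           # budget exhausted: surface the error
  end
end

calls = 0
result = fetch_with_retries do
  calls += 1
  raise IOError, "timed out" if calls == 1  # first attempt fails
  "response"                                # retry succeeds
end
puts result  # => response
```

The bug being discussed is precisely that this retry never happened for TotalTimeoutError, so the request hung instead of being re-sent.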
Could you try setting the breakpoint in the same place, follow the callstack, and see why it doesn't retry the request? There could be a bug there.
But the overall message is: don't mix the persistent plugin with the total timeout. As we've already established, it's not a "total request timeout", which I think is what you want (and which I'm still planning to implement).
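Until a proper per-request total timeout exists, one workaround sketch (an assumption on my part, not an httpx feature; `request_with_deadline` is a hypothetical helper) is to enforce the deadline around each call with Ruby's stdlib Timeout, bearing in mind that Timeout.timeout interrupts via a watchdog thread and should be used with care:

```ruby
require "timeout"

# Workaround sketch: a per-request deadline, instead of a per-connection
# total_timeout. Timeout::Error is raised if the block takes too long.
def request_with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
end

puts request_with_deadline(5) { "fast response" }  # => fast response
```

This keeps the persistent connection alive across requests while still bounding how long any single request may take.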
Reproducer ended up quite simple:
require 'httpx'
session = HTTPX.with(origin: 'http://google.com')
              .plugin(:persistent)
              .with_timeout(total_timeout: 1)
session.get('/')
sleep 2
session.get('/')
and it points me to https://github.com/HoneyryderChuck/httpx/commit/efddd72caa8fa6cdd1ea15ee8e298dc0d780e891
Many thx for the repro script, it made it very easy to pinpoint what went on.
You can follow this MR with the fix.
As I suspected, total timeout errors should lead to at least 1 request retry, and the patched bug was in the way.
v0.18.2 has been released. I'll close the issue; feel free to reopen in case it isn't fixed yet.
Great, thanks a lot. The 0.18.2 update PR is green 🎉
We are using rspec + WebMock. After updating from 0.17.0 to 0.18.0, HTTPX started hanging on some requests. It's hard to debug since it mostly hangs on CI, but overall it looks like in some cases an
HTTPX.delete
request is not finishing. How do we use HTTPX for this case:
How do we use WebMock:
Not sure about the
delete
method, but I didn't get this issue with the other methods that are used during the tests (post, put, get).