Closed: asabil closed this issue 3 months ago
@asabil Hm, I may not have set any timeouts, which would make it hang forever. I'm open to a PR, or I'll investigate this next week.
I added some printouts in oidcc_http_util:request/4, and it does look like there is a "timer" problem:
[1720176242986], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration",[]} [{timeout,60000}] default
[1720176243470], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176258475], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration", []} [{timeout,60000}] default
[1720176258526], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176258647], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176273530], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration", []} [{timeout,60000}] default
[1720176273658], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176273806], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176273852], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288659], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration", []} [{timeout,60000}] default
[1720176288705], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288811], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288854], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288905], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
So it looks like we somehow end up being blocked by Auth0, because the number of jwks requests keeps accumulating (1, 2, 3, 4, ...).
For reference, this is the printout code:
io:format("[~B], HTTP: ~p ~p ~p ~p~n", [
os:system_time(millisecond),
Method,
Request,
HttpOpts,
HttpProfile
]),
Hm, are they setting a strange cache header that leads to continuous refreshing?
Looks like this helps, at least with the accumulation of timers:
diff --git a/src/oidcc_provider_configuration_worker.erl b/src/oidcc_provider_configuration_worker.erl
index d4539ec..a9cefa1 100644
--- a/src/oidcc_provider_configuration_worker.erl
+++ b/src/oidcc_provider_configuration_worker.erl
@@ -224,7 +224,8 @@ handle_continue(
%% @private
handle_info(backoff_retry, State) ->
{noreply, State, {continue, load_configuration}};
-handle_info(configuration_expired, State) ->
+handle_info(configuration_expired, #state{jwks_refresh_timer = JwksRefreshTimer} = State) ->
+ maybe_cancel_timer(JwksRefreshTimer),
{noreply, State#state{configuration_refresh_timer = undefined, jwks_refresh_timer = undefined},
{continue, load_configuration}};
handle_info(jwks_expired, State) ->
That being said, from my own experience I think it is better to use erlang:start_timer instead of timer:send_after, as the former allows matching against the currently active timer reference and avoids creating "parallel" timers such as in this case.
For example, with the current code I think it is still possible to hit a race condition where the timer being cancelled has already fired and the corresponding timeout message is sitting in the mailbox but has not yet been processed.
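For illustration, here is a minimal sketch of that pattern (a hypothetical module, not the actual oidcc worker code): erlang:start_timer/3 delivers {timeout, TimerRef, Msg}, so the reference stored in the state can be matched in handle_info/2 and a stale message from an already-replaced timer is simply dropped.

%% Hypothetical sketch, not part of oidcc.
-module(refresh_timer_sketch).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, handle_continue/2]).

-record(state, {jwks_refresh_timer :: reference() | undefined}).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    {ok, #state{}, {continue, load_jwks}}.

handle_continue(load_jwks, State) ->
    %% ... load the JWKS here, then schedule the next refresh ...
    TimerRef = erlang:start_timer(15000, self(), jwks_expired),
    {noreply, State#state{jwks_refresh_timer = TimerRef}}.

%% The reference matches the currently active timer: refresh.
handle_info({timeout, TimerRef, jwks_expired}, #state{jwks_refresh_timer = TimerRef} = State) ->
    {noreply, State#state{jwks_refresh_timer = undefined}, {continue, load_jwks}};
%% Any other jwks_expired timeout comes from an old, replaced timer: ignore it.
handle_info({timeout, _StaleRef, jwks_expired}, State) ->
    {noreply, State}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.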
Hm, are they setting a strange cache header that leads to continuous refreshing?
Yes, 15 seconds:
{"cache-control",
"public, max-age=15, stale-while-revalidate=15, stale-if-error=86400"},
@asabil Sorry for the long wait.
I can't reproduce the issue. There should only be a single message triggered after the JWKS reload is successful, yet you are seeing hundreds of messages in the queue.
Can you try to log each time handle_info or handle_continue is called, together with the message_queue_len, so that we can figure out what path those messages take and how they accumulate?
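For reference, a helper along these lines (a hypothetical sketch, not necessarily the exact code used) produces the format seen in the logs below when called at the top of handle_info/2 and handle_continue/2:

%% Hypothetical helper: prints a timestamp, the current mailbox length, and
%% the callback/message pair.
log_event(Callback, Msg) ->
    {message_queue_len, QLen} = erlang:process_info(self(), message_queue_len),
    io:format("[ts: ~B, qlen: ~B] ~s(~p)~n",
              [os:system_time(millisecond), QLen, Callback, Msg]).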
Sorry for the delay, finally got to add some logs:
[ts: 1722866476154, qlen: 0] handle_continue(load_configuration)
[ts: 1722866476629, qlen: 0] handle_continue(load_jwks)

[ts: 1722866491632, qlen: 0] handle_info(configuration_expired)
[ts: 1722866491633, qlen: 0] handle_continue(load_configuration)
[ts: 1722866491662, qlen: 0] handle_continue(load_jwks)
[ts: 1722866491811, qlen: 0] handle_info(jwks_expired)
[ts: 1722866491811, qlen: 0] handle_continue(load_jwks)

[ts: 1722866506666, qlen: 0] handle_info(configuration_expired)
[ts: 1722866506667, qlen: 0] handle_continue(load_configuration)
[ts: 1722866506793, qlen: 1] handle_continue(load_jwks)
[ts: 1722866506918, qlen: 1] handle_info(jwks_expired)
[ts: 1722866506918, qlen: 1] handle_continue(load_jwks)
[ts: 1722866506945, qlen: 0] handle_info(jwks_expired)
[ts: 1722866506945, qlen: 0] handle_continue(load_jwks)

[ts: 1722866521798, qlen: 0] handle_info(configuration_expired)
[ts: 1722866521799, qlen: 0] handle_continue(load_configuration)
[ts: 1722866521888, qlen: 0] handle_continue(load_jwks)
[ts: 1722866521918, qlen: 0] handle_info(jwks_expired)
[ts: 1722866521919, qlen: 0] handle_continue(load_jwks)
[ts: 1722866521946, qlen: 0] handle_info(jwks_expired)
[ts: 1722866521946, qlen: 0] handle_continue(load_jwks)
[ts: 1722866522394, qlen: 0] handle_info(jwks_expired)
[ts: 1722866522394, qlen: 0] handle_continue(load_jwks)

[ts: 1722866536889, qlen: 0] handle_info(configuration_expired)
[ts: 1722866536890, qlen: 0] handle_continue(load_configuration)
[ts: 1722866537015, qlen: 2] handle_continue(load_jwks)
[ts: 1722866537050, qlen: 1] handle_info(jwks_expired)
[ts: 1722866537051, qlen: 1] handle_continue(load_jwks)
[ts: 1722866537080, qlen: 0] handle_info(jwks_expired)
[ts: 1722866537080, qlen: 0] handle_continue(load_jwks)
[ts: 1722866537398, qlen: 0] handle_info(jwks_expired)
[ts: 1722866537398, qlen: 0] handle_continue(load_jwks)
[ts: 1722866537510, qlen: 0] handle_info(jwks_expired)
[ts: 1722866537510, qlen: 0] handle_continue(load_jwks)
I added empty lines to group the printouts together. As you can see, we somehow end up with accumulating timers that lead to excessive calls.
The qlen stays at 0 here because it only starts to grow once Auth0 (in my case) begins to rate limit us due to the excessive calls.
I still need to test the patch I posted above more extensively, but it should solve the problem.
With the patch above applied, the logs look like this:
[ts: 1722867041017, qlen: 0] handle_continue(load_configuration)
[ts: 1722867041670, qlen: 0] handle_continue(load_jwks)
[ts: 1722867056676, qlen: 0] handle_info(configuration_expired)
[ts: 1722867056676, qlen: 0] handle_continue(load_configuration)
[ts: 1722867056704, qlen: 0] handle_continue(load_jwks)
[ts: 1722867071707, qlen: 0] handle_info(configuration_expired)
[ts: 1722867071708, qlen: 0] handle_continue(load_configuration)
[ts: 1722867071847, qlen: 0] handle_continue(load_jwks)
[ts: 1722867086849, qlen: 0] handle_info(configuration_expired)
[ts: 1722867086849, qlen: 0] handle_continue(load_configuration)
[ts: 1722867086887, qlen: 0] handle_continue(load_jwks)
@asabil That looks promising. I’ll go ahead and merge the PR then. Thanks for debugging.
Which PR?
Just looked at the PR; that will not work. The fix is the one I posted in the comments above: jwks_refresh_timer is set to undefined in handle_info, which causes the maybe_cancel_timer call in handle_continue to be a no-op.
@asabil Oh, I misinterpreted your message. I believe you've pinpointed the exact issue though: the timer field is cleared in handle_info, so the timer cancellation can never work.
I'll revert and open a new PR to test.
@asabil Does the new PR work for you?
oidcc version
3.2.0
Erlang version
26.2.5
Elixir version
-
Summary
Calls to oidcc_provider_configuration_worker:get_provider_configuration/1 randomly time out. It seems that this is somehow related to httpc timing out? The following crash report is reported by Cowboy:
Current behavior
Inspecting the worker state shows that it is hanging in httpc:handle_answer/3:

Also, it seems like the worker process has accumulated a set of timer events:

Checking the stacktrace:
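For reference, that kind of information can be gathered from the running worker roughly as follows (the registered name my_oidcc_worker is a placeholder; this is not necessarily how the information in this report was collected):

%% Placeholder inspection from a remote shell: grab the gen_server state,
%% the current stacktrace, and any pending mailbox messages (e.g. timer events).
Pid = whereis(my_oidcc_worker),
State = sys:get_state(Pid),
Info = erlang:process_info(Pid, [current_stacktrace, message_queue_len, messages]).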
How to reproduce
Haven't been able to pinpoint the exact scenario leading to the hang/timeout yet.
Expected behavior
It shouldn't hang :)