Closed: asabil closed this issue 3 months ago
@asabil Hm, I may not have set any timeouts, which would make it hang forever. I'm open to a PR, or I'll investigate this next week.
I added some printouts in oidcc_http_util:request/4, and it does look like there is a "timer" problem:
[1720176242986], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration",[]} [{timeout,60000}] default
[1720176243470], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176258475], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration", []} [{timeout,60000}] default
[1720176258526], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176258647], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176273530], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration", []} [{timeout,60000}] default
[1720176273658], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176273806], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176273852], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288659], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration", []} [{timeout,60000}] default
[1720176288705], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288811], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288854], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288905], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
So it looks like we somehow end up being blocked by Auth0, because the number of jwks requests keeps accumulating (1, 2, 3, 4, ...).
For reference, this is the printout code:
io:format("[~B], HTTP: ~p ~p ~p ~p~n", [
os:system_time(millisecond),
Method,
Request,
HttpOpts,
HttpProfile
]),
Hm, are they setting a strange cache header that leads to continuous refreshing?
Looks like this helps, at least with the accumulation of timers:
diff --git a/src/oidcc_provider_configuration_worker.erl b/src/oidcc_provider_configuration_worker.erl
index d4539ec..a9cefa1 100644
--- a/src/oidcc_provider_configuration_worker.erl
+++ b/src/oidcc_provider_configuration_worker.erl
@@ -224,7 +224,8 @@ handle_continue(
%% @private
handle_info(backoff_retry, State) ->
{noreply, State, {continue, load_configuration}};
-handle_info(configuration_expired, State) ->
+handle_info(configuration_expired, #state{jwks_refresh_timer = JwksRefreshTimer} = State) ->
+ maybe_cancel_timer(JwksRefreshTimer),
{noreply, State#state{configuration_refresh_timer = undefined, jwks_refresh_timer = undefined},
{continue, load_configuration}};
handle_info(jwks_expired, State) ->
That being said, from my own experience I think it is better to use erlang:start_timer instead of timer:send_after, as the former allows matching against the currently active timer reference and avoids creating "parallel" timers such as in this case.
For example, with the current code I think it is still possible to hit a race condition where the timer being cancelled has already fired and the corresponding timeout message is sitting in the mailbox but has not yet been processed.
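For illustration, here is a minimal sketch of that pattern (a hypothetical module, not the actual oidcc worker code): erlang:start_timer/3 delivers {timeout, TimerRef, Msg}, so the reference stored in the state can be matched in handle_info/2 and a stale message from an already-replaced timer is simply dropped.

%% Hypothetical sketch, not part of oidcc.
-module(refresh_timer_sketch).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, handle_continue/2]).

-record(state, {jwks_refresh_timer :: reference() | undefined}).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    {ok, #state{}, {continue, load_jwks}}.

handle_continue(load_jwks, State) ->
    %% ... load the JWKS here, then schedule the next refresh ...
    TimerRef = erlang:start_timer(15000, self(), jwks_expired),
    {noreply, State#state{jwks_refresh_timer = TimerRef}}.

%% The reference matches the currently active timer: refresh.
handle_info({timeout, TimerRef, jwks_expired}, #state{jwks_refresh_timer = TimerRef} = State) ->
    {noreply, State#state{jwks_refresh_timer = undefined}, {continue, load_jwks}};
%% Any other jwks_expired timeout comes from an old, replaced timer: ignore it.
handle_info({timeout, _StaleRef, jwks_expired}, State) ->
    {noreply, State}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.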
Hm, are they setting a strange cache header that leads to continuous refreshing?
Yes, 15 seconds:
{"cache-control",
"public, max-age=15, stale-while-revalidate=15, stale-if-error=86400"},
@asabil Sorry for the long wait.
I can't reproduce the issue. There should only be a single message triggered after the JWKS reload is successful, yet you are seeing hundreds of messages in the queue.
Can you try to log each time handle_info or handle_continue is called, together with the message_queue_len, so that we can figure out what path those messages take and how they accumulate?
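For reference, a helper along these lines (a hypothetical sketch, not necessarily the exact code used) produces the format seen in the logs below when called at the top of handle_info/2 and handle_continue/2:

%% Hypothetical helper: prints a timestamp, the current mailbox length, and
%% the callback/message pair.
log_event(Callback, Msg) ->
    {message_queue_len, QLen} = erlang:process_info(self(), message_queue_len),
    io:format("[ts: ~B, qlen: ~B] ~s(~p)~n",
              [os:system_time(millisecond), QLen, Callback, Msg]).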
Sorry for the delay, finally got to add some logs:
[ts: 1722866476154, qlen: 0] handle_continue(load_configuration)
[ts: 1722866476629, qlen: 0] handle_continue(load_jwks)

[ts: 1722866491632, qlen: 0] handle_info(configuration_expired)
[ts: 1722866491633, qlen: 0] handle_continue(load_configuration)
[ts: 1722866491662, qlen: 0] handle_continue(load_jwks)
[ts: 1722866491811, qlen: 0] handle_info(jwks_expired)
[ts: 1722866491811, qlen: 0] handle_continue(load_jwks)

[ts: 1722866506666, qlen: 0] handle_info(configuration_expired)
[ts: 1722866506667, qlen: 0] handle_continue(load_configuration)
[ts: 1722866506793, qlen: 1] handle_continue(load_jwks)
[ts: 1722866506918, qlen: 1] handle_info(jwks_expired)
[ts: 1722866506918, qlen: 1] handle_continue(load_jwks)
[ts: 1722866506945, qlen: 0] handle_info(jwks_expired)
[ts: 1722866506945, qlen: 0] handle_continue(load_jwks)

[ts: 1722866521798, qlen: 0] handle_info(configuration_expired)
[ts: 1722866521799, qlen: 0] handle_continue(load_configuration)
[ts: 1722866521888, qlen: 0] handle_continue(load_jwks)
[ts: 1722866521918, qlen: 0] handle_info(jwks_expired)
[ts: 1722866521919, qlen: 0] handle_continue(load_jwks)
[ts: 1722866521946, qlen: 0] handle_info(jwks_expired)
[ts: 1722866521946, qlen: 0] handle_continue(load_jwks)
[ts: 1722866522394, qlen: 0] handle_info(jwks_expired)
[ts: 1722866522394, qlen: 0] handle_continue(load_jwks)

[ts: 1722866536889, qlen: 0] handle_info(configuration_expired)
[ts: 1722866536890, qlen: 0] handle_continue(load_configuration)
[ts: 1722866537015, qlen: 2] handle_continue(load_jwks)
[ts: 1722866537050, qlen: 1] handle_info(jwks_expired)
[ts: 1722866537051, qlen: 1] handle_continue(load_jwks)
[ts: 1722866537080, qlen: 0] handle_info(jwks_expired)
[ts: 1722866537080, qlen: 0] handle_continue(load_jwks)
[ts: 1722866537398, qlen: 0] handle_info(jwks_expired)
[ts: 1722866537398, qlen: 0] handle_continue(load_jwks)
[ts: 1722866537510, qlen: 0] handle_info(jwks_expired)
[ts: 1722866537510, qlen: 0] handle_continue(load_jwks)
I added empty lines to group the printouts together. As you can see, we somehow end up with accumulating timers that lead to excessive calls.
The qlen stays at 0 here because it only starts to grow once Auth0 (in my case) begins to rate limit us due to the excessive calls.
I still need to test the patch I posted above more extensively, but it should solve the problem.
With the patch above applied, the logs look like this:
[ts: 1722867041017, qlen: 0] handle_continue(load_configuration)
[ts: 1722867041670, qlen: 0] handle_continue(load_jwks)
[ts: 1722867056676, qlen: 0] handle_info(configuration_expired)
[ts: 1722867056676, qlen: 0] handle_continue(load_configuration)
[ts: 1722867056704, qlen: 0] handle_continue(load_jwks)
[ts: 1722867071707, qlen: 0] handle_info(configuration_expired)
[ts: 1722867071708, qlen: 0] handle_continue(load_configuration)
[ts: 1722867071847, qlen: 0] handle_continue(load_jwks)
[ts: 1722867086849, qlen: 0] handle_info(configuration_expired)
[ts: 1722867086849, qlen: 0] handle_continue(load_configuration)
[ts: 1722867086887, qlen: 0] handle_continue(load_jwks)
@asabil That looks promising. I’ll go ahead and merge the PR then. Thanks for debugging.
Which PR?
Just looked at the PR; that will not work. The fix is the one I posted in the comments above: jwks_refresh_timer is set to undefined in handle_info, which causes the maybe_cancel_timer call in handle_continue to be a no-op.
@asabil Oh, I misinterpreted your message. I believe you've pinpointed the exact issue though: the timer field is cleared in handle_info, so the timer cancellation can never work.
I'll revert and open a new PR to test.
@asabil Does the new PR work for you?
oidcc version
3.2.0
Erlang version
26.2.5
Elixir version
-
Summary
Calls to oidcc_provider_configuration_worker:get_provider_configuration/1 randomly time out. It seems that this is somehow related to httpc timing out? The following crash report is reported by Cowboy:
Current behavior
Inspecting the worker state shows that it is hanging in httpc:handle_answer/3:

Also, it seems like the worker process has accumulated a set of timer events:

Checking the stacktrace:
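For reference, that kind of information can be gathered from the running worker roughly as follows (the registered name my_oidcc_worker is a placeholder; this is not necessarily how the information in this report was collected):

%% Placeholder inspection from a remote shell: grab the gen_server state,
%% the current stacktrace, and any pending mailbox messages (e.g. timer events).
Pid = whereis(my_oidcc_worker),
State = sys:get_state(Pid),
Info = erlang:process_info(Pid, [current_stacktrace, message_queue_len, messages]).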
How to reproduce
Haven't been able to pinpoint the exact scenario leading to the hang/timeout yet.
Expected behavior
It shouldn't hang :)