elixir-lang / elixir

Elixir is a dynamic, functional language for building scalable and maintainable applications
https://elixir-lang.org/
Apache License 2.0
24.37k stars 3.36k forks source link

Elixir 1.2: Logger crash when using `inspect` in production with protocols consolidated #4518

Closed SimonWoolf closed 8 years ago

SimonWoolf commented 8 years ago

Wasn't sure whether to report this as not certain it's an elixir/logger bug rather than something I'm doing wrong (tried slack and SO first), but on I figure logger crashing seems likely to be a bug that I should report even if I am doing something odd. If it isn't a bug, many apologies.

So: the first inspect call that happens in production causes a logger crash, claiming that eg Elixir.Inspect.List.inspect is undefined:

=INFO REPORT==== 13-Apr-2016::09:26:01 ===
    application: logger
    exited: shutdown
    type: permanent

=ERROR REPORT==== 13-Apr-2016::09:26:01 ===
Error in process <0.438.0> with exit value:
{[{reason,undef},
  {mfa,{'Elixir.AppName.Web.CatchAll',handle,2}},
  {stacktrace,
      [{'Elixir.Inspect.List',inspect,
           [[],
            #{'__struct__' => 'Elixir.Inspect.Opts',
              base => decimal,
              binaries => infer,
              char_lists => infer,
              limit => 50,
              pretty => false,
              safe => true,
              structs => true,
              width => 80}],
           []},
       {'Elixir.Kernel',inspect,2,[{file,"lib/kernel.ex"},{line,1566}]},
       {'Elixir.AppName.Web.CatchAll',handle,2,
           [{file,"lib/web/catch_all.ex"},{line,11}]},
       {cowboy_handler,handler_handle,4,
           [{file,"src/cowboy_handler.erl"},{line,111}]},
       {cowboy_protocol,execute,4,
           [{file,"src/cowboy_protocol.erl"},{line,442}]}]},
  {req,
      [{socket,#Port<0.8619>},
       {transport,ranch_tcp},
       {connection,keepalive},
       {pid,<0.438.0>},
       {method,<<"GET">>},
       {version,'HTTP/1.1'},
       {peer,{{172,17,0,1},40454}},
       {host,<<"172.17.0.3">>},
       {host_info,undefined},
       {port,6100},
       {path,<<"/badpagr">>},
       {path_info,[<<"badpagr">>]},
       {qs,<<>>},
       {qs_vals,undefined},
       {bindings,[]},
       {headers,
           [{<<"user-agent">>,<<"curl/7.35.0">>},
            {<<"host">>,<<"172.17.0.3:6100">>},
            {<<"accept">>,<<"*/*">>}]},
       {p_headers,[]},
       {cookies,undefined},
       {meta,[]},
       {body_state,waiting},
       {buffer,<<>>},
       {multipart,undefined},
       {resp_compress,false},
       {resp_state,waiting},
       {resp_headers,[]},
       {resp_body,<<>>},
       {onresponse,undefined}]},
  {state,nil}],
 [{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]}

The code that triggers it is just

Logger.warn "Unknown path: #{:cowboy_req.path(request) |> elem(0)},
  params: #{:cowboy_req.qs_vals(request) |> elem(0) |> inspect}"

Here, :cowboy_req.qs_vals(request) |> elem(0) is just [], so the line that crashes is equivalent to inspect([]).

Elixir version: 1.2.2 Erlang version: 18.2.1 Presumably this is related to Elixir 1.2 consolidating the inspect protocol by default? I'm using build_embedded: Mix.env == :prod, so protocols are being consolidated. (and eg _build/prod/consolidated/Elixir.Inspect.beam is present in the slug).

We're using the heroku elixir buildpack to make slugs, which are then run using docker.

The weird thing is, this crash only happens in production. The same app running locally -- even when using MIX_ENV=prod, running from the same slug that runs in production -- does not have any problems.

Let me know what other details I can provide to help.

Thanks in advance :)

josevalim commented 8 years ago

It is failing because it thinks the function Elixir.Inspect.List.inspect/2 is not available. I would look for the following things:

  1. See if Elixir.Inspect.List is present inside Elixir in the slug (or check if it is available from the console)
  2. See if you are not accidentally redefining it
  3. Maybe it is a failed build issue. Try pushing to heroku with the always_rebuild configuration set to true. You can find it here: https://github.com/HashNuke/heroku-buildpack-elixir
SimonWoolf commented 8 years ago

Hi Jose! Thanks for the quick reply :)

See if Elixir.Inspect.List is present inside Elixir in the slug (or check if it is available from the console)

I'm a little confused as to how it can not be, but you're right, it isn't. IO.puts :erlang.function_exported(Elixir.Inspect.List, :inspect, 2) from the app prints true when running the app locally, but false in production.

Not sure how else to check if it's 'present inside Elixir'? Grepping the _build directory for Elixir.Inspect.List produces "Binary file Elixir.Inspect.beam matches", if that helps.

See if you are not accidentally redefining it

I'm not doing anything that redefines any Elixir builtins, so unless I'm missing something, I'm afraid I don't see how I could be. (And as I said, inspect works fine when running the app locally, even with compiling and running with MIX_ENV=prod)

Maybe it is a failed build issue.

I've added always_rebuild=true to the elixir_buildpack.config, but the issue remains. (to be clear I'm not actually using heroku itself, just using the buildpack to build a slug, which is then run with mix run --no-halt by docker).

josevalim commented 8 years ago

I'm a little confused as to how it can not be, but you're right, it isn't. IO.puts :erlang.function_exported(Elixir.Inspect.List, :inspect, 2) from the app prints true when running the app locally, but false in production.

This seems to be the root cause. What does Code.ensure_loaded?(Elixir.Inspect.List) return?

Not sure how else to check if it's 'present inside Elixir'?

Elixir.Inspect.List is provided as part of Elixir. I am assuming somewhere in your slug, the Elixir project was downloaded and compiled. I am assuming that Elixir.Inspect.List.beam is either missing from the elixir project (not _build) in your slug or it somehow has a bogus version.

SimonWoolf commented 8 years ago

This seems to be the root cause. What does Code.ensure_loaded?(Elixir.Inspect.List) return?

After some experimentation -- it seems to be random :(

Trying the non-boolean version: I obviously can't IO.inspect the error tuple returned by Code.ensure_loaded if it fails, but since all elements are atoms I can extract and IO.puts them, and: the error returned is {:error, :nofile}.

To be clear, this isn't just nondeterminism between builds, it's between deploys of a single build. I.e. I can deploy the exact same slug (same SHA) several times, and sometimes ensure_loaded will fail with :nofile, and sometimes it'll work.

josevalim commented 8 years ago

That's quite surprising.

I can deploy the exact same slug (same SHA) several times

When you mention "deploy" the same slug, does it compile the code every time you deploy or you are referring to the exact same compiled artifacts? Would you be able to retrieve the .beam file for cases where it works and where it fails and upload somewhere so we can analyze them?

Thank you for following up.

SimonWoolf commented 8 years ago

When you mention "deploy" the same slug, does it compile the code every time you deploy or you are referring to the exact same compiled artifacts? Would you be able to retrieve the .beam file for cases where it works and where it fails and upload somewhere so we can analyze them?

Sorry for ambiguity - I mean literally the exact same compiled artifacts. heroku-elixir-buildpack spits out a tarball with everything compiled, that's what I'm referring to by the 'slug'. 'deploying the slug' is just starting a docker container with the slug extracted to /app, and running this script (passing in mix run --no-halt --no-deps-check).

I've put the beam file from a slug up here in case it helps. Since it does sometimes work, I'm guessing everything is in there that should be else it couldn't ever work. :/

Thank you for following up.

No, thank you for helping! I realise this must be frustrating to debug remotely.

SimonWoolf commented 8 years ago

After some more experimentation, turns out this bug only ever happens if the very first call to inspect for some datatype is done from a cowboy handler process.

In other words, the issue completely disappears if I do something like Logger.debug "Some data structures: #{inspect {%{}, [], MapSet.new, 0.1, 1..2, ~r//, :a, <<1>>}}" from one of my apps' application callback functions (on start).

(I've also made sure it wasn't an issue with one of my other deps - I can reproduce the issue with a dummy app with no external deps except cowboy (1.0.0), and no other code except an app callback that starts cowboy and one http handler that does an IO.inspect).

The workaround solves the pressing issue for us, but of course I'm happy to keep trying any experiments you want me to to track the underlying bug down.

josevalim commented 8 years ago

This is really helpful. It seems to be related to Erlang's code loading then. When building releases, we should not have such problems, as all modules are preloaded upfront. Great job and thank you!

josevalim commented 8 years ago

I will leave this open for now if I get a chance to investigate it later, specially to find out why it happens only with inspect.

SimonWoolf commented 8 years ago

We think we know why this was happening now, and it was our fault. Not a bug in elixir or erlang. Sorry for the invalid bug report :(

If you're interested:

Since the things in /tmp are overlayfs upper directories, this cleanup script didn't affect the underlying slug, which was still correct. But it did mark files as 'deleted' in overlayfs, so they were hidden from the docker container.

Result: random sadness, depending on a combination of when each .beam file was last modified, and which files had already been loaded by the erlang vm by the time the cleanup script ran. (So I guess my workaround only worked because it meant those files were loaded early on - and presumably most other protocols were either exercised early on or else not at all, so the problem only showed up with inspect).

We've now moved the overlayfs upperdirs out of /tmp, and everything is working fine. (Why this hasn't shown up as an issue for us before, who knows..)

Thank you for your help earlier, and apologies again!

josevalim commented 8 years ago

I am glad this was fixed! :heart:

doublej74 commented 2 years ago

Has this problem be reintroduced in 1.13.0 ? My application crashes on inspect in production