erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0
11.33k stars 2.94k forks source link

Crash in code_server.erl (badmatch: error) #8510

Open robashton opened 4 months ago

robashton commented 4 months ago

Describe the bug In certain scenarios (Our massive Purescript codebase compiling into Erlang), possibly pertaining to short lived processes coupled with modules with larger than normal load times but I'm not sure, we get a bad match thrown inside code_server. The place where this reproduces most is in our Cowboy use, if we make some small requests just after start-up, we get the error with the module cowboy_static@ps (which nearly everything references). The file itself is nothing remarkable and I don't think is related to the issue.

If I preload the module with a manual load this does not happen.

I think this relates to the changes made in #7503 and #6736

To Reproduce I can't easily build an isolated example

Expected behavior The code should load and not crash code_server

Affected versions 26+

Additional context This bit I can help with, I've pulled down latest and built, the problem exists on line 1184 of code_server.erl, we do a maps:take on the loading collection and the module doesn't exist in there.

A trace of the messages to and from the code server pid reveals the following order of events

4672: <9432.50.0> send: {code_server,{{binary_data,3620},
              "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/media-shell_deps/norsk/e
bin/cowboy_static@ps.beam",
              #Ref<0.167300051.2764308486.137148>}} -> <0.1045.0>
4672: <9432.50.0> 'receive': {code_call,<0.1045.0>,
           {load_module,#Ref<0.167300051.2764439556.148827>,cowboy_static@ps,
                        "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/media-shell_de
ps/norsk/ebin/cowboy_static@ps.beam",
                        false,#Ref<0.167300051.2764308486.137148>}}
4673: <9432.50.0> 'receive': {{'LOADER_DOWN',{cowboy_static@ps,{binary_data,3620},
                                  "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/medi
a-shell_deps/norsk/ebin/cowboy_static@ps.beam"}},
 #Ref<0.167300051.2764308486.137148>,process,<0.1045.0>,shutdown}
4673: <9432.50.0> spawn: <0.1047.0>, {erlang,apply,
                    [{"fun code_server:-handle_on_load/5-fun-0-/0",
                      [cowboy_static@ps]},
                     []]}
4673: <9432.50.0> send_to_non_existing_process: {code_server,{{binary_data,3620},
              "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/media-shell_deps/norsk/e
bin/cowboy_static@ps.beam",
              #Ref<0.167300051.2764308481.176919>}}, <0.1046.0>
4673: <9432.50.0> 'receive': {{'LOADER_DOWN',{cowboy_static@ps,{binary_data,3620},
                                  "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/medi
a-shell_deps/norsk/ebin/cowboy_static@ps.beam"}},
 #Ref<0.167300051.2764308481.176919>,process,<0.1046.0>,noproc}
4673: <9432.50.0> 'receive': {'DOWN',#Ref<0.167300051.2764308481.176918>,process,<0.1047.0>,ok}
4673: <9432.50.0> send: {finish_after_on_load,{cowboy_static@ps,true},
                      <0.50.0>,#Ref<0.167300051.2764308481.176922>} -> erts_code_purger
4675: <9432.50.0> 'receive': {reply,finish_after_on_load,ok,#Ref<0.167300051.2764308481.176922>}
4675: <9432.50.0> 'receive': timeout

Essentially we get some LOADER_DOWN message, remove the module from the map, and then try to read the module from the map.

Sticking a case statement around the maps:take "fixes" this, but I don't know if that is what is needed, or if it's just a band-aid - not fully understanding the end to end flow of this code and not having sat down with pen and paper to work it out yet. If this is obvious to somebody else then I'm happy for then to make the fix, or tell me what the fix is and I'll do it.

Cheers,

Rob

josevalim commented 4 months ago

I believe I can see why this bug happens.

  1. We store the module as loading here:

https://github.com/erlang/otp/blob/9f778f549262dc2d2a08d06714de5b38745e914f/lib/kernel/src/code_server.erl#L353

  1. Then, to reply success, we schedule it here (I say schedule because it is postponed by handle_pending_on_load):

https://github.com/erlang/otp/blob/9f778f549262dc2d2a08d06714de5b38745e914f/lib/kernel/src/code_server.erl#L1102-L1110

  1. However, if we receive a loader_down, we immediately delete it:

https://github.com/erlang/otp/blob/9f778f549262dc2d2a08d06714de5b38745e914f/lib/kernel/src/code_server.erl#L186

Then, when we process the pending on load queue, the loader down message has already been processed.

I believe the fix is to put both lines below around a handle_pending_on_load:

https://github.com/erlang/otp/blob/9f778f549262dc2d2a08d06714de5b38745e914f/lib/kernel/src/code_server.erl#L186 https://github.com/erlang/otp/blob/9f778f549262dc2d2a08d06714de5b38745e914f/lib/kernel/src/code_server.erl#L317

Otherwise we may process those events out of order. cc @michalmuskala

josevalim commented 4 months ago

Actually, we pretty much have two different locking mechanisms for loading a module, one for loading on the client, the other one is for handling on_load callbacks. We should unify them into one.

josevalim commented 4 months ago

I am looking into this still but I am busy with the upcoming Elixir v1.17 release, sorry for the delay.

@robashton, a temporary solution would most likely be to preload any modules with -on_load callbacks in them. For example, you can do it in your application:start/2 callback.

robashton commented 4 months ago

We are just running a forked OTP for now that pretends the error doesn't exist, we aren't blocked so no worries!

axelson commented 3 weeks ago

Are there any workarounds for this in the meantime? It seems like we'd have to go back pretty far to avoid the changes from https://github.com/erlang/otp/pull/7503 and https://github.com/erlang/otp/pull/6736

josevalim commented 3 weeks ago

This happens on modules with on_load callbacks on them, so you can try preloading them before (for example, in your helper). We also haven't hard reports of this happening earlier than OTP 27. More concurrency was added on 27, which would explain why we suddenly started seeing it.

axelson commented 2 weeks ago

Hmm, we're currently using OTP 26 so I guess you have a report of it now :smile:

I did find some modules in our dependencies that have on_load callbacks (via cd deps; rg on_load), I've now added loading of those at the bottom of our test_helper.exs:

Code.ensure_loaded(JsonXema)
Code.ensure_loaded(Bcrypt.Base)
Code.ensure_loaded(Jaxon.Parsers.NifParser)

I've also removed some usages of Patch which loads modules at runtime and now the frequency appears to have gone way down (although it is still happening occasionally).

josevalim commented 2 weeks ago

Good to know! If Patch is relying on_load callbacks, then that's likely the root cause. I'd recommend opening up an issue, because, as far as I know, on_load callback is not recommended to be used for anything else besides loading NIFs.