Random test failure / race condition (?) when running a lot of tests on CI

hubertlepicki commented 3 years ago

I have a large project that uses Mox for some light mocking here and there, and the test suites cases (it's an umbrella project) are setting up some stubs in setup block to be globally available in all tests. For example:

defmodule MyApp.TestCase do                                                                                                                                       
  use ExUnit.CaseTemplate                                                                                                                                      

  using do                                                                                                                                                     
    quote do                                                                                                                                                   
      ...                                                                                                                               
      import Mox                                                                                                                                               

      setup :set_mox_from_context                                                                                                                              
      setup :verify_on_exit!                                                                                                                                   
    end                                                                                                                                                        
  end                                                                                                                                                          

  setup tags do                                                                                                                                                
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(DB.Repo)                                                                                                          

    unless tags[:async] do                                                                                                                                     
      Ecto.Adapters.SQL.Sandbox.mode(DB.Repo, {:shared, self()})                                                                                               
    end                                                                                                                                                        

    Mox.stub_with(Analytics.TestImpl, Analytics.NoOpImpl)                                                                                                          

    :ok                                                                                                                                                        
  end                                                                                                                                                          
end

Each app in umbrella project has at least one of such files that set up ExUnit tests, often more, but they are all identical when it comes to setting up Mox: we :set_mox_from_context, :verify_on_exit! and then stub_with some modules.

The thing is, I am getting random test failures like this one, only on CI (GH Actions), never locally:

  1) test does test something here
Error:      apps/myapp/test/something/something_test.exs:1239
     ** (ArgumentError) cannot add expectations/stubs to Analytics.TestImpl in the current process (#PID<0.8021.0>) because Mox is in global mode and the global process is #PID<0.7992.0>. Only the process that set Mox to global can set expectations/stubs in global mode

     stacktrace:
       (mox 1.0.0) lib/mox.ex:590: Mox.add_expectation!/4
       (mox 1.0.0) lib/mox.ex:502: Mox.stub/3
       (mox 1.0.0) lib/mox.ex:553: anonymous fn/4 in Mox.stub_with/2
       (elixir 1.11.2) lib/enum.ex:2181: Enum."-reduce/3-lists^foldl/2-0-"/3
       (mox 1.0.0) lib/mox.ex:550: Mox.stub_with/2
       (emails 0.1.0) test/support/test_case.ex:24: MyApp.TestCase.__ex_unit_setup_0/1
       (emails 0.1.0) test/support/test_case.ex:1: MyApp.TestCase.__ex_unit__/2
       test/something/something_test.exs:1: MyApp.SomethingTest.__ex_unit__/2

The test failures seem random but always happen with tests that have async: false.

Out of 900 tests, I get like one or two failures with the above, maybe once every 10 times the suite runs on CI. I can't seem to reproduce it locally.

What I suspect is happening is some sort of race condition in Mox, where Mox.Server believes it's global_owner_pid still points to the previous test PID.

But this is weird as set_mox_from_context is added as a setup block, that I think should execute before the block that adds stubs, and in the current process.

Elixir 1.11.2 Erlang 23.1 Mox 1.0.0

Any ideas what may be wrong? I don't see anything wrong with Mox itself although I think it's likely there something is not quite right.

josevalim commented 3 years ago

I would need a mechanism to reproduce it so I can investigate. I know it may not be easy though. Something you can try is to run the suite with altered timings: elixir --erl "+T 9" -S mix test.

hubertlepicki commented 3 years ago

Alright @josevalim let me try this. Currently I can't even seem to reproduce it locally, only happens on CI which is obviously slower/less cores, that may be related issue too.

I didn't know about this flag either, thanks for the tip.

hubertlepicki commented 3 years ago

@josevalim I gave up for now, and just re-implemented the thing so we don't actually need to mock the module in every single test, and instead do dependency injection with a process that stores events on it's state, and match against that rather than use Mox.

I still think there mayb be some sort of race condition in Mox / ExUnit combo but I have no ways to even reproduce it myself.

Maybe I'll come back to this problem as a weekend project at some point but also maybe not. It's a dreadful task, but may be bothering me. Closing for now and thanks for your help

23Skidoo commented 3 years ago

This issue still exists, we're bumping into it in our project.

josevalim commented 3 years ago

We still need a way to reproduce it though so we can take a look. :)

schrockwell commented 3 years ago

I am also unable to reliably reproduce this issue, it mostly shows up in GitHub Actions CI. But here is my theory.

In our tests, we use Mox.stub_with/2 in the setup block of every test, meaning that mock expectations are defined very early-on in the test lifecycle. The tests are not asynchronous, and some of them put Mox into global mode.

Test A calls Mox.set_mox_global/1, which flags that test process as the owner of Mox global mode. The singleton Mox.Server process is now monitoring this test process via Process.monitor, so that it can later handle a :DOWN message, which will set Mox back to private mode when that test process ends.
Test A completes and its process ends, but for some reason the :DOWN message to Mox.Server is delayed, which means that Mox is currently still in global mode.
Test B begins, and it is expecting Mox to be in private mode. It attempts to set up a stub with Mox.stub_with/2, but Mox.Server is still in global mode with the Test A's process as the owner, so it explodes.

The hacky(?) fix

Call Mox.set_mox_private/1 in the very beginning of the setup block of the base test case, to ensure that the state of Mox.Server is settled before trying to set any expectations.

josevalim commented 3 years ago

@schrockwell I think you got jackpot. I believe there is no guarantee all DOWN messages are delivered at the same time. I have pushed a fix based on your assumption, please try master out!

Xunjin commented 3 years ago

I am also unable to reliably reproduce this issue, it mostly shows up in GitHub Actions CI. But here is my theory.

In our tests, we use Mox.stub_with/2 in the setup block of every test, meaning that mock expectations are defined very early-on in the test lifecycle. The tests are not asynchronous, and some of them put Mox into global mode.
1. Test A calls `Mox.set_mox_global/1`, which flags that test process as the owner of Mox global mode. The singleton `Mox.Server` process is now monitoring this test process via `Process.monitor`, so that it can later handle a `:DOWN` message, which will set Mox back to private mode when that test process ends.

2. Test A completes and its process ends, but for some reason the `:DOWN` message to `Mox.Server` is delayed, which means that Mox is currently still in global mode.

3. Test B begins, and it is expecting Mox to be in private mode. It attempts to set up a stub with `Mox.stub_with/2`, but `Mox.Server` is still in global mode with the Test A's process as the owner, so it explodes.
The hacky(?) fix

Call Mox.set_mox_private/1 in the very beginning of the setup block of the base test case, to ensure that the state of Mox.Server is settled before trying to set any expectations.

Thank you @schrockwell your "hacky" approach literally solved our problem in a macro which calls stubs_with to every modules in our CI tests

@josevalim I tried to replicate the problems as you suggested elixir --erl "+T 9" -S mix test but couldn't reproduce. However when I stressed my CPU enough (using stress --cpu 16 I have a 8 cores CPU with 16 threads) which before running them makes the CPU really high (mine was about 100% :sweat_smile:) it did happen. Being speculative here, depending on how the container is provisioned in GH Actions, the CPU constraint can happen and affect async testing. That's why @schrockwell approach solved our problem.

Well, being on point, I will try your commit in master seeing whether this solves the problem. Tho maybe our approach on how to define this macro for testing is wrong. Sorry starting in Elixir this week and have a lot to learn :sweat_smile:

23Skidoo commented 3 years ago

We've been using the HEAD version of Mox for a week or so and haven't seen this error since the upgrade. Thank you @schrockwell and @josevalim!

josevalim commented 3 years ago

v1.0.1 released.

dashbitco / mox

Random test failure / race condition (?) when running a lot of tests on CI #107

The hacky(?) fix

The hacky(?) fix