32-bit instability with high thread counts / std_alloc: Cannot allocate 512 bytes of memory (of type "bpd")

zv-io commented 3 months ago

Describe the bug

On 32-bit systems with high thread counts (N\~120), transient instability in the Elixir test suite surfaces. At even higher thread counts (N\~160), fatal errors are reliably reproduced.

Transient instability at N\~120 includes assertion failures and timeouts, e.g.:

  1) test undefined global no_warn_undefined :all (Module.Types.IntegrationTest)                                                        
     test/elixir/module/types/integration_test.exs:562                                                                                  
     Assertion with == failed                                       
     code:  assert capture_compile_warnings(files) == ""                                                                                                                                                                                                                        
     left:  "    warning: redefining module A (current version defined in memory)\n    │\n  1 │ defmodule A do\n    │ ~~~~~~~~~~~~~~\n    │\n    └─ a.ex:1: A (module)\n\n"                                                                                                     
     right: ""                                                      
     stacktrace:                                                                                                                                                                                                                                                                
       test/elixir/module/types/integration_test.exs:576: (test)

and

==> mix (ex_unit)                                                   
Running ExUnit with seed: 406740, max_cases: 224                                                                                                                                                                                                                                
Excluding tags: [windows: true]                                     

...                                                                    
  1) test runs in daemon mode (Mix.Tasks.ReleaseTest)               
     /usr/src/packages/user/elixir/src/elixir-1.17.2/lib/mix/test/mix/tasks/release_test.exs:716                                        
     ** (ExUnit.TimeoutError) test timed out after 60000ms. You can change the timeout:                                                                                                                                                                                         

       1. per test by setting "@tag timeout: x" (accepts :infinity)                                                                                                                                                                                                                    2. per test module by setting "@moduletag timeout: x" (accepts :infinity)                                                        
       3. globally via "ExUnit.start(timeout: x)" configuration     
       4. by running "mix test --timeout x" which sets timeout      
       5. or by running "mix test --trace" which sets timeout to infinity                                                               
          (useful when using IEx.pry/0)                             

     where "x" is the timeout given as integer in milliseconds (defaults to 60_000).                                                    

     stacktrace:                                                    
       (elixir 1.17.2) lib/process.ex:303: Process.sleep/1          
       test/mix/tasks/release_test.exs:807: Mix.Tasks.ReleaseTest.wait_until/1                                                          
       test/mix/tasks/release_test.exs:743: anonymous fn/1 in Mix.Tasks.ReleaseTest."test runs in daemon mode"/1                        
       (mix 1.17.2) lib/mix/project.ex:463: Mix.Project.in_project/4                                                                    
       (elixir 1.17.2) lib/file.ex:1665: File.cd!/2                 
       test/test_helper.exs:156: MixTest.Case.in_fixture/3          
       test/mix/tasks/release_test.exs:717: (test)                  
       (ex_unit 1.17.2) lib/ex_unit/runner.ex:485: ExUnit.Runner.exec_test/2                                                            
       (stdlib 6.0) timer.erl:590: :timer.tc/2                      
       (ex_unit 1.17.2) lib/ex_unit/runner.ex:407: anonymous fn/6 in ExUnit.Runner.spawn_test_monitor/4                                 

    warning: redefining module ReleaseTest.MixProject (current version defined in memory)                                               
    │                                                               
  1 │ defmodule ReleaseTest.MixProject do                           
    │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                           
    │                                                               
    └─ /usr/src/packages/user/elixir/src/elixir-1.17.2/lib/mix/tmp/Mix.Tasks.ReleaseTest/runs_eval_and_version_commands/mix.exs:1: ReleaseTest.MixProject (module)

however at N\~160, fatal errors are reliably reproduced:

std_alloc: Cannot allocate 512 bytes of memory (of type "bpd")

at N\~192, the OTP test suite shows some transient (not always reproducible) errors:

CWD set to: "/usr/src/packages/user/erlang/src/otp-OTP-27.0/release/tests/test_server/ct_run.test_server@adelie.2024-08-09_22.39.04"

TEST INFO: 5 test(s), 25 case(s) in 5 suite(s)

Testing tests.emulator_test.smoke_test_SUITE: Starting test, 3 test cases
Testing tests.emulator_test.smoke_test_SUITE: TEST COMPLETE, 3 ok, 0 failed of 3 test cases

Testing tests.emulator_test.time_SUITE: Starting test, 18 test cases

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
peer:start_it failed on line 929
Reason: timeout
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Error detected: [timeout,{end_per_testcase,{...}}]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Error detected: [timeout,{end_per_testcase,{...}}]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Testing tests.emulator_test.time_SUITE: *** FAILED test case 12 of 18 ***

We have a couple of systems with high thread counts, namely 20- and 24- core SMT=8 ppc64 systems. We can vary the number of active cores/threads on these machines on demand, from 1-192 threads. For example:

# ppc64_cpu --cores-on=24 && ppc64_cpu --smt=8 && nproc
192

# ppc64_cpu --cores-on=20 && ppc64_cpu --smt=8 && nproc
160

# ppc64_cpu --cores-on=19 && ppc64_cpu --smt=7 && nproc
133

During recent testing in our 32-bit PowerPC environments (this does not affect 64-bit environments), we discovered significant instability (transient test failures) at N\~120 as well as reliably repeatable fatal errors at N\~160, that are mitigated when lower thread configurations are used. Our highest thread x86_64 machine (running an i586 environment) caps out at 72 and is stable.

All tests are run with the exact same binaries and hardware; the only thing that changes is how many cores/threads are active.

Some examples:

Erlang/OTP 27 [erts-15.0] [source] [32-bit] [smp:64:64] [ds:64:64:10] [async-threads:1]

Elixir 1.17.2 (compiled with Erlang/OTP 27)

Erlang/OTP 27 [erts-15.0] [source] [32-bit] [smp:144:96] [ds:144:96:10] [async-threads:1]

Elixir 1.17.2 (compiled with Erlang/OTP 27)

Erlang/OTP 27 [erts-15.0] [source] [32-bit] [smp:168:112] [ds:168:112:10] [async-threads:1]

Elixir 1.17.2 (compiled with Erlang/OTP 27)

Erlang/OTP 27 [erts-15.0] [source] [32-bit] [smp:192:128] [ds:192:128:10] [async-threads:1]

Elixir 1.17.2 (compiled with Erlang/OTP 27)

To Reproduce

Have a 32-bit system with sufficiently many threads (we can provide access to one).
Run the Elixir test suite (or come up with a way to exercise Erlang directly).
Observe the transient warnings and/or fatal errors depending on thread count.

Expected behavior

The workload is handled successfully.

Affected versions

OTP 27.

Additional context

https://github.com/elixir-lang/elixir/issues/13774 (reported to Elixir, as this is where we observed the issue)
https://git.adelielinux.org/adelie/packages/-/issues/1194 (downstream issue)

Additional remarks

As N increases, it seems (anecdotally) that the performance of the Erlang/OTP and Elixir test suites become slower. We have not collected experimental data on this.
512 bytes by 128 threads would approximate a 16-bit memory ceiling of sorts. This is just a theory.

jhogberg commented 3 months ago

Thanks for your report! These symptoms are consistent with running out of virtual memory space. Scheduler threads eat up a considerable amount of that (double the number to account for their dirty sibling), and many test cases also scale according to the number of schedulers.

If the symptoms lessen if you scale down the number of dirty schedulers (+SDcpu 1 off the top of my head), I’d chalk it up to running out of virtual memory.

josevalim commented 3 months ago

I wonder if it only happens on Elixir test suite because Elixir runs tests in parallel by default and it uses the number of cores multiplied by 2. So it will try to run 240 tests at once. This will reveal race conditions in the suite that didn't show up otherwise (like two tests trying to define the same module, which is a bug in the test suite). But this will also be problematic for tests that shell out. Some of our tests spawns new Erlang VM processes too, which would explain running out of virtual memory (especially if you are spawning dozens of then in parallel). @zv-io you can try TRACE=1 make test. That should disable concurrent testing. The module conflict I will fix on Elixir.

zv-io commented 3 months ago

@josevalim I amend my previous statement about Elixir passing tests. We had been running a smaller subset of tests for convenience, and these do show errors at higher thread counts but weren't aborting the test script on failure:

export ERL_TOP=$builddir

make release_tests

for _header in erl_fixed_size_int_types.h \
${CHOST}/erl_int_sizes_config.h; do
cp erts/include/$_header erts/emulator/beam/
done
cd release/tests/test_server
$ERL_TOP/bin/erl -s ts install -s ts smoke_test batch -s init stop

With TRACE=1 make test (N=192 threads enabled on the machine, but no -j provided to make) the visible output appears to be single-threaded, however all 192 threads are busy doing something ("running") -- see how high the load is:

adelie # uptime
 22:44:38 up  2:38,  2 users,  load average: 824.14, 792.87, 767.28

Is this the expected behavior on such a system?

Here are the full logs from TRACE=1 make test, all the same hardware, same binaries, etc.:

N=192: erlang-ppc-musl-192t.log -- I killed it after 6+ hours when it reached Testing tests.emulator_test: *** FAILED test case 1710 of 2194 ***, and I had to killall -9 epmd; killall -9 beam.smp as those processes kept going...
N=96: erlang-ppc-musl-96t.zip -- I killed it after 6+ hours, and the log file is much larger with more/different errors.
N=64: (pending, will update shortly)

I am happy to provide access to this hardware (a spare test machine set up to debug this issue) if that would be helpful.

josevalim commented 3 months ago

@zv-io can you reach me by email on my GitHub profile? Access to the machine would help find out what is going on. Although I will probably need help setting it all up. Thank you!

michalmuskala commented 3 months ago

High CPU usage on systems with many schedulers is an issue we've encountered and are actively investigating - there seem to be some inefficiencies in the scheduling and task stealing algorithms that become visible with 100+ schedulers. This is on 64bit systems, but I would imagine some of that would translate to 32bit systems as well. That said, we don't see any hanging or crashes - just elevated CPU usage.

cc @RobinMorisset

rickard-green commented 3 months ago

As @jhogberg already said. You are very likely running out of virtual address space.

When I'm starting a 32-bit runtime system (which does nothing besides starting) with 160 schedulers and 160 dirty CPU schedulers on my linux machine, it will map about 750 MB of virtual memory which is quite a lot of the maximum amount of 4 GB available to a 32-bit VM (if using 8 schedulers and 8 dirty CPU schedulers it will map about 100 MB of virtual memory). The high amount of virtual memory mapped comes both from thread stacks (a normal scheduler will reserve about 1/2 MB by default), but also from the memory mappings made by ERTS internal scheduler specific allocator instances. The ERTS internal allocators can be disabled, but that will likely cost you much when it comes to performance/scalability and will perhaps not even reduce memory usage if you truly are using all 160 schedulers.

When using this many schedulers, you typically don't want to use the 32-bit runtime system, but instead use the 64-bit runtime system which does not suffer from the very limited amount of virtual address space available to the 32-bit runtime system. It is possible to shrink stack sizes on schedulers and other threads, and disable ERTS internal allocators to get the system to survive a bit longer, but I'd say that you are just shooting yourself in the foot. I see only two real options; either use the 64-bit runtime system, or reduce the memory usage of your application.

When looking at the erl_crash-calls.dump in the https://git.adelielinux.org/adelie/packages/-/issues/1194 issue that you pointed to, one can see that the runtime system has allocated this much memory:

total: 3380754648
processes: 10167892
processes_used: 10166320
system: 3370586756
atom: 1056953
atom_used: 1039189
binary: 339784
code: 13783237
ets: 620248

I'd say that it is more or less amazing that it was able to allocate and use 3.15 GB of memory with 160 schedulers and 160 dirty CPU schedulers in a 32-bit runtime system. Note that the 3.15 GB are dynamically allocated memory provided by the allocators, so they will need to map even more address space than that in order to be able to provide those allocations. This also do not include stacks of threads, dynamically linked libraries, and code of the runtime system.

jhogberg commented 3 months ago

While quickly hitting the 4GB barrier with this many schedulers is hardly unusual, I find it a bit odd that system allocations dwarf the rest to such a degree. I think it's worth investigating further.

zv-io commented 3 months ago

@jhogberg thanks for your support here; do you need access to a test machine?

jhogberg commented 3 months ago

That would be much appreciated, however, I most likely won't be able to look at this until mid-September due to other commitments. :-|

zv-io commented 5 days ago

That would be much appreciated, however, I most likely won't be able to look at this until mid-September due to other commitments. :-|

@jhogberg could I poke you about this please? Not urgent but if you do have some spare time to look at this we'd be grateful.

jhogberg commented 4 days ago

Sure, please send the details to my e-mail, you'll find it in the commit log.

erlang / otp

32-bit instability with high thread counts / std_alloc: Cannot allocate 512 bytes of memory (of type "bpd") #8711