
Erlang/OTP
http://erlang.org
Apache License 2.0

Performance: binary_to_term slower on OTP 27 #8329

Open ThomasArts opened 3 months ago

ThomasArts commented 3 months ago

Describe the bug

It seems that binary_to_term got slower.

To Reproduce

I used the following simple way to generate a rather large term and then N copies of that term as a binary in a list. On each of the copies we run binary_to_term. The expectation is that it is as fast as on OTP 26.1.2... but it isn't.

On OTP26, roughly:

Erlang/OTP 26 [erts-14.2.1] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Eshell V14.2.1 (press Ctrl+G to abort, type help(). for help)
3> [ppp:test(100) || _ <- lists:seq(1, 10)], ok.
Took 108.98 ms
Took 80.54 ms
Took 80.28 ms
Took 80.18 ms
...
ok
4> [ppp:test(1000) || _ <- lists:seq(1, 10)], ok.
Took 921.97 ms
Took 908.09 ms
Took 907.46 ms
...
ok
5> [ppp:test(10000) || _ <- lists:seq(1, 10)], ok.
Took 12371.10 ms
Took 12916.74 ms
Took 12741.55 ms
...

But on the latest master (source-8504d0e0b8), the 10k test gets much slower!

Erlang/OTP 27 [RELEASE CANDIDATE 2] [erts-14.2.3] [source-8504d0e0b8] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Eshell V14.2.3 (press Ctrl+G to abort, type help(). for help)
1> [ppp:test(100) || _ <- lists:seq(1, 10)], ok.
Took 104.34 ms
Took 96.72 ms
Took 97.04 ms
...
ok
2> [ppp:test(1000) || _ <- lists:seq(1, 10)], ok.
Took 1083.05 ms
Took 1079.13 ms
Took 1075.11 ms
...
ok
3> [ppp:test(10000) || _ <- lists:seq(1, 10)], ok.
Took 14786.23 ms
Took 14768.29 ms
Took 15070.95 ms
...

Test program

-module(ppp).
-compile([export_all, nowarn_export_all]).

test(N) ->
  erlang:garbage_collect(),
  Terms = terms(10, 3),
  %% N copies of the same binary
  Bins = lists:duplicate(N, term_to_binary(Terms)),
  {T, _} = timer:tc(fun() -> b2t(Bins) end),
  io:format("Took ~.2f ms\n", [T / 1000]).

terms(L, 0) ->
  [ {I, crypto:strong_rand_bytes(8)} || I <- lists:seq(1, L)];
terms(L, Depth) ->
  [ terms(L, Depth-1) || _ <- lists:seq(1, L) ].

b2t([]) ->
  ok;
b2t([B | Bs]) ->
  T = binary_to_term(B),
  [T | b2t(Bs)].

Expected behavior

Equal speeds are expected.

Affected versions

This seems to have been introduced in commit 24ef4cbaed for OTP 27. The last commit with faster times is 49024e83a2.

Additional context

This was observed when testing Riak and may be related to issue https://github.com/erlang/otp/issues/8229.

jhogberg commented 3 months ago

Thanks, I have a rough idea of what might have caused this. I'll have a deeper look at this later in the week. :-)

jhogberg commented 3 months ago

I can reproduce a slight slowdown with https://github.com/erlang/otp/commit/24ef4cbaeda9b9c26682cba75f2f15b0c58722aa, but I'm really puzzled to see it on https://github.com/erlang/otp/commit/8504d0e0b84f57950c94bf6244a9699102893b7d as https://github.com/erlang/otp/commit/3fcd74cdbb094de048623b0da5a3aac0e0b7335c fixes it completely for me.

What platform is this? Have you turned off all kinds of CPU frequency scaling?

jhogberg commented 3 months ago

I've opened #8347 that should improve things a bit, I'm seeing slightly better results locally but I'm still a bit puzzled as to why it was so much slower on your machine.

mkuratczyk commented 3 months ago

I've run Thomas's test on my machines:

Linux, i9-10900X CPU @ 3.70GHz:

| N     | 26.2.3 | master | pr8347 |
|------:|-------:|-------:|-------:|
| 100   | 139    | 143    | 123    |
| 1000  | 1580   | 1615   | 1780   |
| 10000 | 18500  | 18160  | 17250  |
M1 Pro, macOS 14.4.1:

| N     | 26.2.2 | master | pr8347 |
|------:|-------:|-------:|-------:|
| 100   | 170    | 174    | 173    |
| 1000  | 2240   | 2330   | 2340   |
| 10000 | 29400  | 30500  | 30400  |
ThomasArts commented 3 months ago

Just to be sure, I ran the test again on a Mac M2 Pro (Ventura 13.6), with results comparable to the earlier run:

M2 Pro, macOS 13.6 (32 GB):

| N     | 26.2.1 | 26.2.2 | master (8504d0e0b8) | pr8347 (436568a31bf) |
|------:|-------:|-------:|--------------------:|---------------------:|
| 100   | 80     | 81     | 95                  | 95                   |
| 1000  | 920    | 930    | 1050                | 1060                 |
| 10000 | 13400  | 13400  | 15700               | 15200                |

I'll try to build pr8347 as well, to see if I can relate to the above numbers. It might, of course, be something M2-specific. I built from source.

mkuratczyk commented 3 months ago

This is tangential to the main topic here, but how come your M2 Pro is twice as fast as my M1 Pro @ThomasArts? :) According to generic benchmarks, there should be no more than a 10-20% performance difference between these chips. After I saw your results, I compiled Erlang without extra microstate accounting and without DTrace support, to avoid any overhead from these features. I also made sure to test without using the shell (erlperf -pa . 'ppp:test(100).'), but I still get ~170 ms results. Do you do anything special when running your tests? What flags have you compiled Erlang with?

I don't see you on the Erlanger slack. If you don't mind joining and chatting there for a bit, I'd appreciate that. Hopefully we can all learn something about Erlang/OTP performance on Apple Silicon (or in general, based on what we find).

bjorng commented 3 months ago

This is tangential to the main topic here, but how come your M2 Pro is twice as fast as my M1 Pro

Have you compiled all Erlang code you are running with the Erlang compiler in Erlang/OTP 27?

The format of the type information in BEAM files is different in OTP 26 and OTP 27. If the format of the type information is not the expected one, the JIT cannot do any type-guided optimizations and will potentially emit worse code.

This is just a wild guess.
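One quick way to rule this guess in or out is to inspect which compiler version produced a given .beam file via its compile_info chunk. A minimal sketch, assuming the ppp module from this thread has been compiled to "ppp.beam" in the current directory:

```erlang
%% Sketch: read the compile_info chunk of a .beam file to see which
%% compiler version produced it. The path "ppp.beam" is an assumption.
{ok, {ppp, [{compile_info, Info}]}} =
    beam_lib:chunks("ppp.beam", [compile_info]),
io:format("compiled with compiler version: ~s~n",
          [proplists:get_value(version, Info)]).
```

If the version printed does not match the compiler shipped with the OTP release you run the benchmark on, recompiling the module would be the first thing to try.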

mkuratczyk commented 3 months ago

Yes, I compile the code with the version I use at runtime. I built from master today, compiled ppp.erl with that version, ran it with erlperf as above, and my times are ~170 ms, which is roughly twice what Thomas gets.

mkuratczyk commented 3 months ago

Oh, I found the culprit. When debugging a different issue recently, I had set CFLAGS to -O0. After setting it back to -O3, I get ~87 ms.

Just as an anecdote: the issue I was debugging turned out to be a problem with running a Docker image built for x86 on an ARM MacBook. Thanks to all the Docker/Rosetta 2 magic, that generally works, but recently we made some changes in RabbitMQ that probably triggered a bug (in Rosetta 2, I guess? Not sure, there's too much low-level magic involved). Anyway, we got some crazy failures where the reported stacktraces didn't match the source code, which is why I suspected that GCC optimisations might have something to do with it. The problem still exists, but since one needs to run an image of the "incorrect" architecture, I haven't reported it. Instead, we are working on having ARM builds of the Docker images in our pipelines (they've been available for the final releases for a long time; I'm just talking about our internal builds).

max-au commented 3 months ago

I'm interested in figuring out a way to tell which options were used to build ERTS. Specifically, I want erlperf to output some sort of warning: "your ERTS was built with these flags". That would prevent situations like the one above.

nickva commented 3 months ago

@max-au there is erlang:system_info(compile_info) that returns a proplist with cflags, ldflags and config_h flags. Maybe that could help?

1> proplists:get_value(cflags, erlang:system_info(compile_info)).
"-Werror=undef -Werror=implicit -Werror=return-type  -fno-common -g -O2 
-I...asdf_24.3.4.16/otp_src_24.3.4.16/erts/x86_64-apple-darwin22.6.0    
-DHAVE_CONFIG_H 
-Wall -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Wdeclaration-after-statement 
-DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS   -DBEAMASM=1"
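Building on that, the warning max-au describes could be sketched roughly as follows. check_erts_flags/0 is a hypothetical helper name, not an existing erlperf API:

```erlang
%% Hypothetical helper: warn if the running ERTS was built without
%% C optimizations (e.g. CFLAGS=-O0), which skews benchmark results.
check_erts_flags() ->
    CFlags = proplists:get_value(cflags, erlang:system_info(compile_info), ""),
    case string:find(CFlags, "-O0") of
        nomatch -> ok;
        _ -> {warning, "ERTS was built with -O0; benchmark results may be unreliable"}
    end.
```

The same proplist also carries ldflags and config_h entries, so the check could be extended to other suspicious build settings.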