Closed · Kerollmops closed this pull request 2 years ago
Hey @Kerollmops 👋 thank you for the PR. Could you tell me in which version you saw the 17% perf gain? I usually test basic_batched.rs and threaded_batched.rs, and I am not seeing much difference:
Tue Aug 17 17:16:10 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m33.049s
user 0m30.617s
sys 0m2.227s
Tue Aug 17 17:16:44 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m30.889s
user 0m43.786s
sys 0m3.783s
I changed the common.rs module to be a lib.rs file; this way cargo will not try to compile it as a binary.
also, what would be the benefit of this?
That's strange because on my side I tested with basic_batched.rs and threaded_batched.rs like you and the time moved from around 30s to 25s on my iMac...
also, what would be the benefit of this?
Being able to compile by only using cargo build --release without triggering an error, as there is no main function in the common module. That's simply more Rust idiomatic: don't try to compile something as a binary when it was not intended to be :)
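To illustrate, here is a rough sketch of the conventional cargo layout (assuming the benchmark binaries live under src/bin/, which is not necessarily exactly how this repository is organised):

    src/lib.rs                  <- the former common.rs, compiled once as the library crate
    src/bin/basic_batched.rs    <- each file under src/bin/ has its own main and builds to its own binary
    src/bin/threaded_batched.rs

With this layout, cargo build --release builds the library plus every binary, and nothing without a main function is ever treated as one.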
Being able to compile by only using cargo build --release without triggering an error, as there is no main function in the common module. That's simply more Rust idiomatic: don't try to compile something as a binary when it was not intended to be :)
makes sense, thank you!
That's strange because on my side I tested with basic_batched.rs and threaded_batched.rs like you and the time moved from around 30s to 25s on my iMac...
We are now dealing with the classic 'works on my machine' problem :P
Let me do the before-and-after tests and post the numbers here. There is also a PR (#18) that is trying to add the CircleCI setup. That should help resolve the conflicts.
Here is the time it takes to run threaded_batched on my new Apple M1 MacBook Pro (not my iMac), on the main branch of your repository.
Mar 17 aoû 2021 14:39:06 CEST [RUST] threaded_batched.rs (100_000_000) inserts
./target/release/threaded_batched 24,77s user 1,98s system 131% cpu 20,304 total
And here is the time it takes to run this branch on my M1 MacBook Pro. In short, I have removed 18.02% of the computing time.
Mar 17 aoû 2021 14:42:37 CEST [RUST] threaded_batched.rs (100_000_000) inserts
./target/release/threaded_batched 18,73s user 1,30s system 120% cpu 16,645 total
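(From the wall-clock totals above: (20.304 − 16.645) / 20.304 ≈ 18.02%.)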
There is also a PR (#18) that is trying to add the CircleCI setup. That should help resolve the conflicts.
CIs are really not made for that: most of the time CI runners are simple VMs that are not stable enough in terms of CPU, with many neighbor processes polluting the CPU.
So I restarted my machine, killed active processes and ran it again. Master:
Tue Aug 17 18:04:24 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m32.645s
user 0m30.203s
sys 0m2.247s
Tue Aug 17 18:04:58 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m30.543s
user 0m43.470s
sys 0m3.792s
This PR:
Tue Aug 17 18:01:19 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m32.962s
user 0m30.444s
sys 0m2.210s
Tue Aug 17 18:01:53 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m30.853s
user 0m43.909s
sys 0m3.657s
And here is the time it takes to run this branch on my M1 MacBook Pro. In short, I have removed 18.02% of the computing time.
This is phenomenal, but I am not sure why I am not getting the same perf. I will try to spend more time on it and figure it out.
CIs are really not made for that: most of the time CI runners are simple VMs that are not stable enough in terms of CPU, with many neighbor processes polluting the CPU.
True! But this would help in setting up some baseline. Right now people have to run the code, do before-and-after benchmarks, and then post here; CI would solve all of this and improve the developer experience as well. I would still run the code locally before merging anyway.
Another idea is to reduce the number of insertions in the CI from 100M to 10M, so that it runs quickly.
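A rough sketch of how that could look (the INSERT_COUNT variable name and this helper are made up for illustration, they are not in the repo):

    use std::env;

    // Hypothetical helper: read the number of rows to insert from the
    // environment, falling back to the usual 100M, so a CI job could run
    // e.g. `INSERT_COUNT=10000000 ./target/release/threaded_batched`.
    fn insert_count() -> usize {
        env::var("INSERT_COUNT")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(100_000_000)
    }

    fn main() {
        println!("inserting {} rows", insert_count());
    }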
After rerunning it on my iMac just to be sure, here is the result of the master branch:
Mar 17 aoû 2021 18:39:59 CEST [RUST] threaded_batched.rs (100_000_000) inserts
./target/release/threaded_batched 38,77s user 3,33s system 149% cpu 28,250 total
And here is the result of my branch, a 15.13% gain:
Mar 17 aoû 2021 18:49:50 CEST [RUST] threaded_batched.rs (100_000_000) inserts
./target/release/threaded_batched 25,59s user 2,57s system 117% cpu 23,976 total
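(Again from the wall-clock totals: (28.250 − 23.976) / 28.250 ≈ 15.13%.)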
I can see some gains on Linux (Ubuntu 20.04), though less than 15%. This is a laptop with an i7-10610U.
$ uname -a
Linux 5.13.0-1010-oem #11-Ubuntu SMP Wed Jul 28 13:04:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[git::master] $ /usr/bin/time -v ./target/release/basic_batched
Command being timed: "./target/release/basic_batched"
User time (seconds): 22.52
System time (seconds): 1.30
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.83
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1691784
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 423173
Voluntary context switches: 1
Involuntary context switches: 210
Swaps: 0
File system inputs: 0
File system outputs: 3141120
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[git::master] $ /usr/bin/time -v ./target/release/threaded_batched
Command being timed: "./target/release/threaded_batched"
User time (seconds): 29.06
System time (seconds): 5.70
Percent of CPU this job got: 146%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.72
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5447572
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1362453
Voluntary context switches: 534336
Involuntary context switches: 5414
Swaps: 0
File system inputs: 0
File system outputs: 3146440
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[git::reduce-allocations] $ /usr/bin/time -v ./target/release/basic_batched
Command being timed: "./target/release/basic_batched"
User time (seconds): 20.21
System time (seconds): 1.19
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.41
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1695136
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 423958
Voluntary context switches: 1
Involuntary context switches: 191
Swaps: 0
File system inputs: 0
File system outputs: 3146936
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[git::reduce-allocations] $ /usr/bin/time -v ./target/release/threaded_batched
Command being timed: "./target/release/threaded_batched"
User time (seconds): 23.10
System time (seconds): 2.76
Percent of CPU this job got: 119%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.59
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2973804
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 744006
Voluntary context switches: 152510
Involuntary context switches: 4787
Swaps: 0
File system inputs: 0
File system outputs: 3209296
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Also, the max RSS for the threaded version is reduced almost by half.
Edit: adding another run on Debian/unstable with a Ryzen 4800U.
~/src/fast-sqlite3-inserts(master) » /usr/bin/time -v ./target/release/basic_batched
Command being timed: "./target/release/basic_batched"
User time (seconds): 25.29
System time (seconds): 1.18
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.49
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1692232
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 423302
Voluntary context switches: 139
Involuntary context switches: 80
Swaps: 0
File system inputs: 0
File system outputs: 3142016
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
~/src/fast-sqlite3-inserts(master) » /usr/bin/time -v ./target/release/threaded_batched
Command being timed: "./target/release/threaded_batched"
User time (seconds): 27.04
System time (seconds): 13.44
Percent of CPU this job got: 158%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.53
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5478920
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1369915
Voluntary context switches: 651453
Involuntary context switches: 2476
Swaps: 0
File system inputs: 0
File system outputs: 3142232
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
~/src/fast-sqlite3-inserts(reduce-allocations) » /usr/bin/time -v ./target/release/basic_batched
Command being timed: "./target/release/basic_batched"
User time (seconds): 23.25
System time (seconds): 1.18
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.43
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1694644
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 423938
Voluntary context switches: 13
Involuntary context switches: 65
Swaps: 0
File system inputs: 0
File system outputs: 3146768
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
~/src/fast-sqlite3-inserts(reduce-allocations) » /usr/bin/time -v ./target/release/threaded_batched
Command being timed: "./target/release/threaded_batched"
User time (seconds): 25.17
System time (seconds): 4.74
Percent of CPU this job got: 122%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.42
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2947644
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 737085
Voluntary context switches: 221806
Involuntary context switches: 466
Swaps: 0
File system inputs: 0
File system outputs: 3146664
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Again, a small improvement. I guess the macOS allocator is less efficient and this optimization has a greater effect there.
I am trying to get a different machine to run tests; I should have one by next week. Sorry for the long silence.
Thank you for the PR and I apologise for the delay. I got a different machine at the start of November, but around the same time I started my batch at Recurse Center and have been very busy with it.
I finally set up the machine, ran the benchmarks today, and I do see improved numbers.
New machine specs: MacBook Pro (16-inch, 2019), 2.6 GHz 6-core Intel Core i7, 16 GB 2667 MHz DDR4
master:
Mon Nov 22 18:55:46 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m30.814s
user 0m28.546s
sys 0m2.187s
Mon Nov 22 18:56:18 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m28.466s
user 0m41.418s
sys 0m3.314s
pr:
Mon Nov 22 18:47:26 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m23.826s
user 0m21.685s
sys 0m2.057s
Mon Nov 22 18:47:50 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m23.070s
user 0m27.512s
sys 0m2.465s
This is awesome!! 🥳🥳
The non-threaded version takes 22% less time and the threaded version 18% (I am going to add this to the leaderboard).
Any theories on why the threaded and non-threaded versions are now taking almost the same amount of time?
Thank you for the PR and I apologise for the delay.
No worries ;)
Any theories on why the threaded and non-threaded versions are now taking almost the same amount of time?
This is probably because the allocator is not good at synchronizing the allocation requests between threads. Not sure!
This is probably because the allocator is not good at synchronizing the allocation requests between threads. Not sure!
But on my old machine, the threaded version was 3s faster!
I have merged this, but I would still love to discuss it more if we can figure out this behaviour!
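One way to test that theory would be to swap the global allocator and re-run threaded_batched on both branches. A minimal sketch, assuming the mimalloc crate is added as a dependency (any alternative allocator would do):

    // Added at the top of the binary: if allocator contention between threads
    // is the bottleneck, changing the global allocator should move the
    // threaded numbers noticeably more than the single-threaded ones.
    use mimalloc::MiMalloc;

    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;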
This pull request globally improves the performance of the program by removing most of the allocations at run time. Basically, the changes are in the lib.rs file.
- I changed the common.rs module to be a lib.rs file; this way cargo will not try to compile it as a binary.
- I removed the Vec creation in the get_random_age function.
- I replaced the format! calls with an AreaCode new type that wraps a TinyStr8. By using this instead of strings, it helps keep memory locality by directly storing the formatted string in a pointer-sized integer (64 bits), without any indirection to read the str itself, and it also removes any possible allocation.
On my side, the performance gain is around 17.6%.
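To illustrate the AreaCode idea with a simplified, hypothetical sketch (the real PR wraps a TinyStr8; the 3-digit assumption and the AreaCode::new signature below are made up for illustration): instead of format! allocating a String on the heap, the digits are written into a small fixed-size buffer that fits in 64 bits and is trivially copyable.

    // Simplified stand-in for the TinyStr8-backed newtype described above.
    #[derive(Clone, Copy)]
    struct AreaCode([u8; 8]);

    impl AreaCode {
        // Format a 3-digit code straight into the inline buffer: no heap allocation.
        fn new(n: u16) -> Self {
            debug_assert!(n < 1000);
            let mut buf = [0u8; 8];
            buf[0] = b'0' + (n / 100) as u8;
            buf[1] = b'0' + (n / 10 % 10) as u8;
            buf[2] = b'0' + (n % 10) as u8;
            AreaCode(buf)
        }

        // The first three bytes are always ASCII digits, so this cannot fail.
        fn as_str(&self) -> &str {
            std::str::from_utf8(&self.0[..3]).unwrap()
        }
    }

    fn main() {
        let code = AreaCode::new(42);
        assert_eq!(code.as_str(), "042");
    }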