Closed · Kerollmops closed this pull request 2 years ago
Hey @Kerollmops 👋 thank you for the PR. Could you tell me in which version you saw the 17% perf gain? I usually test basic_batched.rs and threaded_batched.rs, and I am not seeing much difference:
Tue Aug 17 17:16:10 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m33.049s
user 0m30.617s
sys 0m2.227s
Tue Aug 17 17:16:44 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m30.889s
user 0m43.786s
sys 0m3.783s
I changed the common.rs module to be a lib.rs file; this way cargo will not try to compile it as a binary.
also, what would be the benefit of this?
That's strange because on my side I tested with basic_batched.rs and threaded_batched.rs like you and the time moved from around 30s to 25s on my iMac...
also, what would be the benefit of this?
Being able to compile by only using cargo build --release without triggering an error, as there is no main function in the common module. That's simply more Rust idiomatic: don't try to compile something as a binary when it was not intended to be :)
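To illustrate, here is a rough sketch of the conventional cargo layout (assuming the benchmark binaries live under src/bin/, which is not necessarily exactly how this repository is organised):

    src/lib.rs                  <- the former common.rs, compiled once as the library crate
    src/bin/basic_batched.rs    <- each file under src/bin/ has its own main and builds to its own binary
    src/bin/threaded_batched.rs

With this layout, cargo build --release builds the library plus every binary, and nothing without a main function is ever treated as one.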
Being able to compile by only using cargo build --release without triggering an error, as there is no main function in the common module. That's simply more Rust idiomatic: don't try to compile something as a binary when it was not intended to be :)
makes sense, thank you!
That's strange because on my side I tested with basic_batched.rs and threaded_batched.rs like you and the time moved from around 30s to 25s on my iMac...
We are now dealing with the classic 'works on my machine' problem :P
Let me do the before-and-after tests and post the numbers here. There is also a PR (#18) that is trying to add the CircleCI setup. That should help resolve the conflicts.
Here is the time it takes to run threaded_batched on my new Apple M1 MacBook Pro (not my iMac), on the main branch of your repository.
Mar 17 aoû 2021 14:39:06 CEST [RUST] threaded_batched.rs (100_000_000) inserts
./target/release/threaded_batched 24,77s user 1,98s system 131% cpu 20,304 total
And here is the time it takes to run this branch on my M1 MacBook Pro. In short, I have removed 18.02% of the computing time.
Mar 17 aoû 2021 14:42:37 CEST [RUST] threaded_batched.rs (100_000_000) inserts
./target/release/threaded_batched 18,73s user 1,30s system 120% cpu 16,645 total
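(From the wall-clock totals above: (20.304 − 16.645) / 20.304 ≈ 18.02%.)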
There is also a PR (#18) that is trying to add the CircleCI setup. That should help resolve the conflicts.
CIs are really not made for that: most of the time CI runners are simple VMs that are not stable enough in terms of CPU, with many neighbor processes polluting the CPU.
So I restarted my machine, killed active processes and ran it again. Master:
Tue Aug 17 18:04:24 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m32.645s
user 0m30.203s
sys 0m2.247s
Tue Aug 17 18:04:58 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m30.543s
user 0m43.470s
sys 0m3.792s
This PR:
Tue Aug 17 18:01:19 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m32.962s
user 0m30.444s
sys 0m2.210s
Tue Aug 17 18:01:53 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m30.853s
user 0m43.909s
sys 0m3.657s
And here is the time it takes to run this branch on my M1 MacBook Pro. In short, I have removed 18.02% of the computing time.
This is phenomenal, but I am not sure why I am not getting the same perf. I will try to spend more time on it and figure it out.
CIs are really not made for that: most of the time CI runners are simple VMs that are not stable enough in terms of CPU, with many neighbor processes polluting the CPU.
True! But this would help in setting up some baseline. Right now people have to run the code, do before-and-after benchmarks, and then post here; CI would solve all of this and improve the developer experience as well. I would still run the code locally before merging anyway.
Another idea is to reduce the number of insertions in the CI from 100M to 10M, so that it runs quickly.
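A rough sketch of how that could look (the INSERT_COUNT variable name and this helper are made up for illustration, they are not in the repo):

    use std::env;

    // Hypothetical helper: read the number of rows to insert from the
    // environment, falling back to the usual 100M, so a CI job could run
    // e.g. `INSERT_COUNT=10000000 ./target/release/threaded_batched`.
    fn insert_count() -> usize {
        env::var("INSERT_COUNT")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(100_000_000)
    }

    fn main() {
        println!("inserting {} rows", insert_count());
    }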
After rerunning it on my iMac just to be sure, here is the result of the master branch:
Mar 17 aoû 2021 18:39:59 CEST [RUST] threaded_batched.rs (100_000_000) inserts
./target/release/threaded_batched 38,77s user 3,33s system 149% cpu 28,250 total
And here is the result of my branch, a 15.13% gain:
Mar 17 aoû 2021 18:49:50 CEST [RUST] threaded_batched.rs (100_000_000) inserts
./target/release/threaded_batched 25,59s user 2,57s system 117% cpu 23,976 total
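(Again from the wall-clock totals: (28.250 − 23.976) / 28.250 ≈ 15.13%.)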
I can see some gains on Linux (Ubuntu 20.04), though less than 15%. This is a laptop with an i7-10610U.
$ uname -a
Linux 5.13.0-1010-oem #11-Ubuntu SMP Wed Jul 28 13:04:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[git::master] $ /usr/bin/time -v ./target/release/basic_batched
Command being timed: "./target/release/basic_batched"
User time (seconds): 22.52
System time (seconds): 1.30
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.83
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1691784
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 423173
Voluntary context switches: 1
Involuntary context switches: 210
Swaps: 0
File system inputs: 0
File system outputs: 3141120
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[git::master] $ /usr/bin/time -v ./target/release/threaded_batched
Command being timed: "./target/release/threaded_batched"
User time (seconds): 29.06
System time (seconds): 5.70
Percent of CPU this job got: 146%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.72
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5447572
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1362453
Voluntary context switches: 534336
Involuntary context switches: 5414
Swaps: 0
File system inputs: 0
File system outputs: 3146440
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[git::reduce-allocations] $ /usr/bin/time -v ./target/release/basic_batched
Command being timed: "./target/release/basic_batched"
User time (seconds): 20.21
System time (seconds): 1.19
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.41
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1695136
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 423958
Voluntary context switches: 1
Involuntary context switches: 191
Swaps: 0
File system inputs: 0
File system outputs: 3146936
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[git::reduce-allocations] $ /usr/bin/time -v ./target/release/threaded_batched
Command being timed: "./target/release/threaded_batched"
User time (seconds): 23.10
System time (seconds): 2.76
Percent of CPU this job got: 119%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.59
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2973804
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 744006
Voluntary context switches: 152510
Involuntary context switches: 4787
Swaps: 0
File system inputs: 0
File system outputs: 3209296
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Also, the max RSS for the threaded version is reduced almost by half.
Edit: adding another run on Debian/unstable with a Ryzen 4800U.
~/src/fast-sqlite3-inserts(master) » /usr/bin/time -v ./target/release/basic_batched
Command being timed: "./target/release/basic_batched"
User time (seconds): 25.29
System time (seconds): 1.18
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.49
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1692232
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 423302
Voluntary context switches: 139
Involuntary context switches: 80
Swaps: 0
File system inputs: 0
File system outputs: 3142016
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
~/src/fast-sqlite3-inserts(master) » /usr/bin/time -v ./target/release/threaded_batched
Command being timed: "./target/release/threaded_batched"
User time (seconds): 27.04
System time (seconds): 13.44
Percent of CPU this job got: 158%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.53
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5478920
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1369915
Voluntary context switches: 651453
Involuntary context switches: 2476
Swaps: 0
File system inputs: 0
File system outputs: 3142232
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
~/src/fast-sqlite3-inserts(reduce-allocations) » /usr/bin/time -v ./target/release/basic_batched
Command being timed: "./target/release/basic_batched"
User time (seconds): 23.25
System time (seconds): 1.18
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.43
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1694644
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 423938
Voluntary context switches: 13
Involuntary context switches: 65
Swaps: 0
File system inputs: 0
File system outputs: 3146768
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
~/src/fast-sqlite3-inserts(reduce-allocations) » /usr/bin/time -v ./target/release/threaded_batched
Command being timed: "./target/release/threaded_batched"
User time (seconds): 25.17
System time (seconds): 4.74
Percent of CPU this job got: 122%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.42
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2947644
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 737085
Voluntary context switches: 221806
Involuntary context switches: 466
Swaps: 0
File system inputs: 0
File system outputs: 3146664
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Again, a small improvement. I guess the macOS allocator is less efficient and this optimization has a greater effect there.
I am trying to get a different machine to run tests; I should have one by next week. Sorry for the long silence.
Thank you for the PR and I apologise for the delay. I got a different machine at the start of November, but around the same time I started my batch at Recurse Center and have been very busy with it.
I finally set up the machine, ran the benchmarks today, and I do see improved numbers.
New machine specs: MacBook Pro (16-inch, 2019), 2.6 GHz 6-core Intel Core i7, 16 GB 2667 MHz DDR4
master:
Mon Nov 22 18:55:46 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m30.814s
user 0m28.546s
sys 0m2.187s
Mon Nov 22 18:56:18 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m28.466s
user 0m41.418s
sys 0m3.314s
pr:
Mon Nov 22 18:47:26 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
real 0m23.826s
user 0m21.685s
sys 0m2.057s
Mon Nov 22 18:47:50 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
real 0m23.070s
user 0m27.512s
sys 0m2.465s
This is awesome!! 🥳🥳
The non-threaded version takes 22% less time and the threaded version 18% (I am going to add this to the leaderboard).
Any theories on why the threaded and non-threaded versions are now taking almost the same amount of time?
Thank you for the PR and I apologise for the delay.
No worries ;)
Any theories on why the threaded and non-threaded versions are now taking almost the same amount of time?
This is probably because the allocator is not good at synchronizing the allocation requests between threads. Not sure!
This is probably because the allocator is not good at synchronizing the allocation requests between threads. Not sure!
But on my old machine, the threaded version was 3s faster!
I have merged this, but I would still love to discuss it more if we can figure out this behaviour!
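One way to test that theory would be to swap the global allocator and re-run threaded_batched on both branches. A minimal sketch, assuming the mimalloc crate is added as a dependency (any alternative allocator would do):

    // Added at the top of the binary: if allocator contention between threads
    // is the bottleneck, changing the global allocator should move the
    // threaded numbers noticeably more than the single-threaded ones.
    use mimalloc::MiMalloc;

    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;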
This pull request globally improves the performance of the program by removing most of the allocations at run time. Basically, the changes are in the lib.rs file.
- I changed the common.rs module to be a lib.rs file; this way cargo will not try to compile it as a binary.
- I removed the Vec creation in the get_random_age function.
- I replaced the format! calls with an AreaCode new type that wraps a TinyStr8. By using this instead of strings, it helps keep memory locality by directly storing the formatted string in a pointer-sized integer (64 bits), without any indirection to read the str itself, and it also removes any possible allocation.
On my side, the performance gain is around 17.6%.
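To illustrate the AreaCode idea with a simplified, hypothetical sketch (the real PR wraps a TinyStr8; the 3-digit assumption and the AreaCode::new signature below are made up for illustration): instead of format! allocating a String on the heap, the digits are written into a small fixed-size buffer that fits in 64 bits and is trivially copyable.

    // Simplified stand-in for the TinyStr8-backed newtype described above.
    #[derive(Clone, Copy)]
    struct AreaCode([u8; 8]);

    impl AreaCode {
        // Format a 3-digit code straight into the inline buffer: no heap allocation.
        fn new(n: u16) -> Self {
            debug_assert!(n < 1000);
            let mut buf = [0u8; 8];
            buf[0] = b'0' + (n / 100) as u8;
            buf[1] = b'0' + (n / 10 % 10) as u8;
            buf[2] = b'0' + (n % 10) as u8;
            AreaCode(buf)
        }

        // The first three bytes are always ASCII digits, so this cannot fail.
        fn as_str(&self) -> &str {
            std::str::from_utf8(&self.0[..3]).unwrap()
        }
    }

    fn main() {
        let code = AreaCode::new(42);
        assert_eq!(code.as_str(), "042");
    }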