@WickedShell thanks a lot for the patch! I'll bump the version and release it shortly.
No worries, it was the first step toward being able to use it in my application :)
I made the mistake of turning on reflection warnings, though, and I'm a bit concerned about the overhead from the amount of reflection happening with each type. I started adding the appropriate hinting; cleaning up the basic types was quite straightforward, but after that it got really messy. I'm not sure I'll be able to sort out the rest of the hinting, so I'll need to profile to see whether I can take the reflection penalty of actually using buffy.
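For reference, turning the warnings on is just a REPL toggle (or the equivalent :global-vars entry in project.clj, if you want it project-wide):

;; flag every interop call the compiler can't resolve statically
(set! *warn-on-reflection* true)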
Sure. Type hinting would indeed take some time to get done there. I'll be really curious to hear about your findings. Even if you end up not using it, would you mind posting your findings here?
Sure, I can post the warnings here, but I'm sure you know how to get those yourself :) I'll get back to you with the performance part later.
Disclaimer: I am not a benchmarking expert, this is the first time I've used criterium, and the test cases were built arbitrarily. All times came from criterium, with YourKit in attach mode doing object-allocation counting. Results are from an i5-4670K CPU @ 3.40GHz, 32GB RAM, openjdk version "1.8.0_121".
I welcome feedback from others; this post is simply the results I collected as I sat down and ran the benchmarks.
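For anyone reproducing these, the sessions below assume something like the following REPL setup (the namespace names are my best guess at the relevant requires, not copied from the original session):

(require '[criterium.core :refer [bench]]
         '[clojurewerkz.buffy.core :refer :all]   ; spec, compose-buffer, set-field, the *-type constructors
         '[gloss.core :refer [defcodec ordered-map]]
         '[gloss.io :refer [encode contiguous]])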
Using master: 1236 reflection objects were created per call to (do (set-field my-buffer :f1 125) (set-field my-buffer :f3 Long/MAX_VALUE) (set-field my-buffer :f2 156))
user=> (def my-spec (spec :f1 (int32-type) :f2 (short-type) :f3 (long-type)))
#'user/my-spec
user=> (def my-buffer (compose-buffer my-spec))
#'user/my-buffer
user=> (bench (do (set-field my-buffer :f1 125) (set-field my-buffer :f3 Long/MAX_VALUE) (set-field my-buffer :f2 156)))
Evaluation count : 1255500 in 60 samples of 20925 calls.
Execution time mean : 47.433651 µs
Execution time std-deviation : 632.537590 ns
Execution time lower quantile : 46.524679 µs ( 2.5%)
Execution time upper quantile : 48.529306 µs (97.5%)
Overhead used : 7.281733 ns
Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Encoding an equivalent buffer within Gloss resulted in no reflection objects being created, and a much faster run time.
user=> (defcodec my-codec (ordered-map :f1 :int32 :f2 :int16 :f3 :int64))
#'user/my-codec
user=> (bench (encode my-codec {:f1 125 :f3 Long/MAX_VALUE :f2 156}))
Evaluation count : 26245500 in 60 samples of 437425 calls.
Execution time mean : 2.290182 µs
Execution time std-deviation : 30.404231 ns
Execution time lower quantile : 2.248623 µs ( 2.5%)
Execution time upper quantile : 2.363533 µs (97.5%)
Overhead used : 1.267376 ns
Found 4 outliers in 60 samples (6.6667 %)
low-severe 4 (6.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Now, when we start to type hint, however, the results get interesting. I've fixed most of the hinting within the first part of types.clj (basically all the byte/short/medium/long/boolean types and the unsigned variants thereof). This results in no reflection objects and a much better runtime (a sketch of the kind of hint involved follows the numbers):
user=> (bench (do (set-field my-buffer :f1 125) (set-field my-buffer :f3 Long/MAX_VALUE) (set-field my-buffer :f2 156)))
Evaluation count : 55350420 in 60 samples of 922507 calls.
Execution time mean : 1.091771 µs
Execution time std-deviation : 8.658495 ns
Execution time lower quantile : 1.080541 µs ( 2.5%)
Execution time upper quantile : 1.112170 µs (97.5%)
Overhead used : 7.824414 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
At this point it's worth pointing out that Gloss is doing more work: it actually creates a new set of byte buffers for us every time, which increases its overhead. A more even comparison is to compose (and release) a new buffer on every call, and to force Gloss to create a contiguous buffer.
Let's adjust the test case to encode a random value every time. First, the baseline cost of calling (rand-int 200):
user=> (bench (rand-int 200))
Evaluation count : 884678160 in 60 samples of 14744636 calls.
Execution time mean : 60.312011 ns
Execution time std-deviation : 0.839096 ns
Execution time lower quantile : 59.324911 ns ( 2.5%)
Execution time upper quantile : 61.927610 ns (97.5%)
Overhead used : 7.824414 ns
With the reflection gone, Buffy composing and releasing a new buffer on each call now performs at the following (the three rand-int calls contribute only ~0.18 µs combined, so they barely affect the comparison):
user=> (bench (let [my-new-buffer (compose-buffer my-spec)] (set-field my-new-buffer :f1 (rand-int 200)) (set-field my-buffer :f3 (rand-int Integer/MAX_VALUE)) (set-field my-buffer :f2 (rand-int 200)) (.release (.buffer my-new-buffer))))
Evaluation count : 2380140 in 60 samples of 39669 calls.
Execution time mean : 25.050777 µs
Execution time std-deviation : 290.651567 ns
Execution time lower quantile : 24.628053 µs ( 2.5%)
Execution time upper quantile : 25.709276 µs (97.5%)
Overhead used : 7.824414 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
While Gloss comes in at:
user=> (bench (contiguous (encode my-codec {:f1 (rand-int 200) :f3 (rand-int Integer/MAX_VALUE) :f2 (rand-int 200)})))
Evaluation count : 11280900 in 60 samples of 188015 calls.
Execution time mean : 5.378694 µs
Execution time std-deviation : 57.396730 ns
Execution time lower quantile : 5.287055 µs ( 2.5%)
Execution time upper quantile : 5.518952 µs (97.5%)
Overhead used : 1.267376 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Thanks for publishing your findings! I'll be working on improving performance. TBH I never made performance a top-level priority, since prior to release the library was feature-incomplete. This is probably a good point to take a deeper look at it.
It's also worth mentioning that buffy uses Netty byte buffers, while gloss works with barebones JVM ones. There are some advantages to using Netty buffers, although they come with some overhead.
Yeah, with the reflection cleanups buffy gets much faster, particularly if you have a use case where you can keep reusing the same buffers (which describes my use case for the most part). I think I missed the best way to get a native buffer out of Netty, though: the only way I found was to take the buffer I had composed, access its .buffer object to get the Netty one, and then copy it into a byte[] (which just happens to be the intermediate form I need for a different Java library).
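Concretely, the copy looks something like this (a sketch; I'm assuming the fields have already been written with set-field, and relying on ByteBuf.getBytes leaving the indices untouched):

(import 'io.netty.buffer.ByteBuf)

;; grab the wrapped Netty buffer and copy its contents into a byte[]
(let [^ByteBuf bb (.buffer my-buffer)
      arr (byte-array (.capacity bb))]
  (.getBytes bb 0 arr)
  arr)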
I can send in the reflection work I've already done and am confident in, and save you from redoing it, if you want. Once it's outside of the types, I'll admit I'm not sure what a lot of the hinting should be to avoid problems.
Sure, just submit a PR and I'll pick up from there.
The best way to allocate Netty buffers is to use allocators: https://netty.io/4.0/api/io/netty/buffer/ByteBufAllocator.html
I'd like to highlight that it also natively supports a bunch of different buffer kinds: direct (off-heap), heap, and so on, which was the main point of bringing it in.
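For example (a sketch against the stock Netty allocator API; the 14 bytes matches the int32 + int16 + int64 spec above):

(import '(io.netty.buffer PooledByteBufAllocator ByteBuf))

(def alloc PooledByteBufAllocator/DEFAULT)

(def ^ByteBuf heap-buf   (.heapBuffer alloc 14))   ; backed by a byte[]
(def ^ByteBuf direct-buf (.directBuffer alloc 14)) ; off-heap memory

;; pooled buffers are reference-counted, so release them when done
(.release heap-buf)
(.release direct-buf)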
Implement the test suites for all of them, as well as add the missing long test.
I'm not really happy with the naming convention I used of u(base type). It makes sense for uint32, but it feels really off for all the byte/short/medium/long cases. Happy to swap that to a different name.