Nice work. I am in favour of getting you the cherished PR mark :) Maybe as a vignette? (I am a little funny with vignettes: I like simplermarkdown, and I actually like them "fixed", especially for something like benchmarking, so they do not rebuild on each package rebuild.)
But before we get there, any idea why CI failed?
Thanks!
> Maybe as a vignette?
Do you mean adding the benchmark as a vignette in a separate PR without meow? I would be happy to!
> any idea why CI failed?
Sorry about that, I missed adding meow to the default arguments. The tests should now pass.
> Do you mean adding the benchmark as a vignette in a separate PR without meow? I would be happy to!
It sounded like you were not sure about meow as a contribution so I offered the benchmark as an alternative. Happy to do both.
Yes this is exactly what I meant. Sorry about the confusion.
Any thoughts on why spookyhash is so much faster than it should be? I noticed that spookyhash calls `R_Serialize` from C while the other algorithms call `serialize` from R.
Nope! Time for some profiling, maybe?
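As a rough profiling sketch (not code from this PR; the object size and algorithm choice here are arbitrary assumptions), the cost can be split by timing the R-level `serialize()`-to-raw step separately from the hashing step:

```R
# Rough timing split (sketch, not from the PR): separate the serialize()-to-raw
# cost from the hashing cost on the non-spookyhash path.
x <- as.raw(runif(100 * 1024^2))                                       # ~100 MB test object
system.time(s <- serialize(x, NULL))                                   # serialization copy only
system.time(digest::digest(s, algo = "xxhash64", serialize = FALSE))   # hashing the raw vector only
system.time(digest::digest(x, algo = "xxhash64", serialize = TRUE))    # combined path
system.time(digest::digest(x, algo = "spookyhash"))                    # streams via R_Serialize in C
```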
I think I found something interesting: the memory output streams/connections used for serialization seem to start with zero reserved space in base R. I did a quick test of what happens if you preallocate the buffer to `object.size()` (with and without padding, and with and without reading the result back from the connection):
```R
serialize_fast <- function(object) {
  # Preallocate the raw connection buffer to (approximately) the serialized size
  zz <- rawConnection(raw(object.size(object) - object.size(raw(0))), "r+")
  serialize(object, zz)
  re <- rawConnectionValue(zz)
  close(zz)
  re
}

serialize_fast_pad <- function(object) {
  # Same, but with a little extra headroom to avoid a final reallocation
  zz <- rawConnection(raw(object.size(object) - object.size(raw(0)) + 32), "r+")
  serialize(object, zz)
  re <- rawConnectionValue(zz)
  close(zz)
  re
}

serialize_fast_pad_noread <- function(object) {
  # As above, but without reading the result back from the connection
  zz <- rawConnection(raw(object.size(object) - object.size(raw(0)) + 32), "r+")
  serialize(object, zz)
  close(zz)
}

ben2 <- function(object, num_reps) {
  df <- rbenchmark::benchmark(
    "old" = {
      serialize(object, NULL)
    },
    "new" = {
      serialize_fast(object)
    },
    "new_pad" = {
      serialize_fast_pad(object)
    },
    "new_pad_noread" = {
      serialize_fast_pad_noread(object)
    },
    replications = num_reps, order = "elapsed")
  df$bandwidth <- sprintf("%.2f GB/s", object.size(object) * num_reps / 1024^3 / df$elapsed)
  df
}
```
This leads to roughly 2x-3x the performance of `serialize()`:

```R
ben2(as.raw(runif(1024^3)), num_reps = 1) # |> knitr::kable(row.names = F)
```
test | replications | elapsed | relative | user.self | sys.self | user.child | sys.child | bandwidth |
---|---|---|---|---|---|---|---|---|
new_pad_noread | 1 | 0.39 | 1.000 | 0.30 | 0.10 | NA | NA | 2.56 GB/s |
new_pad | 1 | 0.56 | 1.436 | 0.40 | 0.17 | NA | NA | 1.79 GB/s |
new | 1 | 0.74 | 1.897 | 0.58 | 0.15 | NA | NA | 1.35 GB/s |
old | 1 | 1.26 | 3.231 | 0.81 | 0.45 | NA | NA | 0.79 GB/s |
I will try to implement this directly in C next.
I have implemented this now in C and did some more benchmarks:
https://github.com/nx10/serialize_prealloc
I think this should be fixed in base R, and I will look into how to submit a PR/patch there. For now, feel free to close this PR; I will come back to digest soon for the benchmark vignette. Thank you for your time!
Nice stuff! I hope you find an R Core sponsor for it.
Now that leaves us with meow: do you think there is/isn't a PR here?
I don't think it's worth it. xxhash has close to the same speed with AVX2 extensions enabled, and meow does not work at all on other processors (https://github.com/Cyan4973/xxHash/wiki/Performance-comparison). So if something really fast is needed in digest, I would recommend looking into conditionally enabling the AVX2 extensions for xxhash.
Maybe xxhash should also be updated (digest uses version v0.6.5; the current release is v0.8.1).
> new vignette on benchmarking: no rush
Thanks! On my to-do list :)
Closing this now.
I had a look at meow (see #173) as a possible Hacktoberfest contribution, but it did not go as planned. This PR is a minimal working example of meow (no file support, no docs, no configure script for different processors/compilers) and is not intended for merging; however, I want to share some observations I made with it:
First of all, meow requires x64 as well as hardware AES, so I assume it is not a good fit for `digest` overall. However, to see how fast it is, I created a small benchmark.
I ran the following tests on my Windows 10 machine with R 4.2 and a 12th Gen Intel(R) Core(TM) i7-12700K (3610 MHz, 12 cores, 20 logical processors).
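The `ben()` helper itself is not included in this excerpt. Below is a simplified sketch of what such a helper might look like (an assumption, not the code from the PR branch; it omits the `relative`, `user.self`/`sys.self`, and `hash_width` columns shown in the tables, and `meow` as an algorithm only exists on this PR branch):

```R
# Hypothetical sketch of a ben() benchmark helper (not the exact code from
# this PR): time digest() for each algorithm and derive a bandwidth column.
ben <- function(object, num_reps, serialize = TRUE) {
  algos <- c("spookyhash", "meow", "xxhash64", "xxhash32", "murmur32",
             "crc32", "blake3", "sha1", "md5", "sha512")
  res <- lapply(algos, function(a) {
    # spookyhash only supports serialize = TRUE in digest
    s <- if (a == "spookyhash") TRUE else serialize
    elapsed <- system.time(
      for (i in seq_len(num_reps))
        digest::digest(object, algo = a, serialize = s)
    )[["elapsed"]]
    data.frame(test = a, replications = num_reps, elapsed = elapsed,
               bandwidth = sprintf("%.2f GB/s",
                 as.numeric(object.size(object)) * num_reps / 1024^3 / elapsed))
  })
  df <- do.call(rbind, res)
  df[order(df$elapsed), ]
}
```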
A small size (100kB) test with many reps
```R
ben(as.raw(runif(100 * 1024)), num_reps = 10000, serialize = T) # |> knitr::kable(row.names = F)
```

|test | replications| elapsed| relative| user.self| sys.self| user.child| sys.child|hash_width |bandwidth |
|:----------|------------:|-------:|--------:|---------:|--------:|----------:|---------:|:----------|:---------|
|spookyhash | 10000| 0.18| 1.000| 0.18| 0.00| NA| NA|128 |5.30 GB/s |
|meow | 10000| 0.42| 2.333| 0.27| 0.16| NA| NA|128 |2.27 GB/s |
|xxhash64 | 10000| 0.48| 2.667| 0.33| 0.16| NA| NA|64 |1.99 GB/s |
|xxhash32 | 10000| 0.51| 2.833| 0.31| 0.20| NA| NA|32 |1.87 GB/s |
|murmur32 | 10000| 0.64| 3.556| 0.44| 0.20| NA| NA|32 |1.49 GB/s |
|crc32 | 10000| 1.00| 5.556| 0.80| 0.20| NA| NA|32 |0.95 GB/s |
|blake3 | 10000| 1.33| 7.389| 1.24| 0.09| NA| NA|256 |0.72 GB/s |
|sha1 | 10000| 1.34| 7.444| 1.09| 0.25| NA| NA|160 |0.71 GB/s |
|sha1 | 10000| 1.40| 7.778| 1.27| 0.13| NA| NA|160 |0.68 GB/s |
|md5 | 10000| 1.60| 8.889| 1.36| 0.22| NA| NA|128 |0.60 GB/s |
|sha512 | 10000| 2.28| 12.667| 2.00| 0.27| NA| NA|512 |0.42 GB/s |

A larger (1GB) test with 1 rep
```R
ben(as.raw(runif(1024^3)), num_reps = 1, serialize = T) # |> knitr::kable(row.names = F)
```

|test | replications| elapsed| relative| user.self| sys.self| user.child| sys.child|hash_width |bandwidth |
|:----------|------------:|-------:|--------:|---------:|--------:|----------:|---------:|:----------|:----------|
|spookyhash | 1| 0.08| 1.000| 0.08| 0.00| NA| NA|128 |12.50 GB/s |
|meow | 1| 1.36| 17.000| 1.00| 0.36| NA| NA|128 |0.74 GB/s |
|xxhash64 | 1| 1.39| 17.375| 0.84| 0.55| NA| NA|64 |0.72 GB/s |
|xxhash32 | 1| 1.41| 17.625| 0.86| 0.54| NA| NA|32 |0.71 GB/s |
|murmur32 | 1| 1.56| 19.500| 1.09| 0.47| NA| NA|32 |0.64 GB/s |
|crc32 | 1| 1.86| 23.250| 1.41| 0.46| NA| NA|32 |0.54 GB/s |
|sha1 | 1| 2.18| 27.250| 1.64| 0.55| NA| NA|160 |0.46 GB/s |
|sha1 | 1| 2.19| 27.375| 1.91| 0.28| NA| NA|160 |0.46 GB/s |
|blake3 | 1| 2.27| 28.375| 1.83| 0.44| NA| NA|256 |0.44 GB/s |
|md5 | 1| 2.42| 30.250| 2.03| 0.39| NA| NA|128 |0.41 GB/s |
|sha512 | 1| 3.15| 39.375| 2.75| 0.39| NA| NA|512 |0.32 GB/s |

meow was a bit faster than most other algorithms except spookyhash, which completely blows everything else out of the water, especially with big objects, where it was 17x-39x faster.
To check if there was something going on with serialization, I wanted to run the test again with serialization disabled, which I then found out was not supported in combination with spookyhash.
Nonetheless, here are the results with `serialize=F` for all algorithms but spookyhash:

100kB - no serialize
```R
ben(as.raw(runif(100 * 1024)), num_reps = 10000, serialize = F) # |> knitr::kable(row.names = F)
```

|test | replications| elapsed| relative| user.self| sys.self| user.child| sys.child|hash_width |bandwidth |
|:----------|------------:|-------:|--------:|---------:|--------:|----------:|---------:|:----------|:---------|
|meow | 10000| 0.10| 1.0| 0.09| 0.00| NA| NA|128 |9.54 GB/s |
|xxhash64 | 10000| 0.13| 1.3| 0.13| 0.00| NA| NA|64 |7.34 GB/s |
|xxhash32 | 10000| 0.17| 1.7| 0.17| 0.00| NA| NA|32 |5.61 GB/s |
|spookyhash | 10000| 0.18| 1.8| 0.17| 0.00| NA| NA|128 |5.30 GB/s |
|murmur32 | 10000| 0.31| 3.1| 0.28| 0.00| NA| NA|32 |3.08 GB/s |
|crc32 | 10000| 0.62| 6.2| 0.63| 0.00| NA| NA|32 |1.54 GB/s |
|sha1 | 10000| 0.98| 9.8| 0.99| 0.00| NA| NA|160 |0.97 GB/s |
|blake3 | 10000| 0.99| 9.9| 0.98| 0.00| NA| NA|256 |0.96 GB/s |
|sha1 | 10000| 1.03| 10.3| 1.00| 0.02| NA| NA|160 |0.93 GB/s |
|md5 | 10000| 1.13| 11.3| 1.13| 0.00| NA| NA|128 |0.84 GB/s |
|sha512 | 10000| 1.73| 17.3| 1.74| 0.00| NA| NA|512 |0.55 GB/s |

(spookyhash: serialize=T)

1GB - no serialize
```R
ben(as.raw(runif(1024^3)), num_reps = 1, serialize = F) # |> knitr::kable(row.names = F)
```

|test | replications| elapsed| relative| user.self| sys.self| user.child| sys.child|hash_width |bandwidth |
|:----------|------------:|-------:|--------:|---------:|--------:|----------:|---------:|:----------|:----------|
|meow | 1| 0.03| 1.000| 0.03| 0| NA| NA|128 |33.33 GB/s |
|xxhash64 | 1| 0.08| 2.667| 0.08| 0| NA| NA|64 |12.50 GB/s |
|spookyhash | 1| 0.08| 2.667| 0.07| 0| NA| NA|128 |12.50 GB/s |
|xxhash32 | 1| 0.11| 3.667| 0.11| 0| NA| NA|32 |9.09 GB/s |
|murmur32 | 1| 0.23| 7.667| 0.23| 0| NA| NA|32 |4.35 GB/s |
|crc32 | 1| 0.56| 18.667| 0.56| 0| NA| NA|32 |1.79 GB/s |
|sha1 | 1| 0.93| 31.000| 0.94| 0| NA| NA|160 |1.08 GB/s |
|blake3 | 1| 0.95| 31.667| 0.95| 0| NA| NA|256 |1.05 GB/s |
|sha1 | 1| 0.96| 32.000| 0.95| 0| NA| NA|160 |1.04 GB/s |
|md5 | 1| 1.10| 36.667| 1.10| 0| NA| NA|128 |0.91 GB/s |
|sha512 | 1| 1.78| 59.333| 1.78| 0| NA| NA|512 |0.56 GB/s |

(spookyhash: serialize=T)

Interestingly, spookyhash with serialization seems to hold its own against xxhash and meow even when serialization is disabled for those two (although meow is 3x faster with the large data).
I think there is something strange going on with spookyhash. I noticed in the code that it is treated differently than the other hash functions. Is it possible that the spookyhash implementation somehow avoids a copy? Can we avoid this copy in the other algorithms to make digest substantially faster? Or is the spookyhash implementation broken?
Of course all of this could also just be the result of an error I made.
Some closing points:
- `raw=TRUE`
Cheers, Florian