Nice work. I am in favour of getting you the cherished PR mark :) Maybe as a vignette? (I am a little funny with vignettes: I like simplermarkdown, and I actually like them "fixed", especially for something like benchmarking, so they do not rebuild on each package rebuild.)
But before we get there, any idea why CI failed?
Thanks!
> Maybe as a vignette?
Do you mean adding the benchmark as a vignette in a separate PR without meow? I would be happy to!
> any idea why CI failed?
Sorry about that, I missed adding meow to the default arguments. The tests should now pass.
> Do you mean adding the benchmark as a vignette in a separate PR without meow? I would be happy to!
It sounded like you were not sure about meow as a contribution so I offered the benchmark as an alternative. Happy to do both.
Yes this is exactly what I meant. Sorry about the confusion.
Any thoughts on why spookyhash is so much faster than it should be? I noticed that spookyhash calls `R_Serialize` from C while the other algorithms call `serialize` from R.
Nope! Time for some profiling, maybe?
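As a rough profiling sketch (not code from this PR; the object size and algorithm choice here are arbitrary assumptions), the cost can be split by timing the R-level `serialize()`-to-raw step separately from the hashing step:

```R
# Rough timing split (sketch, not from the PR): separate the serialize()-to-raw
# cost from the hashing cost on the non-spookyhash path.
x <- as.raw(runif(100 * 1024^2))                                       # ~100 MB test object
system.time(s <- serialize(x, NULL))                                   # serialization copy only
system.time(digest::digest(s, algo = "xxhash64", serialize = FALSE))   # hashing the raw vector only
system.time(digest::digest(x, algo = "xxhash64", serialize = TRUE))    # combined path
system.time(digest::digest(x, algo = "spookyhash"))                    # streams via R_Serialize in C
```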
I think I found something interesting: the memory output streams/connections used for serialization seem to start with zero reserved space in base R. I did a quick test of what happens if you preallocate the buffer to `object.size()` (with and without padding, and with and without reading the result back from the connection):
```R
serialize_fast <- function(object) {
  # Preallocate the raw connection buffer to (approximately) the serialized size
  zz <- rawConnection(raw(object.size(object) - object.size(raw(0))), "r+")
  serialize(object, zz)
  re <- rawConnectionValue(zz)
  close(zz)
  re
}

serialize_fast_pad <- function(object) {
  # Same, but with a little extra headroom to avoid a final reallocation
  zz <- rawConnection(raw(object.size(object) - object.size(raw(0)) + 32), "r+")
  serialize(object, zz)
  re <- rawConnectionValue(zz)
  close(zz)
  re
}

serialize_fast_pad_noread <- function(object) {
  # As above, but without reading the result back from the connection
  zz <- rawConnection(raw(object.size(object) - object.size(raw(0)) + 32), "r+")
  serialize(object, zz)
  close(zz)
}

ben2 <- function(object, num_reps) {
  df <- rbenchmark::benchmark(
    "old" = {
      serialize(object, NULL)
    },
    "new" = {
      serialize_fast(object)
    },
    "new_pad" = {
      serialize_fast_pad(object)
    },
    "new_pad_noread" = {
      serialize_fast_pad_noread(object)
    },
    replications = num_reps, order = "elapsed")
  df$bandwidth <- sprintf("%.2f GB/s", object.size(object) * num_reps / 1024^3 / df$elapsed)
  df
}
```
This leads to roughly 2x-3x the performance of `serialize()`:

```R
ben2(as.raw(runif(1024^3)), num_reps = 1) # |> knitr::kable(row.names = F)
```
test | replications | elapsed | relative | user.self | sys.self | user.child | sys.child | bandwidth |
---|---|---|---|---|---|---|---|---|
new_pad_noread | 1 | 0.39 | 1.000 | 0.30 | 0.10 | NA | NA | 2.56 GB/s |
new_pad | 1 | 0.56 | 1.436 | 0.40 | 0.17 | NA | NA | 1.79 GB/s |
new | 1 | 0.74 | 1.897 | 0.58 | 0.15 | NA | NA | 1.35 GB/s |
old | 1 | 1.26 | 3.231 | 0.81 | 0.45 | NA | NA | 0.79 GB/s |
I will try to implement this directly in C next.
I have implemented this now in C and did some more benchmarks:
https://github.com/nx10/serialize_prealloc
I think this should be fixed in base R, and I will look into how to submit a PR/patch there. For now, feel free to close this PR; I will come back to digest soon for the benchmark vignette. Thank you for your time!
Nice stuff! I hope you find an R Core sponsor for it.
Now that leaves us with meow: do you think there is/isn't a PR here?
I don't think it's worth it. xxhash has close to the same speed with AVX2 extensions enabled, and meow does not work at all on other processors (https://github.com/Cyan4973/xxHash/wiki/Performance-comparison). So if something really fast is needed in digest, I would recommend looking into conditionally enabling the AVX2 extensions for xxhash.
Maybe xxhash should also be updated (digest uses version v0.6.5; the current release is v0.8.1).
> new vignette on benchmarking: no rush
Thanks! On my to-do list :)
Closing this now.
I had a look at meow (see #173) as a possible Hacktoberfest contribution, but it did not go as planned. This PR is a minimal working example of meow (no file support, no docs, no configure script for different processors/compilers) and is not intended for merging; however, I want to share some observations I made with it:
First of all, meow requires x64 as well as hardware AES, so I assume it is not a good fit for `digest` overall. However, to see how fast it is, I created a small benchmark.
I ran the following tests on my Windows 10 machine with R 4.2 and a 12th Gen Intel(R) Core(TM) i7-12700K (3610 MHz, 12 cores, 20 logical processors).
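The `ben()` helper itself is not included in this excerpt. Below is a simplified sketch of what such a helper might look like (an assumption, not the code from the PR branch; it omits the `relative`, `user.self`/`sys.self`, and `hash_width` columns shown in the tables, and `meow` as an algorithm only exists on this PR branch):

```R
# Hypothetical sketch of a ben() benchmark helper (not the exact code from
# this PR): time digest() for each algorithm and derive a bandwidth column.
ben <- function(object, num_reps, serialize = TRUE) {
  algos <- c("spookyhash", "meow", "xxhash64", "xxhash32", "murmur32",
             "crc32", "blake3", "sha1", "md5", "sha512")
  res <- lapply(algos, function(a) {
    # spookyhash only supports serialize = TRUE in digest
    s <- if (a == "spookyhash") TRUE else serialize
    elapsed <- system.time(
      for (i in seq_len(num_reps))
        digest::digest(object, algo = a, serialize = s)
    )[["elapsed"]]
    data.frame(test = a, replications = num_reps, elapsed = elapsed,
               bandwidth = sprintf("%.2f GB/s",
                 as.numeric(object.size(object)) * num_reps / 1024^3 / elapsed))
  })
  df <- do.call(rbind, res)
  df[order(df$elapsed), ]
}
```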
A small size (100kB) test with many reps
```R
ben(as.raw(runif(100 * 1024)), num_reps = 10000, serialize = T) # |> knitr::kable(row.names = F)
```

|test | replications| elapsed| relative| user.self| sys.self| user.child| sys.child|hash_width |bandwidth |
|:----------|------------:|-------:|--------:|---------:|--------:|----------:|---------:|:----------|:---------|
|spookyhash | 10000| 0.18| 1.000| 0.18| 0.00| NA| NA|128 |5.30 GB/s |
|meow | 10000| 0.42| 2.333| 0.27| 0.16| NA| NA|128 |2.27 GB/s |
|xxhash64 | 10000| 0.48| 2.667| 0.33| 0.16| NA| NA|64 |1.99 GB/s |
|xxhash32 | 10000| 0.51| 2.833| 0.31| 0.20| NA| NA|32 |1.87 GB/s |
|murmur32 | 10000| 0.64| 3.556| 0.44| 0.20| NA| NA|32 |1.49 GB/s |
|crc32 | 10000| 1.00| 5.556| 0.80| 0.20| NA| NA|32 |0.95 GB/s |
|blake3 | 10000| 1.33| 7.389| 1.24| 0.09| NA| NA|256 |0.72 GB/s |
|sha1 | 10000| 1.34| 7.444| 1.09| 0.25| NA| NA|160 |0.71 GB/s |
|sha1 | 10000| 1.40| 7.778| 1.27| 0.13| NA| NA|160 |0.68 GB/s |
|md5 | 10000| 1.60| 8.889| 1.36| 0.22| NA| NA|128 |0.60 GB/s |
|sha512 | 10000| 2.28| 12.667| 2.00| 0.27| NA| NA|512 |0.42 GB/s |

A larger (1GB) test with 1 rep
```R
ben(as.raw(runif(1024^3)), num_reps = 1, serialize = T) # |> knitr::kable(row.names = F)
```

|test | replications| elapsed| relative| user.self| sys.self| user.child| sys.child|hash_width |bandwidth |
|:----------|------------:|-------:|--------:|---------:|--------:|----------:|---------:|:----------|:----------|
|spookyhash | 1| 0.08| 1.000| 0.08| 0.00| NA| NA|128 |12.50 GB/s |
|meow | 1| 1.36| 17.000| 1.00| 0.36| NA| NA|128 |0.74 GB/s |
|xxhash64 | 1| 1.39| 17.375| 0.84| 0.55| NA| NA|64 |0.72 GB/s |
|xxhash32 | 1| 1.41| 17.625| 0.86| 0.54| NA| NA|32 |0.71 GB/s |
|murmur32 | 1| 1.56| 19.500| 1.09| 0.47| NA| NA|32 |0.64 GB/s |
|crc32 | 1| 1.86| 23.250| 1.41| 0.46| NA| NA|32 |0.54 GB/s |
|sha1 | 1| 2.18| 27.250| 1.64| 0.55| NA| NA|160 |0.46 GB/s |
|sha1 | 1| 2.19| 27.375| 1.91| 0.28| NA| NA|160 |0.46 GB/s |
|blake3 | 1| 2.27| 28.375| 1.83| 0.44| NA| NA|256 |0.44 GB/s |
|md5 | 1| 2.42| 30.250| 2.03| 0.39| NA| NA|128 |0.41 GB/s |
|sha512 | 1| 3.15| 39.375| 2.75| 0.39| NA| NA|512 |0.32 GB/s |

meow was a bit faster than most other algorithms except spookyhash, which completely blows everything else out of the water, especially with big objects, where it was 17x-39x faster.
To check if there was something going on with serialization, I wanted to run the test again with serialization disabled, which I then found out was not supported in combination with spookyhash.
Nonetheless, here are the results with `serialize=F` for all algorithms but spookyhash:

100kB - no serialize
```R
ben(as.raw(runif(100 * 1024)), num_reps = 10000, serialize = F) # |> knitr::kable(row.names = F)
```

|test | replications| elapsed| relative| user.self| sys.self| user.child| sys.child|hash_width |bandwidth |
|:----------|------------:|-------:|--------:|---------:|--------:|----------:|---------:|:----------|:---------|
|meow | 10000| 0.10| 1.0| 0.09| 0.00| NA| NA|128 |9.54 GB/s |
|xxhash64 | 10000| 0.13| 1.3| 0.13| 0.00| NA| NA|64 |7.34 GB/s |
|xxhash32 | 10000| 0.17| 1.7| 0.17| 0.00| NA| NA|32 |5.61 GB/s |
|spookyhash | 10000| 0.18| 1.8| 0.17| 0.00| NA| NA|128 |5.30 GB/s |
|murmur32 | 10000| 0.31| 3.1| 0.28| 0.00| NA| NA|32 |3.08 GB/s |
|crc32 | 10000| 0.62| 6.2| 0.63| 0.00| NA| NA|32 |1.54 GB/s |
|sha1 | 10000| 0.98| 9.8| 0.99| 0.00| NA| NA|160 |0.97 GB/s |
|blake3 | 10000| 0.99| 9.9| 0.98| 0.00| NA| NA|256 |0.96 GB/s |
|sha1 | 10000| 1.03| 10.3| 1.00| 0.02| NA| NA|160 |0.93 GB/s |
|md5 | 10000| 1.13| 11.3| 1.13| 0.00| NA| NA|128 |0.84 GB/s |
|sha512 | 10000| 1.73| 17.3| 1.74| 0.00| NA| NA|512 |0.55 GB/s |

(spookyhash: serialize=T)

1GB - no serialize
```R
ben(as.raw(runif(1024^3)), num_reps = 1, serialize = F) # |> knitr::kable(row.names = F)
```

|test | replications| elapsed| relative| user.self| sys.self| user.child| sys.child|hash_width |bandwidth |
|:----------|------------:|-------:|--------:|---------:|--------:|----------:|---------:|:----------|:----------|
|meow | 1| 0.03| 1.000| 0.03| 0| NA| NA|128 |33.33 GB/s |
|xxhash64 | 1| 0.08| 2.667| 0.08| 0| NA| NA|64 |12.50 GB/s |
|spookyhash | 1| 0.08| 2.667| 0.07| 0| NA| NA|128 |12.50 GB/s |
|xxhash32 | 1| 0.11| 3.667| 0.11| 0| NA| NA|32 |9.09 GB/s |
|murmur32 | 1| 0.23| 7.667| 0.23| 0| NA| NA|32 |4.35 GB/s |
|crc32 | 1| 0.56| 18.667| 0.56| 0| NA| NA|32 |1.79 GB/s |
|sha1 | 1| 0.93| 31.000| 0.94| 0| NA| NA|160 |1.08 GB/s |
|blake3 | 1| 0.95| 31.667| 0.95| 0| NA| NA|256 |1.05 GB/s |
|sha1 | 1| 0.96| 32.000| 0.95| 0| NA| NA|160 |1.04 GB/s |
|md5 | 1| 1.10| 36.667| 1.10| 0| NA| NA|128 |0.91 GB/s |
|sha512 | 1| 1.78| 59.333| 1.78| 0| NA| NA|512 |0.56 GB/s |

(spookyhash: serialize=T)

Interestingly, spookyhash with serialization seems to hold its own against xxhash and meow even when serialization is disabled for those two (although meow is 3x faster with the large data).
I think there is something strange going on with spookyhash. I noticed in the code that it is treated differently than the other hash functions. Is it possible that the spookyhash implementation somehow avoids a copy? Can we avoid this copy in the other algorithms to make digest substantially faster? Or is the spookyhash implementation broken?
Of course all of this could also just be the result of an error I made.
Some closing points:
- `raw=TRUE`
Cheers, Florian