Closed: pearsonca closed this pull request 2 months ago.
I am not (yet ?) convinced we are making net gains here.
In terms of the next-level consolidation, can consider md5:
md5_context ctx;
int output_length = 16; // produces 16*8 = 128 bits
unsigned char md5sum[output_length];
md5_starts( &ctx );
md5_update( &ctx, txt, nChar);
md5_finish( &ctx, md5sum );
return _store_from_char_ptr(md5sum, output_length, leaveRaw);
vs
md5_context ctx;
int output_length = 16;
unsigned char md5sum[output_length];
md5_starts( &ctx );
FILEHASH(md5_update( &ctx, buf, nChar )); // file case: FILEHASH drives md5_update over the file contents
md5_finish( &ctx, md5sum );
return _store_from_char_ptr(md5sum, output_length, leaveRaw);
... the other cases are similarly essentially identical, so some file-vs-not logic could eliminate the need for the 100+ cases.
> I am not (yet ?) convinced we are making net gains here.
Fair - I'm going to make some benchmarking checks here, though I know those are imperfect.
(though end of day here, so not for a bit)
A little performance testing; in summary, this PR version (digest_alt in this comparison) seems reliably better (only the last comparison bucks the trend) - roughly a 10%+ reduction in run time for my use case (lots of small hashes). I suspect there'd be relatively little benefit for hashing, say, files.
> randomwords <- replicate(1e4, paste0(sample(c(letters, LETTERS, 0:9), 20), collapse = ""))
> # baseline, character, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "md5"))))
user system elapsed
79.100 0.581 75.368
> # baseline, character, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "xxh3_64"))))
user system elapsed
70.354 0.274 70.574
> # revised, character, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "md5"))))
user system elapsed
64.904 0.249 65.129
>
> # revised, character, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64"))))
user system elapsed
59.876 0.028 59.875
> # baseline, raw, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "md5", raw = TRUE))))
user system elapsed
57.435 0.060 58.310
>
> # baseline, raw, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "xxh3_64", raw = TRUE))))
user system elapsed
55.282 0.008 55.264
>
> # revised, raw, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "md5", raw = TRUE))))
user system elapsed
51.253 0.036 51.269
>
> # revised, raw, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64", raw = TRUE))))
user system elapsed
55.134 0.032 55.149
Have a look at package microbenchmark. It also has a handy plot (and 'autoplot') option. Easier to eyeball. Multiple runs.
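For example, roughly (res is just an illustrative name; capture the result, then plot it):

res <- microbenchmark::microbenchmark(
  base = sapply(randomwords, \(x) digest(x, algo = "md5")),
  rev  = sapply(randomwords, \(x) digest_alt(x, algo = "md5")),
  times = 100
)
ggplot2::autoplot(res)   # plots the timing distributions
boxplot(res)             # base-graphics alternative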
> microbenchmark::microbenchmark(
+ md5chabase = sapply(randomwords, \(x) digest(x, algo = "md5")),
+ md5charev = sapply(randomwords, \(x) digest_alt(x, algo = "md5")),
+ md5rawbase = sapply(randomwords, \(x) digest(x, algo = "md5", raw = TRUE)),
+ md5rawrev = sapply(randomwords, \(x) digest_alt(x, algo = "md5", raw = TRUE)),
+ xxhchabase = sapply(randomwords, \(x) digest(x, algo = "xxh3_64")),
+ xxhcharev = sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64")),
+ xxhrawbase = sapply(randomwords, \(x) digest(x, algo = "xxh3_64", raw = TRUE)),
+ xxhrawrev = sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64", raw = TRUE))
+ )
Unit: milliseconds
expr min lq mean median uq max neval
md5chabase 5.310727 5.707952 7.356002 6.431015 7.722992 20.35156 100
md5charev 5.360650 5.832965 7.060423 6.500092 7.941240 17.50174 100
md5rawbase 5.013431 5.558714 6.830248 6.394128 7.326719 19.73746 100
md5rawrev 5.028035 5.533684 6.436786 6.023105 7.063990 17.99658 100
xxhchabase 5.129542 5.674817 6.842931 6.321524 7.541191 21.34342 100
xxhcharev 5.137640 5.662121 6.602645 5.795022 7.066211 16.56205 100
xxhrawbase 5.013435 5.568786 6.678245 6.351468 7.350908 16.11524 100
xxhrawrev 5.033589 5.556206 6.769648 6.116699 6.798559 19.34696 100
I'd call this a tie.
> I'd call this a tie.
Still seems to me that there's too much overhead somewhere. Used directly, xxh is much faster. Agreed we won't get all of that back, but I want to keep looking for ways to reduce the overhead. Going to put this on ice and try to find opportunities on the R side.
> Still seems to me that there's too much overhead somewhere.
It's possible. Maybe it is worth also looking at 'the other side', i.e. trying the most minimal, say, xxh use and then seeing what it takes to get it into R via a barebones interface? There are always some costs, but also some we can't really avoid.
I am wondering if we should compare against something like fastdigest (which, if memory serves, mostly won by being vectorized - something we got here later via another clever PR).
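Roughly, and reusing randomwords from above (note that fastdigest hashes R's serialization of the object, so the outputs won't match digest's - only the overhead is comparable):

microbenchmark::microbenchmark(
  digest_md5 = sapply(randomwords, \(x) digest(x, algo = "md5")),
  fastdigest = sapply(randomwords, fastdigest::fastdigest),
  times = 50
)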
Overall, methinks it would be nice to decompose the cost into (1) the raw cost of the hash algorithm itself, (2) the unavoidable cost of getting data into and results back out of R, and (3) whatever overhead digest adds on top of that, so that we get a handle on what may be improved in the third point, as the other two are more or less given as is.
I did a bit of experimenting on the plane - I don't recall the precise figures, but about two thirds of the run time goes away when skipping straight to the C code without any of the argument munging etc. on the R side.
(And then the algorithm performance differences were a bit more visible.)
We can't change the signature, including the error behavior, but maybe some of that can be pushed down to the C code.
Alternatively, maybe offer some other way to call in from R, trading flexibility/error checking for speed, and then wrap that with the current digest()?
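Something with this shape, perhaps (all names here are hypothetical, and the internal worker below still just calls digest() so the sketch runs - the point is only where the checks live):

.digest_quick <- function(x, algo, raw = FALSE) {
  ## imagine this going straight to the registered C routine with no argument
  ## munging; as a runnable stand-in it calls digest() with serialization off
  digest::digest(x, algo = algo, serialize = FALSE, raw = raw)
}

digest_checked <- function(object, algo = c("md5", "sha1", "xxh3_64"), raw = FALSE) {
  ## the full validation / flexibility lives here, in the user-facing wrapper
  algo <- match.arg(algo)
  stopifnot(is.character(object), length(object) == 1L)
  .digest_quick(object, algo, raw)
}

digest_checked("some text", algo = "md5")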
Poking at the R side performance:
Just fixing those basically gets rid of all the R-side noise. Then comparing the new (this PR) C implementation against the old one looks like this:
(the plots have the bottom/top 5% of benchmark times trimmed, so the non-log scale can be used)
I love love love what you are doing here, but your writeup is a little on the 'too dense' side (and my mind is currently 'at work' too). Can you expand a little?
Also, a rebase seems to be in order as you appear to have a conflict in src/digest.c.
I'm going to update this PR with the above recommendations so we can both see code changes, then I'll add the benchmarking code here in comments?
I'm vaguely familiar with integrating benchmarking into a testthat-based testing setup - not sure how to do the idiomatic version with tinytest.
Yes, an incremental approach, one (logical) changeset at a time, seems fine. Maybe one per your bullet points above.
Can you show me the details of a timing comparison under testthat? I am only aware of the (seemingly quite involved) automated setup data.table has (using no test runner package). If something simple can be done without added dependencies then Mark (over at tinytest) may be all ears.
> Yes, an incremental approach, one (logical) changeset at a time, seems fine. Maybe one per your bullet points above.
> Can you show me the details of a timing comparison under testthat? I am only aware of the (seemingly quite involved) automated setup data.table has (using no test runner package). If something simple can be done without added dependencies then Mark (over at tinytest) may be all ears.
:+1:
Re benchmarking: I'm familiar with adding benchmarks to an existing CI setup, e.g. with epinowcast - honestly, I have no clue how testthat / touchstone does its magic, but I think the gist is storing benchmark timings and recomputing them for new PRs.
tinytest is simple and elegant: everything is just a script, and the evaluated predicates are aggregated. No more, no less, no headaches (in my view).
Methinks we are talking about adding benchmarking to CI which, again, is just scripting (albeit via the terrible YAML interface). My personal preference is generally to keep it easy and light, and e.g. my default is often just one run under Linux, as I fail to see the point of some of the suggested defaults that waste CPU resources on testing trivial changes on, say, five R releases and three architectures. But each to their own.
For supplemental analyses I sometimes also fork or copy repos to bang harder at them from another angle (but not on each commit).
> tinytest is simple and elegant: everything is just a script, and the evaluated predicates are aggregated. No more, no less, no headaches (in my view).
> Methinks we are talking about adding benchmarking to CI which, again, is just scripting (albeit via the terrible YAML interface). My personal preference is generally to keep it easy and light, and e.g. my default is often just one run under Linux, as I fail to see the point of some of the suggested defaults that waste CPU resources on testing trivial changes on, say, five R releases and three architectures. But each to their own.
> For supplemental analyses I sometimes also fork or copy repos to bang harder at them from another angle (but not on each commit).
Mmm, so I think the tinytest benchmarking notion would then be roughly: a tinytest script that runs a fixed set of benchmark expressions, records the timings, and compares them against stored reference results.
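A hypothetical sketch of what such a script could look like (the reference value is a placeholder for something stored per machine, e.g. in an .rds file):

words <- replicate(1e3, paste0(sample(c(letters, LETTERS, 0:9), 20), collapse = ""))
cur <- summary(microbenchmark::microbenchmark(
  md5 = sapply(words, \(x) digest::digest(x, algo = "md5")),
  times = 20, unit = "ms"
))
ref_median <- 5.0   # placeholder: would be read from a stored, machine-specific reference
tinytest::expect_true(cur$median <= 1.25 * ref_median)   # allow some noise over the reference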
Also all of this seems distinctly not portable, since the benchmarking is machine specific. I'm comfortable that large-enough gains on a single platform are going to just be portable as improvements, but there's not really a good way to make the reference results (i.e. not the relative differences between results) portable, as those will also be machine specific.
I think sticking with benchmarking manually (barring finding someone that wants to do the CI overhaul for its own sake) is the right answer.
Let's maybe move benchmarking to a different issue. I have some similar thoughts - we can compare the 'current branch' versus the main branch. As one cannot have two different versions of the same package loaded, one would indeed have to farm out and then collect and compare. That could be generally useful.
One also needs, of course, a set of expressions to run. Maybe one could tag those, roxygen2-style, in docs or examples, or put them in a new subdirectory.
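A rough sketch of the farm-out-and-collect shape (everything here is illustrative: lib_main and lib_branch stand for two library paths holding builds of the two versions, and callr is just one convenient way to get a fresh session):

bench_one <- function(libdir) {
  callr::r(function(lib) {
    library(digest, lib.loc = lib)   # load the build installed under 'lib'
    words <- replicate(1e3, paste0(sample(c(letters, LETTERS, 0:9), 20), collapse = ""))
    summary(microbenchmark::microbenchmark(
      md5  = sapply(words, \(x) digest(x, algo = "md5")),
      xxh3 = sapply(words, \(x) digest(x, algo = "xxh3_64")),
      times = 50
    ))
  }, args = list(lib = libdir))
}

res_main   <- bench_one("lib_main")     # build from the main branch
res_branch <- bench_one("lib_branch")   # build from this PR's branch
merge(res_main, res_branch, by = "expr", suffixes = c(".main", ".branch"))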
Might ultimately use some of this code, but this PR seems like a dead end now.
Per #212, rough proposal.
Thoughts?