reduce unnecessary allocations; DRY out file processing

pearsonca commented 2 months ago

Per #212, rough proposal.

reduces unecessary allocation of char[128]
... which leads to needing to refactor approach to file-based versions
...and pulling the back-to-R elements into the cross-case functions.

Thoughts?

eddelbuettel commented 2 months ago

I am not (yet ?) convinced we are making net gains here.

pearsonca commented 2 months ago

In terms of the next-level consolidation, can consider md5:

        md5_context ctx;
        output_length = 16; // produces 16*8 = 128 bits
        unsigned char md5sum[output_length];

        md5_starts( &ctx );
        md5_update( &ctx, txt, nChar);
        md5_finish( &ctx, md5sum );

        return _store_from_char_ptr(md5sum, output_length, leaveRaw);

vs

        md5_context ctx;
        output_length = 16;
        unsigned char md5sum[output_length];

        md5_starts( &ctx );
        FILEHASH(md5_update( &ctx, buf, nChar ));
        md5_finish( &ctx, md5sum );

        return _store_from_char_ptr(md5sum, output_length, leaveRaw);

... the other cases are similarly basically-just-the-same, so some logic on file-vs-not could eliminate the need for the 100+ cases.

pearsonca commented 2 months ago

I am not (yet ?) convinced we are making net gains here.

Fair - I'm going to make some benchmarking checks here, though I know those are imperfect.

(though end of day here, so not for a bit)

pearsonca commented 2 months ago

A little performance testing; summarily, this PR version (digest_alt in this comparison) seems reliably better (only the last comparison bucks trend). Around ~10%+ reduction for my use case (lots of small hashes). I suspect there'd be relatively low benefit for hashing, say, files.

> randomwords <- replicate(1e4, paste0(sample(c(letters, LETTERS, 0:9), 20), collapse = ""))
> # baseline, character, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "md5"))))
   user  system elapsed 
 79.100   0.581  75.368 
> # baseline, character, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "xxh3_64"))))
   user  system elapsed 
 70.354   0.274  70.574 
> # revised, character, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "md5"))))
   user  system elapsed 
 64.904   0.249  65.129 
> 
> # revised, character, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64"))))
   user  system elapsed 
 59.876   0.028  59.875 
> # baseline, raw, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "md5", raw = TRUE))))
   user  system elapsed 
 57.435   0.060  58.310 
> 
> # baseline, raw, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "xxh3_64", raw = TRUE))))
   user  system elapsed 
 55.282   0.008  55.264 
> 
> # revised, raw, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "md5", raw = TRUE))))
   user  system elapsed 
 51.253   0.036  51.269 
> 
> # revised, raw, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64", raw = TRUE))))
   user  system elapsed 
 55.134   0.032  55.149

eddelbuettel commented 2 months ago

Have a look at package microbenchmark. It also has a handy plot (and 'autoplot') option. Easier to eyeball. Multiple runs.

pearsonca commented 2 months ago

> microbenchmark::microbenchmark(
+   md5chabase = sapply(randomwords, \(x) digest(x, algo = "md5")),
+   md5charev = sapply(randomwords, \(x) digest_alt(x, algo = "md5")),
+   md5rawbase = sapply(randomwords, \(x) digest(x, algo = "md5", raw = TRUE)),
+   md5rawrev = sapply(randomwords, \(x) digest_alt(x, algo = "md5", raw = TRUE)),
+   xxhchabase = sapply(randomwords, \(x) digest(x, algo = "xxh3_64")),
+   xxhcharev = sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64")),
+   xxhrawbase = sapply(randomwords, \(x) digest(x, algo = "xxh3_64", raw = TRUE)),
+   xxhrawrev = sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64", raw = TRUE))
+ )
Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
 md5chabase 5.310727 5.707952 7.356002 6.431015 7.722992 20.35156   100
  md5charev 5.360650 5.832965 7.060423 6.500092 7.941240 17.50174   100
 md5rawbase 5.013431 5.558714 6.830248 6.394128 7.326719 19.73746   100
  md5rawrev 5.028035 5.533684 6.436786 6.023105 7.063990 17.99658   100
 xxhchabase 5.129542 5.674817 6.842931 6.321524 7.541191 21.34342   100
  xxhcharev 5.137640 5.662121 6.602645 5.795022 7.066211 16.56205   100
 xxhrawbase 5.013435 5.568786 6.678245 6.351468 7.350908 16.11524   100
  xxhrawrev 5.033589 5.556206 6.769648 6.116699 6.798559 19.34696   100

pearsonca commented 2 months ago

Rplot

eddelbuettel commented 2 months ago

I'd call this a tie.

pearsonca commented 2 months ago

I'd call this a tie.

Still seems to me that there's too much overhead somewhere. Directly used, xxh is much faster. Agree won't get all of that, but want to keep looking for reductions to overhead. Going to put this on ice, and try to find opportunities in the R side.

eddelbuettel commented 2 months ago

Still seems to me that there's too much overhead somewhere.

It's possible. Maybe it is worth also looking at 'the other side' i.e. trying the most minimal, say, xxh use and then seeing what it takes to get it into R via a barebone interface? There are always some costs but also some we can't really avoid.

eddelbuettel commented 2 months ago

I am wondering if we should compare against something like fastdigest (which, if memory serves, mostly won by being vectorized, something we here got later by another clever PR.

Overall, methinks it would be nice to decompose into

overhead of a C(++) function from R, the .Call interface, etc
the cost of the actual hashers
"whatever else we do"

so that we get a handle on may be improved in the third point as the other two are more or less given as is.

pearsonca commented 2 months ago

I did a bit of experimenting on the plane - not sure what precisely, but about 2/3rds the run time goes away when skipping straight to the C without any of the argument munging etc on the R side

(And then algorithm performance differences were a bit more visible).

Can't change the signature, including the error behavior, but maybe some of that can be pushed to the C code.

Maybe alternatively offer some other ways to call in from R, trading flexibility/error checking for speed? Then wrap that with the current digest

pearsonca commented 2 months ago

Poking at the R side performance:

[ ] ~1/4 of call: the match.args + function => switch case approach is 2-3x more expensive that somelist[[algo]] + check with is.null. Recommend: use list lookup + null (more invasive but probably better performance: do the check/conversion/errors in C)
[ ] ~1/4 of the call: (ibid errormode). Recommend pushing the match arg work into the actual error function instead
[ ] ~1/4 of the call: is.infinite(length) => length = -1 conversion. recommend: use length = -1 as the default, & update the documents. However, this is interface breaking, if people are dynamically selecting something vs Inf
(rest of the call is the .Call)

Just fixing those basically gets rid of all the R noise. Then comparing new (this) vs old C implementation looks like this:

Rplot01

(these are bot/top 5% of benchmark times trimmed, so can use the non-log scale)

eddelbuettel commented 2 months ago

I love love love what you are doing here but your writeup is a little on the 'too dense' side (and my mind is currently 'at work' too. Can you expand a little?

Also a rebase seems to be in order as you appear to have conflict in src/digest.c.

pearsonca commented 2 months ago

I'm going to update this PR with the above recommendations so we can both see code changes, then I'll add the benchmarking code here in comments?

I'm vaguely familiar with integrating benchmarking testing into a testthat testing approach - not sure how to do the idiomatic version with tinytest.

eddelbuettel commented 2 months ago

Yes incremental approach one (logical) changeset at a time seems fine. Maybe one per your bullet points above.

Can you show me a detail of timing comparison under testthat? I am only aware of the (seemingly quite involved) automated thing data.table does (using no test runner package). If something simple can be done without added dependencies then Mark (over at tinytest) may be all ears.

pearsonca commented 2 months ago

Yes incremental approach one (logical) changeset at a time seems fine. Maybe one per your bullet points above.

Can you show me a detail of timing comparison under testthat? I am only aware of the (seemingly quite involved) automated thing data.table does (using no test runner package). If something simple can be done without added dependencies then Mark (over at tinytest) may be all ears.

:+1:

re benchmarking: I'm familiar with adding benchmarks to an existing CI setup with e.g. epinowcast - honestly, no clue how testthat / touchstone does its magic, but i think the gist is storage of benchmark timings + recalculation of them for new PRs.

eddelbuettel commented 2 months ago

tinytest is simple and elegant: everything is just a script, and the evaluated predicates are aggregated. No more, no less, no headaches (in my view).

Methinks we are talking about adding benchmarking to CI which, again, is just scripting (albeit via the terrible yaml interface). My personal preference is generally to keep it easy and light and eg my default is often just one run under Linux as I fail to see the point of some of the suggested defaults of wasting cpu resources on testing trivial changes on, say, five R releases and three architectures. But each to their own.

For supplemental analyses I sometimes also fork or copy repos to bang harder at them from another angle (but not on each commit).

pearsonca commented 2 months ago

tinytest is simple and elegant: everything is just a script, and the evaluated predicates are aggregated. No more, no less, no headaches (in my view).

Methinks we are talking about adding benchmarking to CI which, again, is just scripting (albeit via the terrible yaml interface). My personal preference is generally to keep it easy and light and eg my default is often just one run under Linux as I fail to see the point of some of the suggested defaults of wasting cpu resources on testing trivial changes on, say, five R releases and three architectures. But each to their own.

For supplemental analyses I sometimes also fork or copy repos to bang harder at them from another angle (but not on each commit).

Mmm, so I think the tinytest benchmarking notion would then be roughly:

have a way to store benchmark outcomes - some formal structure for file cached in inst/tinytest. aside: a bit wary of the inst-based approach generally, as that's putting unnecessary bits on all users, when they're only pertinent to devs.
use the tinytest predicates to compare gain / loss of benchmark performance - not really an error to be slower, so might need a bit of conceptual / actual tweaking here (I guess ... t-tests on detectably different distributions of performance, with specified certainty, etc?)
incrementally update benchmarks (with improving performance, accepted losses, etc) - but that means every accepted commit has to come w/ changes to the tinytest reference chunk (after deciding to accept it)

Also all of this seems distinctly not portable, since the benchmarking is machine specific. I'm comfortable that large-enough gains on a single platform are going to just be portable as improvements, but there's not really a good way to make the reference results (i.e. not the relative differences between results) portable, as those will also be machine specific.

I think sticking with benchmarking manually (barring finding someone that wants to do the CI overhaul for its own sake) is the right answer.

eddelbuettel commented 2 months ago

Let's may move benchmarking to a different issue. I have some similar thoughts -- we can compare 'current branch' versus the main branch. As one cannot have two different versions of the same package loaded one would indeed have to farm out and then collect compare. That could be generally useful.

One also needs, of course, a set of expressions to run. Maybe one could roxygen2-alike tag those in docs or examples, or in a new subdir.

pearsonca commented 2 months ago

Might ultimately use some of this code, but this PR seems like dead-end now.

eddelbuettel / digest

reduce unnecessary allocations; DRY out file processing #215