eddelbuettel / digest

R package to create compact hash digests of R objects
https://eddelbuettel.github.io/digest

reduce unnecessary allocations; DRY out file processing #215

Closed · pearsonca closed this 2 months ago

pearsonca commented 2 months ago

Per #212, rough proposal.

Thoughts?

eddelbuettel commented 2 months ago

I am not (yet ?) convinced we are making net gains here.

pearsonca commented 2 months ago

In terms of the next-level consolidation, consider md5:

        md5_context ctx;
        output_length = 16; // produces 16*8 = 128 bits
        unsigned char md5sum[output_length];

        md5_starts( &ctx );
        md5_update( &ctx, txt, nChar);
        md5_finish( &ctx, md5sum );

        return _store_from_char_ptr(md5sum, output_length, leaveRaw);

vs

        md5_context ctx;
        output_length = 16;
        unsigned char md5sum[output_length];

        md5_starts( &ctx );
        FILEHASH(md5_update( &ctx, buf, nChar ));
        md5_finish( &ctx, md5sum );

        return _store_from_char_ptr(md5sum, output_length, leaveRaw);

... the other cases are similarly near-identical, so some file-vs-not logic could eliminate the need for the 100+ separate cases.

pearsonca commented 2 months ago

> I am not (yet ?) convinced we are making net gains here.

Fair - I'm going to run some benchmarks here, though I know those are imperfect.

(though end of day here, so not for a bit)

pearsonca commented 2 months ago

A little performance testing; in summary, this PR version (digest_alt in this comparison) seems reliably better (only the last comparison bucks the trend). Roughly a 10%+ reduction for my use case (lots of small hashes). I suspect there'd be relatively little benefit for hashing, say, files.

> randomwords <- replicate(1e4, paste0(sample(c(letters, LETTERS, 0:9), 20), collapse = ""))
> # baseline, character, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "md5"))))
   user  system elapsed 
 79.100   0.581  75.368 
> # baseline, character, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "xxh3_64"))))
   user  system elapsed 
 70.354   0.274  70.574 
> # revised, character, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "md5"))))
   user  system elapsed 
 64.904   0.249  65.129 
> 
> # revised, character, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64"))))
   user  system elapsed 
 59.876   0.028  59.875 
> # baseline, raw, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "md5", raw = TRUE))))
   user  system elapsed 
 57.435   0.060  58.310 
> 
> # baseline, raw, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest(x, algo = "xxh3_64", raw = TRUE))))
   user  system elapsed 
 55.282   0.008  55.264 
> 
> # revised, raw, md5
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "md5", raw = TRUE))))
   user  system elapsed 
 51.253   0.036  51.269 
> 
> # revised, raw, xxh3
> system.time(replicate(100, sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64", raw = TRUE))))
   user  system elapsed 
 55.134   0.032  55.149 
eddelbuettel commented 2 months ago

Have a look at the microbenchmark package. It also has a handy plot (and 'autoplot') option. Easier to eyeball, and it does multiple runs.
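Something along these lines, say (a sketch only; digest_alt standing in for your PR version, randomwords as in your benchmark above):

    library(microbenchmark)
    library(ggplot2)    # ggplot2 provides the autoplot() generic; microbenchmark supplies the method

    mb <- microbenchmark(
        baseline = sapply(randomwords, \(x) digest(x, algo = "md5")),
        revised  = sapply(randomwords, \(x) digest_alt(x, algo = "md5")),
        times = 100
    )
    mb            # tabular summary over the 100 runs
    autoplot(mb)  # distribution plot, easier to eyeball than single timings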

pearsonca commented 2 months ago
> microbenchmark::microbenchmark(
+   md5chabase = sapply(randomwords, \(x) digest(x, algo = "md5")),
+   md5charev = sapply(randomwords, \(x) digest_alt(x, algo = "md5")),
+   md5rawbase = sapply(randomwords, \(x) digest(x, algo = "md5", raw = TRUE)),
+   md5rawrev = sapply(randomwords, \(x) digest_alt(x, algo = "md5", raw = TRUE)),
+   xxhchabase = sapply(randomwords, \(x) digest(x, algo = "xxh3_64")),
+   xxhcharev = sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64")),
+   xxhrawbase = sapply(randomwords, \(x) digest(x, algo = "xxh3_64", raw = TRUE)),
+   xxhrawrev = sapply(randomwords, \(x) digest_alt(x, algo = "xxh3_64", raw = TRUE))
+ )
Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
 md5chabase 5.310727 5.707952 7.356002 6.431015 7.722992 20.35156   100
  md5charev 5.360650 5.832965 7.060423 6.500092 7.941240 17.50174   100
 md5rawbase 5.013431 5.558714 6.830248 6.394128 7.326719 19.73746   100
  md5rawrev 5.028035 5.533684 6.436786 6.023105 7.063990 17.99658   100
 xxhchabase 5.129542 5.674817 6.842931 6.321524 7.541191 21.34342   100
  xxhcharev 5.137640 5.662121 6.602645 5.795022 7.066211 16.56205   100
 xxhrawbase 5.013435 5.568786 6.678245 6.351468 7.350908 16.11524   100
  xxhrawrev 5.033589 5.556206 6.769648 6.116699 6.798559 19.34696   100
pearsonca commented 2 months ago

[Rplot: benchmark comparison plot]

eddelbuettel commented 2 months ago

I'd call this a tie.

pearsonca commented 2 months ago

> I'd call this a tie.

Still seems to me that there's too much overhead somewhere. Used directly, xxh is much faster. Agreed we won't get all of that back, but I want to keep looking for reductions to the overhead. Going to put this on ice, and try to find opportunities on the R side.

eddelbuettel commented 2 months ago

> Still seems to me that there's too much overhead somewhere.

It's possible. Maybe it is also worth looking at 'the other side', i.e. trying the most minimal, say, xxh use and then seeing what it takes to get it into R via a barebones interface? There are always some costs, but also some we can't really avoid.
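Even something as blunt as splitting out the serialization cost could help locate the floor (a sketch; randomwords as above):

    library(microbenchmark)

    # Pre-serialize once so the 'raw_only' branch times just argument handling
    # plus the hash itself; the gap to 'full' is then roughly the serialization cost.
    rawwords <- lapply(randomwords, serialize, connection = NULL)

    microbenchmark(
        full     = sapply(randomwords, \(x) digest(x, algo = "xxh3_64")),
        raw_only = sapply(rawwords,    \(x) digest(x, algo = "xxh3_64", serialize = FALSE)),
        times = 100
    )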

eddelbuettel commented 2 months ago

I am wondering if we should compare against something like fastdigest (which, if memory serves, mostly won by being vectorized, something we here got later via another clever PR).

Overall, methinks it would be nice to decompose this into

so that we get a handle on what may be improved in the third point, as the other two are more or less given as is.

pearsonca commented 2 months ago

I did a bit of experimenting on the plane - not sure of the precise numbers, but about 2/3rds of the run time goes away when skipping straight to the C code without any of the argument munging etc. on the R side.

(And then algorithm performance differences were a bit more visible).

We can't change the signature, including the error behavior, but maybe some of that handling can be pushed to the C code.

Maybe alternatively offer some other ways to call in from R, trading flexibility/error checking for speed? Then wrap that with the current digest().
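Something shaped like the existing vectorised helper, maybe: fix the knobs once, then hash many objects with the per-call R work paid a single time (a sketch from memory, so the getVDigest() details may be off):

    # getVDigest() fixes the algorithm up front and returns a function that is
    # vectorised over its input, so the R-side setup cost is paid once rather
    # than once per element.
    vdigest <- getVDigest(algo = "md5")
    hashes  <- vdigest(randomwords)

    # per-element baseline from the benchmarks above, for comparison
    baseline <- sapply(randomwords, \(x) digest(x, algo = "md5"))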

pearsonca commented 2 months ago

Poking at the R side performance:

Just fixing those basically gets rid of all the R-side noise. Comparing the new (this PR) vs the old C implementation then looks like this:

[Rplot01: comparison of the new vs old C implementation]

(these have the bottom/top 5% of benchmark times trimmed, so the non-log scale can be used)

eddelbuettel commented 2 months ago

I love love love what you are doing here, but your writeup is a little on the 'too dense' side (and my mind is currently 'at work' too). Can you expand a little?

Also, a rebase seems to be in order as you appear to have a conflict in src/digest.c.

pearsonca commented 2 months ago

I'm going to update this PR with the above recommendations so we can both see the code changes, then I'll add the benchmarking code here in comments?

I'm vaguely familiar with integrating benchmarking into a testthat-based testing approach - not sure how to do the idiomatic version with tinytest.

eddelbuettel commented 2 months ago

Yes, an incremental approach, one (logical) changeset at a time, seems fine. Maybe one per your bullet points above.

Can you show me the details of a timing comparison under testthat? I am only aware of the (seemingly quite involved) automated thing data.table does (using no test-runner package). If something simple can be done without added dependencies, then Mark (over at tinytest) may be all ears.

pearsonca commented 2 months ago

> Yes, an incremental approach, one (logical) changeset at a time, seems fine. Maybe one per your bullet points above.
>
> Can you show me the details of a timing comparison under testthat? I am only aware of the (seemingly quite involved) automated thing data.table does (using no test-runner package). If something simple can be done without added dependencies, then Mark (over at tinytest) may be all ears.

:+1:

Re benchmarking: I'm familiar with adding benchmarks to an existing CI setup with e.g. epinowcast - honestly, no clue how testthat / touchstone does its magic, but I think the gist is storing benchmark timings and recalculating them for new PRs.

eddelbuettel commented 2 months ago

tinytest is simple and elegant: everything is just a script, and the evaluated predicates are aggregated. No more, no less, no headaches (in my view).

Methinks we are talking about adding benchmarking to CI which, again, is just scripting (albeit via the terrible yaml interface). My personal preference is generally to keep it easy and light, and e.g. my default is often just one run under Linux, as I fail to see the point of some of the suggested defaults of wasting CPU resources on testing trivial changes on, say, five R releases and three architectures. But each to their own.

For supplemental analyses I sometimes also fork or copy repos to bang harder at them from another angle (but not on each commit).

pearsonca commented 2 months ago

> tinytest is simple and elegant: everything is just a script, and the evaluated predicates are aggregated. No more, no less, no headaches (in my view).
>
> Methinks we are talking about adding benchmarking to CI which, again, is just scripting (albeit via the terrible yaml interface). My personal preference is generally to keep it easy and light, and e.g. my default is often just one run under Linux, as I fail to see the point of some of the suggested defaults of wasting CPU resources on testing trivial changes on, say, five R releases and three architectures. But each to their own.
>
> For supplemental analyses I sometimes also fork or copy repos to bang harder at them from another angle (but not on each commit).

Mmm, so I think the tinytest benchmarking notion would then be roughly:
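Something like this, as a strawman (the file name, threshold, and slack factor are all made up):

    # hypothetical inst/tinytest/bench_timing.R: store a per-machine reference
    # timing on the first run, then flag large regressions on later runs
    if (at_home()) {
        library(digest)
        words   <- replicate(1e3, paste0(sample(c(letters, 0:9), 20, TRUE), collapse = ""))
        elapsed <- system.time(for (w in words) digest(w, algo = "md5"))[["elapsed"]]

        ref_file <- "bench_reference.rds"
        if (file.exists(ref_file)) {
            expect_true(elapsed < 1.5 * readRDS(ref_file))  # allow 50% slack for noise
        } else {
            saveRDS(elapsed, ref_file)
            exit_file("stored new reference timing for this machine")
        }
    }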

Also, all of this seems distinctly non-portable, since the benchmarking is machine-specific. I'm comfortable that large-enough gains on a single platform will carry over as improvements, but there's not really a good way to make the reference timings themselves (as opposed to the relative differences between results) portable, as those will also be machine-specific.

I think sticking with manual benchmarking (barring finding someone who wants to do the CI overhaul for its own sake) is the right answer.

eddelbuettel commented 2 months ago

Let's maybe move benchmarking to a different issue. I have some similar thoughts -- we can compare the 'current branch' versus the main branch. As one cannot have two different versions of the same package loaded, one would indeed have to farm out and then collect and compare. That could be generally useful.

One also needs, of course, a set of expressions to run. Maybe one could tag those roxygen2-style in docs or examples, or keep them in a new subdirectory.
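A rough sketch of the farm-out-and-collect part, say via callr, with made-up library paths holding a main-branch install and a PR-branch install:

    # time the same expression against each installed copy in a fresh R session,
    # then compare the medians here in the parent session
    run_bench <- function(libpath) {
        callr::r(function(lib) {
            library(digest, lib.loc = lib)
            words <- replicate(1e3, paste0(sample(letters, 20, TRUE), collapse = ""))
            res <- microbenchmark::microbenchmark(
                sapply(words, \(x) digest(x, algo = "md5")), times = 50L
            )
            median(res$time) / 1e6    # median in milliseconds
        }, args = list(libpath))
    }

    c(main = run_bench("~/R/lib-main"), branch = run_bench("~/R/lib-branch"))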

pearsonca commented 2 months ago

Might ultimately use some of this code, but this PR seems like a dead end now.