digest.R optimization 1: hash algo lookup

eddelbuettel / digest

R package to create compact hash digests of R objects

https://eddelbuettel.github.io/digest

111 stars 47 forks source link

digest.R optimization 1: hash algo lookup #218

Closed pearsonca closed 2 months ago

pearsonca commented 2 months ago

Extracted from #215

pearsonca commented 2 months ago

profvis::profvis(replicate(1000, digest("8675309", algo = "md5", serialize = FALSE, raw = TRUE)))

indicated 4 hot spots: (1) converting algo, (2) converting errormode, (3) converting length, and (4) running the algorithm. This addresses (1)

pearsonca commented 2 months ago

(also, deferring the ChangeLog update to (3)?)

eddelbuettel commented 2 months ago

Let's please get back down to the convention that every PR has a ChangeLog entry. Thank you.

pearsonca commented 2 months ago

Alright, having sorted out some what's installed where, what session are we in issues:

randomwords <- sample(c(LETTERS, letters, 0:9), 1e6, replace = TRUE)
randomword <- paste0(randomwords, collapse = "")
profvis::profvis(lapply(randomwords, \(w) digest(w, algo = "md5", serialize = FALSE, raw = TRUE)))
profvis::profvis(replicate(1e3, digest(randomword, algo = "md5", serialize = FALSE, raw = TRUE)))

The second one (a lot of really little hashes) is interesting (and this is with this revision; prior to this, algo-related lines totalled up to ~what's seen here for errormode) Screenshot from 2024-08-26 09-47-41

The first one (a few big hashes), not really. the line translating errormode does actually show up, but corresponds to <1% of the sampled run time.

eddelbuettel commented 2 months ago

Sorry, been tied up in other things. Now, I may need to mull this over a little bit more but I have the feeling that we are making 'mostly light and no heat here'. I am not convinced yet we need a change for performance: I can live with time spent on match.arg() in a perfectly valid and common code idiom. Also, as noticed, at least one of the suggested changes reduces code clarity in my eyes.

pearsonca commented 2 months ago

Sorry, been tied up in other things. Now, I may need to mull this over a little bit more but I have the feeling that we are making 'mostly light and no hear here'. I am not convinced yet we need a change for performance: I can live with time spent on match.arg() in a perfectly valid and common code idiom. Also, as noticed, at least one of the suggested changes reduces code clarity in my eyes.

I'll standby for further discussion then. From my end, for {digest} to work for my case it needs to be much faster on the small-hashes (and that seems to require mostly work on the R overhead), but there are other approaches I can use (e.g. just wrapping xxh in that package).

BREAK

I do think the errormode version of solving the performance problem could just be pushing the match.arg into the error checking function, without loss of code clarity. I can do that as a PR, if you like.

eddelbuettel commented 2 months ago

and that seems to require mostly work on the R overhead

I said that earlier in passing: maybe it is worth trying an alternate, no-frills accessor.

How the vectorised access was added was quite nice, maybe we could think of something similar here without requiring baby and bathwater approaches.

I do think the errormode version of solving the performance problem could just be pushing the match.arg into the error checking function, without loss of code clarity. I can do that as a PR, if you like.

That has merit too. We would then not spend those microseconds checking an argument we do not use.

In short, let's try to refocus on a few, clear, simple, unambiguous improvements.

Oh and another approach we could consider (and something I now know better than when this started) is to export C-level accessors directly to other packages. Even less overhead.

pearsonca commented 2 months ago

Alright, initiated an trim to the errormode checking, since that's just-better.

Re no frills: how about an xxh(...) function along the lines of the sha() functionality? The potential problem there is that all the sha1(...) functions ultimately pass to digest, and this would be a different approach.