digest refactoring for performance

pearsonca commented 3 weeks ago

While working #210, I noticed opportunities for opportunites for consolidating / refactoring code in digest.c.

In some sense, digest.c is providing a Facade around interacting with the various quirks of the imported hashing libraries. It does some setup / reinterpretation of the input from R, uses a series of cases to handle each of the different algorithms, and then returns a consistent output.

It seems over time those cases have been internally implemented in slightly different ways (and I imagine, the hasher interfaces have changed over time after initial implementation). There is now a fair bit of repetitious code and/or inconsistent naming/etc that could be consolidated. This would make the assorted "Consider X" issues smaller lifts to accomplish, present opportunities for a bit of unit testing on the C side, benchmarking, etc - all the normal benefits of DRY.

I propose the following refactoring objectives:

[x] refactor macros to functions (per discussion in #210)
[ ] eliminate the duplicate code for streaming vs not approaches (e.g. by figuring out how to share the setup / return code, while switching between stream vs blob inputs)
[ ] more clearly distinguish "jobs" of various functions (e.g. I imagine digest would be as a more of traffic manager, with the individual hasher interface logic being modularized out; then adding a new hasher means adding a new consistently-surfaced-function + another case to digest and presto)
[ ] additional linting (e.g. spaces around binary operators)

Thoughts? Additional objectives? I think some of these (and possibly additional targets added via discussion) are entangled, but there's some potential to isolate at least some of these as their own PRs - preferences one way or another?

eddelbuettel commented 3 weeks ago

I am in favour of 1).

I am a little scared / see more downside than upside for 2) and 3) --- things work as they are, we manage to add new hasher so why rock the boat

I have so far refrained from larger-scale "style" only overhauls. I like when git blame refers to original contributions. [ I left your two-spaces indented lines alone too. ] I guess on the margin in favour but let's not get overboard.

Too little to keep you excited?

pearsonca commented 3 weeks ago

Can do a PR for just converting macros => functions, sure.

Then maybe draft something for one of the others, and see how it looks w/o (initially) super polishing it? If it seems promising, great, proceed to polish; if not, can always toss it.

pearsonca commented 2 weeks ago

Porting discussion from closed PRs #215 / #218 here:

any overhead reductions in digest need maintain code readability
better approach may be to introduce no-frills functionality

What could potentially work for that second item:

offering a direct (R) interface for each of the algorithms, with minimal parsing / error checking
refactor digest to do the error handling etc, then dispatch to the individual mode algorithm functions (this would invert the current relationship with e.g. sha1 R exports, so non-trivial work to do)
longer term that would suggest wanting to have more individualized wrappers in the C layer, but that wouldn't need to happen immediately
if pursing that longer term approach: also potentially makes more direct-to-C options available for dependent packages, if more done there to export the C layer

aside: the longer term approach would be advantageous for enabling people to implement their class-specific hashing extensions ala #23 and for potentially having a digestext or similar package, that offers custom class-specific extensions without bloating the core library.

eddelbuettel / digest

digest refactoring for performance #212