eddelbuettel / digest

R package to create compact hash digests of R objects
https://eddelbuettel.github.io/digest
111 stars 47 forks source link

convert digest macros to function #214

Closed pearsonca closed 3 weeks ago

pearsonca commented 3 weeks ago

Per #212 - converts _store... macros to functions.

pearsonca commented 3 weeks ago

This can just be done, but I also think there's a way here to pull lines 678-685 into the _store...

There's also a bit of benchmarking to be done to see if the SHA512 version of conversion to string output is faster than the snprintf version. Either way, the SHA512 output can start using the shared function - question is whether the shared function should do that version or not.

eddelbuettel commented 3 weeks ago

Overall this looks pretty good, and I like that you are incremental. Maybe we can tag pulling 678-685 in for another round to make things compartmentalized?

I may turn the reverse depdendency machine on, which I haven't for the previous PR yet, just to see where we are. Off for a morning run now, will look more later. Thanks for continuing to kick the can down the road!

pearsonca commented 3 weeks ago

There's also a bit of benchmarking to be done to see if the SHA512 version of conversion to string output is faster than the snprintf version. Either way, the SHA512 output can start using the shared function - question is whether the shared function should do that version or not.

Have poked at this a bit - confirm "classic" vs sha512 conversion to string is identical.

Hashing 10K random alphanumeric "words" (10 alphanumbers each), 100 times: md5 is mean 0.17834 (0.02043231 sd) with classic, 0.16378 (0.01895324 sd) with sha512 conversion, or ~8% reduction, but not a clear difference (due to uncertainty).

Same setup, but with xxhash32 (where the hashing bit should be faster): 0.15089 (0.01595315 sd) with classic vs 0.15346 (0.01753347 sd) with sha512 - so ~2% increase, but again unclear difference.

I have a general preference for do-things-the-same-way. Vaguely inclined to believe the SHA512 is version is actually faster, so consolidate to that version?

pearsonca commented 3 weeks ago

Aside: there seems to be almost no difference between the algoritm choices in terms of effective time as used in R. I suspect hash time would be more apparent hashing big inputs, with better amortization of the shared-R bit, but my particular use case is many-small-inputs.

Might be a case for some S3 specializations?

eddelbuettel commented 3 weeks ago

It's still early morning so I still do quite understand what you are actually after. I guess a more generic 'make me a string' function borrowing from what sha512 does?

In general I am also in favour of consistent approaches especially on our integration side, and extreme care about meddling with upstream code.

pearsonca commented 3 weeks ago

It's still early morning so I still do quite understand what you are actually after. In general I am also in favour of consistent approaches especially on our integration side, and extreme care about meddling with upstream code.

Sure; the commented section is the approach I extracted from several of case branches, but SHA512 case had a fancier approach (which is uncommented below). The commented and uncommented versions yield identical output.

void _store_from_char_ptr(
    const unsigned char * hash, unsigned char * const output,
    const size_t output_length, const int leaveRaw
) {
    if (leaveRaw) {
        memcpy(output, hash, output_length);
    } else {
        unsigned char *outputp = output;
        unsigned const char *d = hash;
        for (int j = 0; j < output_length; j++) {
//        snprintf(output + j * 2, 3, "%02x", hash[j]);
            *outputp++ = sha2_hex_digits[(*d & 0xf0) >> 4];
            *outputp++ = sha2_hex_digits[*d & 0x0f];
            d++;
        }
        *outputp = (char)0;
    }
}
eddelbuettel commented 3 weeks ago

In which case I prefer more transparent / more commonly understood code. snprintf() wins here if the performance is a tie.

(Benchmarking is a black hole, of course, as this will vary between compilers, standard library implementations, etc.)

eddelbuettel commented 3 weeks ago

Oh, we forgot a ChangeLog entry. Oh well. Next PR or just stick one here, I have a pending commit with two more ORCIDs anyway...

pearsonca commented 3 weeks ago

mmmm, something like: src/digest.c: switch output handling from macro to function; consolidate stringification loop approaches

eddelbuettel commented 3 weeks ago

Done in 0ca2463