eddelbuettel / digest

R package to create compact hash digests of R objects
https://eddelbuettel.github.io/digest

Understanding `skip = "auto"` with `serializeVersion = 3` #201

Closed wlandau closed 6 months ago

wlandau commented 6 months ago

With digest version 0.6.34 and serialization version 3, @shikokuchuo observed that hashes on the same object may differ for different locales:

$ LANG="C" R -q -e 'digest::digest(NULL, serializeVersion = 3, skip = "auto")'
> digest::digest(NULL, serializeVersion = 3, skip = "auto")
[1] "bdef078af943dd2546be047d2044d8b5"

$ R -q -e 'digest::digest(NULL, serializeVersion = 3, skip = "auto")'
> digest::digest(NULL, serializeVersion = 3, skip = "auto")
[1] "a611bfa70eb5dcc0a248ed0369794237"

@shikokuchuo figured out that the extra header fields written by serialization version 3, which record the native encoding, are hashed along with the contents of the object. (So the hashes agree if serializeVersion = 2 or skip = 23.)

It would be nice to understand this choice for the default skip = "auto". I am not sure if this is really a digest issue because of how it relates to serialize().
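For reference, the value 23 can be recovered from the layout of the binary (XDR) serialization header. The sketch below assumes the layout used by current R versions: 2 format bytes, three 4-byte integers (format version, writing R version, minimal reader version) and, for version 3 only, a 4-byte length followed by the native encoding name. `header_length()` is a hypothetical helper, not part of digest.

```r
# Hypothetical helper: number of header bytes in binary serialize() output.
# Assumes the default XDR format (ascii = FALSE).
header_length <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) >= 14L)
  stopifnot(identical(bytes[1:2], charToRaw("X\n")))
  # Bytes 3-6: serialization format version, big-endian 32-bit integer.
  version <- readBin(bytes[3:6], "integer", size = 4L, endian = "big")
  if (version < 3L) {
    return(14L)
  }
  # Version 3 appends the length of the native encoding name (bytes 15-18)
  # and then the name itself.
  enc_len <- readBin(bytes[15:18], "integer", size = 4L, endian = "big")
  18L + enc_len
}

header_length(serialize(NULL, NULL, version = 2))  # 14
header_length(serialize(NULL, NULL, version = 3))  # 18 + nchar(encoding name)
```

With the 5-character name "UTF-8" this gives 18 + 5 = 23, which matches the skip = 23 observation. A longer encoding name gives a longer header, so a fixed skip cannot cover every locale.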

eddelbuettel commented 6 months ago

Interesting. I suspect that this is related to the long (and for its OP, frustrating) discussion in #200. At the end of the day, digest is a fairly straightforward 'collector' and 'dispatcher' of a given serialization string for a chosen hashing and "digesting" algorithm (among a moderately large and complete selection of such algorithms).

So if and when you are in situations where the raw bytes from serialize() differ, as I suspect they do here, there is not much we can do apart from pointing upstream ...

Recall that in different locales we may indeed be handed different strings from R, as a result of our own choices, so seeing a difference strikes me as quite plausible.

$ Rscript -e 'cat(serialize(1L, NULL, TRUE, version=3), "\n")'
41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 35 0a 55 54 46 2d 38 0a 31 33 0a 31 0a 31 0a 
$ LANG="C" Rscript -e 'cat(serialize(1L, NULL, TRUE, version=3), "\n")'
41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 31 34 0a 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 0a 31 33 0a 31 0a 31 0a 
$ 

If you use version=2 the issue seems to go away as you say.
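To make the difference between the two dumps explicit, the ASCII output can be decoded back into its newline-separated fields. This is a small sketch in base R using only the bytes pasted above; `decode_fields()` is a hypothetical helper name chosen for illustration.

```r
# Hex dumps copied from the two Rscript calls above.
hex_utf8 <- paste(
  "41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a",
  "35 0a 55 54 46 2d 38 0a 31 33 0a 31 0a 31 0a"
)
hex_c <- paste(
  "41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a",
  "31 34 0a 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 0a 31 33 0a 31 0a 31 0a"
)

# Hypothetical helper: turn a hex dump into the fields of the ASCII format.
decode_fields <- function(hex) {
  bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1L]], base = 16L))
  strsplit(rawToChar(bytes), "\n", fixed = TRUE)[[1L]]
}

f_utf8 <- decode_fields(hex_utf8)
f_c <- decode_fields(hex_c)

# Positions at which the two serializations disagree.
which(f_utf8 != f_c)
rbind(utf8 = f_utf8, c_locale = f_c)

# The writer version field packs major, minor and patch into one integer.
v <- as.integer(f_utf8[3])
c(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L)
```

Only fields 5 and 6 differ: the length and the name of the native encoding ("UTF-8" versus "ANSI_X3.4-1968"). The object itself (type 13, length 1, value 1) is serialized identically in both locales.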

wlandau commented 6 months ago

Thanks for explaining.

eddelbuettel commented 6 months ago

We could possibly set up a more 'discerning digest' that strips what it can. Might be worth discussing. I of course see why serialize() does what it does and see that as a good default for digest() -- after all it should be a digest of what R thinks of an object, and differ when R thinks the object differs -- but there may be cases where we want just a subset.

But I don't right now see a way to strip something like LANG without doing callr gymnastics which may be a bridge too far. Any thoughts or ideas from your end?

PS One added complication is that some environment variables that govern the process are hard / impossible to alter once the process (for us: the R session) is running. Hm.

wlandau commented 6 months ago

> But I don't right now see a way to strip something like LANG without doing callr gymnastics which may be a bridge too far. Any thoughts or ideas from your end?

I'm afraid I don't understand enough about what's happening at this depth, but in secretbase, @shikokuchuo apparently found a way to robustly remove headers without relying on a fixed number of bytes. Cf. https://github.com/shikokuchuo/secretbase/pull/5#issuecomment-1961736761, https://github.com/shikokuchuo/secretbase/pull/5#issuecomment-1962437554

eddelbuettel commented 6 months ago

Yes, customizing consumption of what comes from serialize() would be one way. Possibly not the lowest-risk approach, but possibly also the only one.
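As a rough illustration of what "customizing consumption" could look like at the R level, one could drop the variable-length header and hash only what follows. This is a sketch under assumptions, not how digest or secretbase actually do it: it assumes binary (XDR) version 3 output, whose header is 18 bytes plus the encoding name, and it assumes digest() accepts a raw vector when serialize = FALSE.

```r
# Serialize with version 3; the header length depends on the locale.
bytes <- serialize(NULL, NULL, version = 3)

# Bytes 15-18 hold the length of the native encoding name (big-endian).
enc_len <- readBin(bytes[15:18], "integer", size = 4L, endian = "big")

# Drop the 18 fixed bytes plus the encoding name, keeping only the payload.
payload <- bytes[-seq_len(18L + enc_len)]
print(payload)

# Hash the payload directly, if digest is installed.
if (requireNamespace("digest", quietly = TRUE)) {
  print(digest::digest(payload, algo = "md5", serialize = FALSE))
}
```

Because the encoding field is removed before hashing, the result should no longer depend on LANG, at the cost of re-implementing knowledge about the header layout outside of R itself.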

Note that I have already 'borrowed' a snapshot of the serialization API in the RApiSerialize package (for use with Redis and others), so that may be a way too. But I won't have time to dig there anytime soon.