Closed by wlandau 6 months ago
Interesting. I suspect that this is related to the long (and, for its OP, frustrating) discussion in #200. At the end of the day, digest is a fairly straightforward 'collector' and 'dispatcher' of a given serialization string for a chosen hashing or "digesting" algorithm (among a moderately large and complete selection of such algorithms).
So if and when you are in situations where the raw bytes from serialize() differ, as I suspect they do here, there is not much we can do apart from pointing upstream ...
Recall that in different locales we may indeed be handed different strings from R as a result of our own choices, so seeing a difference strikes me as quite plausible.
$ Rscript -e 'cat(serialize(1L, NULL, TRUE, version=3), "\n")'
41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 35 0a 55 54 46 2d 38 0a 31 33 0a 31 0a 31 0a
$ LANG="C" Rscript -e 'cat(serialize(1L, NULL, TRUE, version=3), "\n")'
41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 31 34 0a 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 0a 31 33 0a 31 0a 31 0a
$
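To make the two dumps above easier to compare, here is a small base R sketch that decodes them back into the newline-separated fields of the ASCII serialization format. The helper name decode_fields() is made up for illustration, and the field labels in the comments reflect my reading of the format rather than anything stated in this thread.

```r
# Hex dumps copied verbatim from the two Rscript runs above.
utf8_run <- paste(
  "41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a",
  "35 0a 55 54 46 2d 38 0a 31 33 0a 31 0a 31 0a"
)
c_run <- paste(
  "41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a",
  "31 34 0a 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 0a 31 33 0a 31 0a 31 0a"
)

# Hypothetical helper: hex dump -> character vector, one element per field.
decode_fields <- function(hex) {
  bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))
  strsplit(rawToChar(bytes), "\n", fixed = TRUE)[[1]]
}

utf8_fields <- decode_fields(utf8_run)
c_fields <- decode_fields(c_run)

# Assumed field order: format marker, serialization version, writer R version,
# minimal reader R version, encoding name length, encoding name, then the
# object itself (type flags, length, value).
print(utf8_fields)
print(c_fields)

# Positions at which the two runs disagree: the encoding length and name only.
print(which(utf8_fields != c_fields))
```

The object part (the last three fields) is the same in both runs; only the recorded encoding differs.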
If you use version=2 the issue seems to go away, as you say.
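A quick way to see why version 2 sidesteps the problem is to compare the fields of both versions within one session. This is a sketch using only base R; the helper fields() is a made-up name, and it assumes an R version that supports serialization version 3 (R 3.5.0 or later).

```r
# Hypothetical helper: serialize 1L in ASCII format and split into fields.
fields <- function(version) {
  bytes <- serialize(1L, NULL, ascii = TRUE, version = version)
  strsplit(rawToChar(bytes), "\n", fixed = TRUE)[[1]]
}

v2 <- fields(2)
v3 <- fields(3)

print(v2)
print(v3)

# Version 3 has two extra header fields (encoding name length and name).
print(length(v3) - length(v2))

# The trailing fields describing the object itself are the same.
print(identical(tail(v2, 3), tail(v3, 3)))
```

Since version 2 does not record the native encoding at all, there is nothing locale-dependent in its header for this object.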
Thanks for explaining.
We could possibly set up a more 'discerning digest' that strips what it can. Might be worth discussing. I of course see why serialize() does what it does, and I see that as a good default for digest() (after all, it should be a digest of what R thinks of an object, and differ when R thinks the object differs), but there may be cases where we want just a subset.
But I don't right now see a way to strip something like LANG without doing callr gymnastics, which may be a bridge too far. Any thoughts or ideas from your end?
PS One added complication is that some environment variables that govern the process are hard / impossible to alter once the process (for us: the R session) is running. Hm.
> But I don't right now see a way to strip something like LANG without doing callr gymnastics which may be a bridge too far. Any thoughts or ideas from your end?
I'm afraid I don't understand enough about what's happening at this depth, but in secretbase, @shikokuchuo apparently found a way to robustly remove headers without relying on a fixed number of bytes. Cf. https://github.com/shikokuchuo/secretbase/pull/5#issuecomment-1961736761, https://github.com/shikokuchuo/secretbase/pull/5#issuecomment-1962437554
Yes, customizing how we consume what comes from serialize() would be one way. Possibly not the lowest-risk approach, but possibly also the only one.
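As a rough illustration of what such custom consumption could look like, the sketch below computes the header length from the serialized bytes themselves instead of assuming a fixed count. It is not the secretbase implementation and not anything digest does today. The helper names are made up, and the layout it relies on (a 2-byte "X\n" marker, three 4-byte big-endian integers, then for version 3 a 4-byte encoding name length followed by the name) is my understanding of R's binary XDR format.

```r
# Hypothetical helper: number of header bytes in a binary XDR serialization.
serialize_header_length <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) >= 14L,
            identical(bytes[1:2], charToRaw("X\n")))
  # Read one 4-byte big-endian integer starting at position `from`.
  read_int <- function(from) {
    readBin(bytes[from:(from + 3L)], what = "integer", size = 4L, endian = "big")
  }
  version <- read_int(3L)
  if (version < 3L) {
    return(14L)            # marker + version + writer + minimal reader version
  }
  18L + read_int(15L)      # plus encoding length field and the encoding name
}

# Hypothetical helper: drop the header, keep only the object payload.
strip_header <- function(bytes) {
  bytes[-seq_len(serialize_header_length(bytes))]
}

# A plain numeric vector; ALTREP objects such as 1:3 are written differently
# by version 3, so they are avoided here.
x <- c(2.5, 7, 1)
payload_v2 <- strip_header(serialize(x, NULL, version = 2))
payload_v3 <- strip_header(serialize(x, NULL, version = 3))

print(serialize_header_length(serialize(x, NULL, version = 3)))
print(identical(payload_v2, payload_v3))
```

With a 5-character encoding name such as UTF-8, the computed length is 18 + 5 = 23, which lines up with the skip = 23 reported in this thread. In a locale with a longer encoding name the header is longer, which is why a fixed skip cannot work everywhere.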
Note that I have already 'borrowed' a snapshot of the serialization API in RApiSerialize() (for use in Redis and others), so that may be a way too. But I won't have time to dig there anytime soon.
With digest version 0.6.34 and serialization version 3, @shikokuchuo observed that hashes of the same object may differ across locales.
@shikokuchuo figured out that the extra headers from serialization V3 are hashed along with the contents of the object. (So the hashes agree if serialization = 2 or skip = 23.)
It would be nice to understand this choice for the default skip = "auto". I am not sure if this is really a digest issue because of how it relates to serialize().