cboettig closed this issue 4 years ago.
In my mind, a compressed file A and an uncompressed file B are expected to have different content hashes. However, they are related to each other in that file A was generated by compressing file B.
Also, text file X that uses DOS line endings (\r\n) and text file Y that uses unix line endings (\n) are expected to have different content hashes. And file X can be generated from file Y by inserting a \r before every \n.
Perhaps this highlights the difference between a content hash and a semantic hash. A content hash is format agnostic and only looks at raw bits 'n bytes. A semantic hash generates a unique hash based on the information in a certain dataset. In the semantic hash case, files A, B, X and Y may have different content hashes, but identical semantic hashes, because the files are formatted differently, but carry the exact same information.
In practice, I found that semantic hashes are hard to implement. I think it'd be easier to use content hashes and make the relationships between datasets explicit (e.g., file A was generated by compressing file B).
All in all, this underscores the importance of stores providing the raw data streams/feeds (bits 'n bytes) as opposed to some derived, platform-dependent content stream/feed.
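For example, here is a minimal sketch of that idea (the file names and the relationship wording are hypothetical; hashes via openssl):
library(openssl)
# write the same small table twice: once plain, once gzip-compressed
writeLines(c("year,co2", "1990,354"), "obs.csv")
gz <- gzfile("obs.csv.gz", "w")
writeLines(c("year,co2", "1990,354"), gz)
close(gz)
# plain content hashes: the two files differ, as expected
sha256(readBin("obs.csv", "raw", file.size("obs.csv")))
sha256(readBin("obs.csv.gz", "raw", file.size("obs.csv.gz")))
# the relationship is then recorded separately, e.g. as something like
# "obs.csv.gz wasGeneratedBy compressing obs.csv"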
@cboettig curious to hear your thoughts.
Thanks for the feedback here. Yes, agree with all of this. In particular: yes, having a 'semantic hash' would be great but implementation is unclear, meanwhile I think content hash is much more concrete and so happy to focus on that. I agree compressed/uncompressed are different content, they have different hashes, so no problem there.
I agree DOS/non-DOS are different serializations as well; \r\n is literally different bytes than \n. There's no question of ideology, I'm only worried about nuts & bolts implementation here, e.g. cases where I thought we were talking about the same bits 'n bytes when in fact they were different bits 'n bytes. For instance, here Windows gives a different hash for the file that is right here in this repo. Why is Windows giving that hash? Windows checked out the file from GitHub here. I'm not sure it actually has anything to do with DOS line endings, that was just a guess (since some Windows machines are set to convert line endings on git clone).
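As a quick illustration of the line-ending point (a sketch, unrelated to the repo file itself):
library(openssl)
sha256(charToRaw("x,y\n1,2\n"))      # unix (LF) line endings
sha256(charToRaw("x,y\r\n1,2\r\n"))  # DOS (CRLF) line endings: different bytes, so a different hash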
Also, the examples above -- I'm just trying to wrap my head around the nuts and bolts of which functions & options give me the "raw data stream bits 'n bytes" and which are actually giving me some "derived, platform dependent, content stream/feed". (E.g., on the face of it, I had thought sha256(file(f, open = "rt", raw = TRUE)) or sha256(file(f, open = "r", raw = TRUE)) might be the proper arguments for the "raw bits and bytes", but my experimentation above suggests otherwise.)
Since users never 'see' raw 1s and 0s, it seems the key here is to avoid situations where a user thinks they are talking about the "same" raw data stream, when in fact one user is talking about sha256(file(f, open = "r", raw = TRUE)) instead of sha256(file(f, open = "rb", raw = TRUE)).
So, is there some test or assertion we can run that tells us we are working with 'raw bits and bytes' and not something else?
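One candidate, sketched below (not an existing contenturi helper): treat readBin() on the path as the ground truth for the raw bytes, and assert that whatever connection is about to be hashed reproduces that digest.
library(openssl)
f <- system.file("extdata", "vostok.icecore.co2.gz",
  package = "contenturi", mustWork = TRUE
)
raw_hash <- sha256(readBin(f, what = "raw", n = file.size(f)))
con_hash <- sha256(file(f, open = "rb", raw = TRUE))
# expected to pass if both really are the raw compressed bytes;
# a failure signals some translation happened along the way
stopifnot(identical(as.character(raw_hash), as.character(con_hash)))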
e.g. as a further wrinkle: "opening" the connection also changes the hash of the content stream:
library(openssl)
f <- system.file("extdata", "vostok.icecore.co2.gz",
package = "contenturi", mustWork = TRUE
)
con <- file(f, "", raw = FALSE)
con
#> A connection with
#> description "/usr/local/lib/R/site-library/contenturi/extdata/vostok.icecore.co2.gz"
#> class "gzfile"
#> mode "rt"
#> text "text"
#> opened "closed"
#> can read "yes"
#> can write "yes"
sha256(con)
#> sha256 94:12:32:58:31:da:b2:2a:ee:bd:d6:74:b6:eb:53:ba:6b:7b:dd:04:bb:99:a4:db:b2:1d:df:f6:46:28:7e:37
con <- file(f, "", raw = FALSE)
open(con, open = "", raw = FALSE)
con
#> A connection with
#> description "/usr/local/lib/R/site-library/contenturi/extdata/vostok.icecore.co2.gz"
#> class "gzfile"
#> mode "rt"
#> text "text"
#> opened "opened"
#> can read "yes"
#> can write "no"
sha256(con)
#> sha256 43:0a:20:e2:e9:6b:66:0c:2a:30:b3:d9:f4:23:45:17:20:2c:09:9a:1d:b0:92:b3:c5:df:53:2d:f0:48:a0:9d
Created on 2020-03-02 by the reprex package (v0.3.0)
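One way to probe this (a guess on my part, not something established here): the connection above prints with class "gzfile", which suggests file() is silently decompressing when raw = FALSE. Hashing explicitly decompressed bytes and comparing against the digests above would test that guess:
library(openssl)
f <- system.file("extdata", "vostok.icecore.co2.gz",
  package = "contenturi", mustWork = TRUE
)
con <- gzfile(f, open = "rb")
bytes <- readBin(con, what = "raw", n = 10 * file.size(f))  # generous n; a read loop would be more careful
close(con)
sha256(bytes)  # compare against the two sha256(con) results above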
Cool to see that you can pick up the mangling of bits and bytes by the various R/operating-system combinations in the unit/integration tests. I think it just goes to highlight that bits and bytes are being manipulated in places that you'd expect to be pure pass-through. Am curious to see what the root cause of the manipulation is. A chopped-off EOF? "Smart" decompression? Text encoding?
Me too. How do you create four unique content hashes from the above? Obviously you get two from compression, but I don't see why you can get two more by toggling between "read binary" (rb) and "read text" (rt). I don't have a theory of the case for that yet.
@cboettig "read text" sounds like a text encoding issue. Unless you explicitly specify which text encoding (e.g., UTF-8, ASCII) to use to translate binary into text, the underlying code will choose a default one. For most Linux systems the default is UTF-8; I am not quite sure what the default for modern Windows is.
Also, the sha256 function operating on text would have to translate the text back to binary somehow, posing the same issue.
So, for both the rt mode and sha256 on text, you should have a way to set the text encoding. This setting is usually hidden so as not to annoy most users too much.
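For illustration, a minimal sketch of how the chosen encoding changes the bytes, and therefore the hash (the string and encodings here are arbitrary):
library(openssl)
x <- "café"
sha256(charToRaw(enc2utf8(x)))              # bytes of the UTF-8 encoding
sha256(charToRaw(iconv(x, to = "latin1")))  # bytes of the latin1 encoding: a different hash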
Curious to hear whether the text encoding theory holds up.
Perhaps encoding could create differences when run on Windows vs other platforms, but the four variations shown above are all illustrated on Linux. Here's a Linux example where we also explicitly set the file encoding to UTF-8. It only changes the result in the sha256(file(f, open = "rt", raw = TRUE)) case, where adding UTF-8 gives a different hash but also throws a warning; in the other cases adding encoding = "UTF-8" doesn't change anything:
library(contenturi)
library(openssl)
f <- system.file("extdata", "vostok.icecore.co2.gz",
package = "contenturi", mustWork = TRUE
)
## Default settings, equal to shasum of uncompressed file
## n.b. cannot figure out what mode is equivalent to `open = ""`
## encoding term doesn't matter here
sha256(file(f, "", raw = FALSE))
#> sha256 94:12:32:58:31:da:b2:2a:ee:bd:d6:74:b6:eb:53:ba:6b:7b:dd:04:bb:99:a4:db:b2:1d:df:f6:46:28:7e:37
sha256(file(f, "", encoding = "UTF-8", raw = FALSE))
#> sha256 94:12:32:58:31:da:b2:2a:ee:bd:d6:74:b6:eb:53:ba:6b:7b:dd:04:bb:99:a4:db:b2:1d:df:f6:46:28:7e:37
## matches the 'standard' sha256sum of compressed file
sha256(file(f, "rb", raw = TRUE))
#> sha256 93:62:a6:10:24:37:bf:f5:ea:50:89:88:42:6d:52:74:a8:ad:df:db:11:a6:03:d0:16:a7:b3:05:cf:66:86:8f
sha256(file(f, "rb", encoding = "UTF-8", raw = TRUE))
#> sha256 93:62:a6:10:24:37:bf:f5:ea:50:89:88:42:6d:52:74:a8:ad:df:db:11:a6:03:d0:16:a7:b3:05:cf:66:86:8f
## Encoding does matter with text format
sha256(file(f, open = "rt", raw = TRUE))
#> sha256 14:d0:da:75:29:e9:fd:de:29:31:39:67:ed:32:c2:dc:23:42:ac:26:15:6c:58:51:87:6d:5b:52:fb:31:1e:cb
sha256(file(f, open = "rt", raw = FALSE))
#> sha256 43:0a:20:e2:e9:6b:66:0c:2a:30:b3:d9:f4:23:45:17:20:2c:09:9a:1d:b0:92:b3:c5:df:53:2d:f0:48:a0:9d
sha256(file(f, open = "rt", encoding = "UTF-8", raw = TRUE))
#> Warning in readLines(con, n = 1L, warn = FALSE): invalid input found
#> on input connection '/usr/local/lib/R/site-library/contenturi/extdata/
#> vostok.icecore.co2.gz'
#> sha256 ff:e6:79:bb:83:1c:95:b6:7d:c1:78:19:c6:3c:50:90:d2:21:aa:c6:f4:c7:bf:53:0f:59:4a:b4:3d:21:fa:1e
sha256(file(f, open = "rt", encoding = "UTF-8", raw = FALSE))
#> sha256 43:0a:20:e2:e9:6b:66:0c:2a:30:b3:d9:f4:23:45:17:20:2c:09:9a:1d:b0:92:b3:c5:df:53:2d:f0:48:a0:9d
Created on 2020-03-03 by the reprex package (v0.3.0)
Note these should be easy to reproduce, and the file I/O in R is documented decently; https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/connections might be helpful. We can also set encoding to bytes instead of, say, UTF-8, which of course throws errors if the open mode is rt.
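For what it's worth, a cross-check outside of R (assuming a sha256sum binary is on the PATH) should reproduce the 'standard' compressed-file digest from the rb / raw = TRUE case above:
f <- system.file("extdata", "vostok.icecore.co2.gz",
  package = "contenturi", mustWork = TRUE
)
system2("sha256sum", shQuote(f))  # expect the same digest as sha256(file(f, "rb", raw = TRUE))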
The plot thickens... Good that we are stumbling across this now, and I am eager to find the root cause. Hoping to find some time to reproduce and play around with it this week.
Closed by #22
Functions dealing with "content" should technically work with content as streams rather than files.
This mostly affects the content_uri() function, and also the localstore(). One option is to allow these functions to take either a content stream (a "connection", see ?file) or a local path, converting the path to a stream (this is commonly done by other R functions, e.g. the base R functions read.table etc.). We can also access remote content as a stream, e.g. curl::curl("http://example.com") or, from base R, url("http://example.com").
There is also the related question of what exactly a "content stream" is. For instance, a file() connection in R has several options for how it may be parsed, including raw = TRUE or FALSE, and the open mode (e.g. as text vs binary). For many files, these are the same. But consider a compressed version of our sample data, vostok.icecore.co2.gz.
I'm pretty sure the reason we see differences due to open mode has to do with the fact that the original data file has DOS-style line endings(?). Of course it makes sense that the compressed and uncompressed versions have different hashes, so the first two examples don't bother me, but the second two I don't quite understand.
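A rough sketch of the path-or-connection idea (the helper name and its behavior are hypothetical, not part of the package):
# accept either an existing connection or a local path; paths are opened as
# raw binary streams so no text-mode or decompression translation occurs
as_stream <- function(x) {
  if (inherits(x, "connection")) {
    return(x)
  }
  file(x, open = "rb", raw = TRUE)
}
# usage sketch:
# content_uri(as_stream("vostok.icecore.co2.gz"))
# content_uri(as_stream(url("http://example.com/data.csv")))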