Closed aaronpeikert closed 1 year ago
Is there a way to ensure matching digest and R version? I had thought that renv would take care of this :(
Also, thanks for identifying the source of the problem!
renv
should ensure the correct package version of digest, but I don't know about R version... I would add the versions to the hash as metadata and give a more informative error message in case the match fails.
That will be good for debugging, but it's not really a long term solution :(
I'm thinking about this more, and this can't be right. Checksums of files should be platform/r/package version independent. I've found online that other people experienced variant checksums for R objects due to differences in serialization across platforms - but since we're computing the checksum for an existing file, that should not be the case.
Your right. I forgot that you use file = TRUE
.
I figured it out... its https://docs.github.com/en/get-started/getting-started-with-git/configuring-git-to-handle-line-endings
Linux is using lf line ending (me and GHA) and you hashed with crlf (windows).
sed 's/^M$//' manuscript/manuscript.md > lf.txt
sed 's/$'"/`echo \\\r`/" manuscript/manuscript.md > crlf.txt
digest::digest("lf.txt", file = TRUE) # [1] "4604ee06e4830a6768d2b7f172d87e36"
digest::digest("crlf.txt", file = TRUE) # [1] "be73f2b4037584190eee11f90a1f1040"
https://github.com/aaronpeikert/tmpaction/blob/d6aefbbb3a300482df91367c42372e6587ca2955/.worcs#L6
https://github.com/cjvanlissa/tmpaction/blob/291c27df98eff03e4dffb0803ef88d859f2295f2/.worcs#L6
Is there a way to get a checksum that ignores line endings? :|
I fear it is not possible. Git is deciding to change the line endings for you. You can read the file in before hashing:
digest::digest(paste0(readLines("crlf.txt"), collapse = ""), serialize = FALSE, file = FALSE)
digest::digest(paste0(readLines("lf.txt"), collapse = ""), serialize = FALSE, file = FALSE)
But this may be not ideal for binary files and surely is inefficient for big files.
I think we will have to write a function that computes checksums while ignoring line endings. Here's a thread where people went through the same drama: https://github.com/flyway/flyway/issues/253
I will just write a function that does this stupidly (just removing all line endings), but we should give some thought to efficiency as you're pointing out.
It seems to be possible to get information about which files have EOL handling by git. Not yet sure if/how we can use this information. Additionally, readLines gives a line about nul in string if we use the digest code you gave above. Maybe we can use either of those two sources of information to choose an appropriate way to compute the checksum.
gitfiles <- system2("git", '-C "pathtofile" ls-files --eol', stdout = TRUE)
gitfiles <- gitfiles[grepl("/lf", gitfiles, fixed = TRUE)|grepl("/crlf", gitfiles, fixed = TRUE)]
gitfiles <- gsub("^.+attr/.+?\\t", "", gitfiles)
This seems to work: https://github.com/cjvanlissa/tmpaction/actions/runs/5672846656
Could we discuss this during our meeting too?
If binary (no changed line endings), then use digest(file = TRUE) If not binary, then use digest(readLines)
Give warning if readlines is used?
I think we should generally issue a warning when running the checks and the R version does not match.
@cjvanlissa observed this failure:
https://github.com/cjvanlissa/tmpaction/actions/runs/5625928320/job/15245706260
@aaronpeikert updated the hash and got a green checkmark:
https://github.com/aaronpeikert/tmpaction/actions/runs/5643567293
This leads me to believe that the initial hash was calculated with a different version of the
digest
- package or R version. This package does not guarantee the same hash across versions of itself and R versions for the same object, mainly due to differences in how R is serializing objects (seeing it as an implementation detail that may change without notice).