cjvanlissa / worcs

Rstudio project template and convenience functions for the Workflow for Open Reproducible Code in Science (WORCS)
https://cjvanlissa.github.io/worcs/
GNU General Public License v3.0
78 stars 11 forks source link

either `check_endpoints` or `snapshot_endpoints()` or both must check for matching `digest` and R version #131

Closed aaronpeikert closed 1 year ago

aaronpeikert commented 1 year ago

@cjvanlissa observed this failure:

https://github.com/cjvanlissa/tmpaction/actions/runs/5625928320/job/15245706260

@aaronpeikert updated the hash and got a green checkmark:

https://github.com/aaronpeikert/tmpaction/actions/runs/5643567293

This leads me to believe that the initial hash was calculated with a different version of the digest- package or R version. This package does not guarantee the same hash across versions of itself and R versions for the same object, mainly due to differences in how R is serializing objects (seeing it as an implementation detail that may change without notice).

cjvanlissa commented 1 year ago

Is there a way to ensure matching digest and R version? I had thought that renv would take care of this :(

cjvanlissa commented 1 year ago

Also, thanks for identifying the source of the problem!

aaronpeikert commented 1 year ago

renv should ensure the correct package version of digest, but I don't know about R version... I would add the versions to the hash as metadata and give a more informative error message in case the match fails.

cjvanlissa commented 1 year ago

That will be good for debugging, but it's not really a long term solution :(

cjvanlissa commented 1 year ago

I'm thinking about this more, and this can't be right. Checksums of files should be platform/r/package version independent. I've found online that other people experienced variant checksums for R objects due to differences in serialization across platforms - but since we're computing the checksum for an existing file, that should not be the case.

aaronpeikert commented 1 year ago

Your right. I forgot that you use file = TRUE. I figured it out... its https://docs.github.com/en/get-started/getting-started-with-git/configuring-git-to-handle-line-endings

Linux is using lf line ending (me and GHA) and you hashed with crlf (windows).

sed 's/^M$//' manuscript/manuscript.md > lf.txt
sed 's/$'"/`echo \\\r`/" manuscript/manuscript.md > crlf.txt
digest::digest("lf.txt", file = TRUE) # [1] "4604ee06e4830a6768d2b7f172d87e36"
digest::digest("crlf.txt", file = TRUE) # [1] "be73f2b4037584190eee11f90a1f1040"

https://github.com/aaronpeikert/tmpaction/blob/d6aefbbb3a300482df91367c42372e6587ca2955/.worcs#L6

https://github.com/cjvanlissa/tmpaction/blob/291c27df98eff03e4dffb0803ef88d859f2295f2/.worcs#L6

cjvanlissa commented 1 year ago

Is there a way to get a checksum that ignores line endings? :|

aaronpeikert commented 1 year ago

I fear it is not possible. Git is deciding to change the line endings for you. You can read the file in before hashing:

digest::digest(paste0(readLines("crlf.txt"), collapse = ""), serialize = FALSE, file = FALSE)
digest::digest(paste0(readLines("lf.txt"), collapse = ""), serialize = FALSE, file = FALSE)

But this may be not ideal for binary files and surely is inefficient for big files.

cjvanlissa commented 1 year ago

I think we will have to write a function that computes checksums while ignoring line endings. Here's a thread where people went through the same drama: https://github.com/flyway/flyway/issues/253

I will just write a function that does this stupidly (just removing all line endings), but we should give some thought to efficiency as you're pointing out.

cjvanlissa commented 1 year ago

It seems to be possible to get information about which files have EOL handling by git. Not yet sure if/how we can use this information. Additionally, readLines gives a line about nul in string if we use the digest code you gave above. Maybe we can use either of those two sources of information to choose an appropriate way to compute the checksum.

gitfiles <- system2("git", '-C "pathtofile" ls-files --eol', stdout = TRUE)
gitfiles <- gitfiles[grepl("/lf", gitfiles, fixed = TRUE)|grepl("/crlf", gitfiles, fixed = TRUE)]
gitfiles <- gsub("^.+attr/.+?\\t", "", gitfiles)
cjvanlissa commented 1 year ago

This seems to work: https://github.com/cjvanlissa/tmpaction/actions/runs/5672846656

Could we discuss this during our meeting too?

cjvanlissa commented 1 year ago

If binary (no changed line endings), then use digest(file = TRUE) If not binary, then use digest(readLines)

Give warning if readlines is used?

aaronpeikert commented 1 year ago

I think we should generally issue a warning when running the checks and the R version does not match.