Closed mitchellmanware closed 3 weeks ago
@mitchellmanware That may be a desired effect. That suggests the hash is relative to where it is (folder or machine). For something like a targets pipeline, we'll have to think about whether that is what we want.
@mitchellmanware I just noticed that the hash is also the same for different branches. Perhaps this is what you meant initially.
@kyle-messier This is what I was referring to - I was getting the same hash for different parameters, and even different datasets. Working on this now.
> download_hash(dir = "../data/air.sfc", hash = TRUE) ==
+ download_hash(dir = "../data/weasd", hash = TRUE)
[1] TRUE
One option is to hash file names relative to directory_to_save (full file paths will obviously differ between locations) together with file size in bytes. I found that including all hashable metadata always produced different values, because items such as user and creation time differ between downloads, so this version only hashes file name and file size.
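As a rough illustration of the name-and-size idea described above, here is a minimal base R sketch. The helper name `hash_names_sizes` is hypothetical and not part of amadeus, and it uses MD5 via `tools::md5sum()` only because that function ships with R; it is not the SHA-1 approach discussed in this thread.

```r
# Hypothetical helper: hash relative file names plus sizes in bytes.
hash_names_sizes <- function(dir) {
  # Relative paths, sorted in C-locale order, so the parent folder and the
  # session locale do not influence the result.
  files <- sort(
    list.files(dir, recursive = TRUE, full.names = FALSE, all.files = TRUE),
    method = "radix"
  )
  # Format sizes as plain integers (avoids "1e+05"-style output).
  sizes <- sprintf("%.0f", file.size(file.path(dir, files)))
  manifest <- paste(files, sizes, sep = "\t")

  # Write the manifest in binary mode so line endings match on every OS,
  # then hash the manifest file.
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  con <- file(tmp, open = "wb")
  writeLines(manifest, con, sep = "\n")
  close(con)
  unname(tools::md5sum(tmp))
}

# Usage: same file names and sizes in two folders, a different name in a third.
a <- tempfile("first")
b <- tempfile("second")
d <- tempfile("third")
for (p in c(a, b, d)) dir.create(p)
writeBin(charToRaw("abc"), file.path(a, "soilm.nc"))
writeBin(charToRaw("abc"), file.path(b, "soilm.nc"))
writeBin(charToRaw("abc"), file.path(d, "weasd.nc"))
hash_names_sizes(a) == hash_names_sizes(b) # TRUE: folder does not matter
hash_names_sizes(a) == hash_names_sizes(d) # FALSE: file name differs
```

One trade-off of hashing only names and sizes is that a file whose contents change without changing its size would produce the same hash.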
#' Create hash of downloaded files.
#' @description
#' Create a combined SHA-1 hash based on the contents and sizes of files
#' in a specified directory. System-specific metadata (e.g. full file paths,
#' access times, or user information) are not tracked, ensuring the hash
#' remains consistent across different systems, users, and access times.
#' @param hash logical(1). Create hash of downloaded files.
#' @param dir character(1). Directory path.
#' @return character(1) Combined SHA-1 hash of the files' contents and sizes.
#' @keywords internal auxiliary
#' @importFrom rlang hash_file
#' @export
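download_hash <- function(
  hash = TRUE,
  dir = NULL
) {
  if (hash) {
    # Shell pipeline; relies on `find`, `sort -z`, `xargs`, `sha1sum`,
    # `awk`, and GNU `stat` (the `-c` flag) being available.
    h_command <- paste0(
      # 1. SHA-1 of each file's contents, in sorted path order, paths dropped.
      "(find ",
      shQuote(dir),
      " -type f -print0 | sort -z | ",
      "xargs -0 sha1sum -- | awk '{print $1}'; ",
      # 2. Size in bytes of each file, in the same order.
      "find ",
      shQuote(dir),
      " -type f -print0 | sort -z | ",
      # 3. Hash the combined digests and sizes into a single SHA-1.
      "xargs -0 stat -c '%s') | sha1sum"
    )
    h <- system(h_command, intern = TRUE)
    # `sha1sum` reading from stdin prints "<digest>  -"; strip the marker
    # together with the whitespace in front of it.
    h_clean <- sub("\\s+-$", "", h)
    return(h_clean)
  }
  # When `hash = FALSE`, nothing is computed and NULL is returned invisibly.
}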
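The pipeline above depends on external tools, and `stat -c` is specific to GNU coreutils, so it may not run everywhere. As a hedged alternative, here is a minimal base R sketch that combines per-file content digests and sizes without shelling out. The helper name `hash_dir_contents` is hypothetical (not part of amadeus), and it swaps SHA-1 for MD5 because `tools::md5sum()` ships with R, so its values will not match those produced by `download_hash()`.

```r
# Hypothetical helper: combined digest of file contents and sizes.
hash_dir_contents <- function(dir) {
  # Relative paths in C-locale order; all.files = TRUE also picks up
  # hidden files, similar to `find -type f`.
  files <- sort(
    list.files(dir, recursive = TRUE, full.names = FALSE, all.files = TRUE),
    method = "radix"
  )
  paths <- file.path(dir, files)

  # Per-file content digests (MD5 here, not SHA-1) with the paths dropped.
  digests <- unname(tools::md5sum(paths))
  # Per-file sizes in bytes, formatted as plain integers.
  sizes <- sprintf("%.0f", file.size(paths))

  # Write digests then sizes to a temporary file in binary mode and hash it.
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  con <- file(tmp, open = "wb")
  writeLines(c(digests, sizes), con, sep = "\n")
  close(con)
  unname(tools::md5sum(tmp))
}

# Usage: identical contents in two folders, different contents in a third.
first <- tempfile("first")
second <- tempfile("second")
third <- tempfile("third")
for (p in c(first, second, third)) dir.create(p)
writeBin(charToRaw("soilm"), file.path(first, "x.nc"))
writeBin(charToRaw("soilm"), file.path(second, "x.nc"))
writeBin(charToRaw("weasd"), file.path(third, "x.nc"))
hash_dir_contents(first) == hash_dir_contents(second) # TRUE
hash_dir_contents(first) == hash_dir_contents(third)  # FALSE
```

Because the paths are dropped before hashing, renaming a file changes the result only if it changes the sort order of the files.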
Example behavior: returns same hash value for identical data inputs downloaded in different directories, but a different hash for a different variable.
> amadeus::download_data(
+ directory_to_save = "../data/first",
+ acknowledge = TRUE,
+ download = TRUE,
+ hash = TRUE,
+ dataset_name = "narr",
+ variables = "soilm",
+ year = c(2020, 2022)
+ )
Downloading requested files...
Requested files have been downloaded.
[1] "10b752a21ad7f6e885ce2247eaf602b5232dff48"
> amadeus::download_data(
+ directory_to_save = "../data/second",
+ acknowledge = TRUE,
+ download = TRUE,
+ hash = TRUE,
+ dataset_name = "narr",
+ variables = "soilm",
+ year = c(2020, 2022)
+ )
Downloading requested files...
Requested files have been downloaded.
[1] "10b752a21ad7f6e885ce2247eaf602b5232dff48"
> amadeus::download_data(
+ directory_to_save = "../data/third",
+ acknowledge = TRUE,
+ download = TRUE,
+ hash = TRUE,
+ dataset_name = "narr",
+ variables = "weasd",
+ year = c(2020, 2022)
+ )
Downloading requested files...
Requested files have been downloaded.
[1] "30b8235c4fd0110c8f4621e359f9ce84bfae288e"
hash = TRUE returning same hash for different folders. Investigate why and fix.