NIEHS / amadeus

https://niehs.github.io/amadeus/

review and update `hash` parameter #136

Closed mitchellmanware closed 3 weeks ago

mitchellmanware commented 1 month ago

`hash = TRUE` returns the same hash for different folders. Investigate why and fix.

descartes1999 commented 1 month ago

@mitchellmanware That may be a desired effect. That suggests the hash is relative to where it is (folder or machine). For something like a targets pipeline, we'll have to think about whether that is what we want.

kyle-messier commented 3 weeks ago

@mitchellmanware I just noticed that the hash is also the same for different branches. Perhaps this is what you meant initially.

mitchellmanware commented 3 weeks ago

@kyle-messier This is what I was referring to - I was getting the same hash for different parameters, and even different datasets. Working on this now.

mitchellmanware commented 3 weeks ago

> download_hash(dir = "../data/air.sfc", hash = TRUE) ==
+   download_hash(dir = "../data/weasd", hash = TRUE)
[1] TRUE

mitchellmanware commented 3 weeks ago

One option is to hash file names relative to `directory_to_save` (full file paths will obviously differ between locations) together with file size in bytes. I found that including all hashable metadata always produced different values, because items such as user and time of creation differ, so this version hashes only file name and file size.

#' Create hash of downloaded files.
#' @description
#' Create a combined SHA-1 hash based on the contents and sizes of files
#' in a specified directory. System-specific metadata (e.g. full file paths,
#' access times, or user information) are not tracked, ensuring the hash
#' remains consistent across different systems, users, and access times.
#' @param hash logical(1). Create hash of downloaded files.
#' @param dir character(1). Directory path.
#' @return character(1) Combined SHA-1 hash of the files' contents and sizes.
#' @keywords internal auxiliary
#' @importFrom rlang hash_file
#' @export
download_hash <- function(
  hash = TRUE,
  dir = NULL
) {
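  if (hash) {
    # Build one shell pipeline: SHA-1 of each file's contents, followed by
    # each file's size in bytes (both in sorted path order), hashed together.
    # Relies on `find`, `sort -z`, `xargs`, `sha1sum`, `awk`, and `stat -c`.
    h_command <- paste0(
      "(find ",
      shQuote(dir),
      " -type f -print0 | sort -z | ",
      "xargs -0 sha1sum -- | awk '{print $1}'; ",
      "find ",
      shQuote(dir),
      " -type f -print0 | sort -z | ",
      "xargs -0 stat -c '%s') | sha1sum"
    )
    h <- system(h_command, intern = TRUE)
    # sha1sum marks input read from stdin with a trailing "  -"; strip it.
    h_clean <- sub("  -$", "", h)
    return(h_clean)
  }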
}

Example behavior: returns same hash value for identical data inputs downloaded in different directories, but a different hash for a different variable.

> amadeus::download_data(
+   directory_to_save = "../data/first",
+   acknowledge = TRUE,
+   download = TRUE,
+   hash = TRUE,
+   dataset_name = "narr",
+   variables = "soilm",
+   year = c(2020, 2022)
+ )
Downloading requested files...

Requested files have been downloaded.

[1] "10b752a21ad7f6e885ce2247eaf602b5232dff48"
> amadeus::download_data(
+   directory_to_save = "../data/second",
+   acknowledge = TRUE,
+   download = TRUE,
+   hash = TRUE,
+   dataset_name = "narr",
+   variables = "soilm",
+   year = c(2020, 2022)
+ )
Downloading requested files...

Requested files have been downloaded.

[1] "10b752a21ad7f6e885ce2247eaf602b5232dff48"
> amadeus::download_data(
+   directory_to_save = "../data/third",
+   acknowledge = TRUE,
+   download = TRUE,
+   hash = TRUE,
+   dataset_name = "narr",
+   variables = "weasd",
+   year = c(2020, 2022)
+ )
Downloading requested files...

Requested files have been downloaded.

[1] "30b8235c4fd0110c8f4621e359f9ce84bfae288e"
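
The `download_hash()` pipeline above calls `sha1sum` and `stat -c`, which are GNU-style tools and flags and may not behave the same on every platform. This is an observation about those tools, not something raised in the thread. As a rough, portable sketch of the same idea in base R, the hypothetical helper `hash_dir_sketch()` below combines per-file content hashes from `tools::md5sum()` with sizes from `file.size()`. It is not part of amadeus, and it uses MD5 rather than SHA-1, so its values will not match those printed above.

```r
# Hypothetical helper, not part of amadeus: a base-R sketch of a
# location-independent directory hash built from file contents and sizes.
hash_dir_sketch <- function(dir) {
  stopifnot(is.character(dir), length(dir) == 1, dir.exists(dir))
  # Sorted paths, so the result does not depend on listing order.
  files <- sort(list.files(dir, recursive = TRUE, full.names = TRUE))
  # Per-file content hashes; unname() drops the path names so the
  # directory location does not leak into the combined hash.
  contents <- unname(tools::md5sum(files))
  sizes <- file.size(files)
  # tools::md5sum() only hashes files, so write the summary to a temp file.
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(c(contents, as.character(sizes)), tmp)
  unname(tools::md5sum(tmp))
}

# Usage: identical files in two directories, different data in a third.
d1 <- tempfile("first")
d2 <- tempfile("second")
d3 <- tempfile("third")
for (d in c(d1, d2, d3)) dir.create(d)
writeLines("soilm", file.path(d1, "a.txt"))
writeLines("soilm", file.path(d2, "a.txt"))
writeLines("weasd", file.path(d3, "a.txt"))

same <- hash_dir_sketch(d1) == hash_dir_sketch(d2)
different <- hash_dir_sketch(d1) != hash_dir_sketch(d3)
```

In this sketch the two directories with identical files share a hash, while the third differs even though its file has the same name and size, because file contents are part of the hash.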
mitchellmanware commented 3 weeks ago

https://github.com/NIEHS/amadeus/commit/bf070f56fd7c1a255bacdcee52f0a3d2545f5033