NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Location of download functions #164

Closed: mitchellmanware closed this 8 months ago

mitchellmanware commented 10 months ago

Why do the download/process functions live in the 'input' folder and not in 'R'?

MAKassien commented 10 months ago

It's because we wanted to leave them in the same folder as the data they generate, and also because the GitHub tests didn't like these download functions due to an issue with needing read-write privileges to download files. It was a discussion we had with @Spatiotemporal-Exposures-and-Toxicology and @sigmafelix, so they may be able to explain in more detail.

This is actually a good thing to bring up because it may inform how we organize (& test) the covariate download scripts and data folders.

mitchellmanware commented 10 months ago

That all makes sense, thank you. Your point about testing the download scripts is what brought the question up.

sigmafelix commented 10 months ago

@MAKassien @mitchellmanware Thank you for bringing up the issue. I think we could move the download-and-preprocess functions into ./R, but my concern is that they would not pass the tests, given the write error we briefly discussed for testing download functions. We could add test scripts for such functions that expect errors; that way each function is at least partly tested. Ideally, all download functions would live in ./R/download.R (for example).
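A minimal sketch of such an error-expecting test, where download_foo and its arguments are hypothetical stand-ins for the actual download functions:

testthat::test_that("download functions error without write privileges", {
  # download_foo is a hypothetical stand-in; on a runner lacking
  # write privileges, attempting to create output files should
  # raise an error rather than download anything
  testthat::expect_error(
    download_foo(
      date_start = "2018-01-01",
      date_end = "2018-01-02",
      directory_to_save = "/no-write-access/"
    )
  )
})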

kyle-messier commented 10 months ago

@MAKassien @mitchellmanware @sigmafelix

When the download functions are in ./R/, the test-coverage and R-CMD-check workflows run them and end up re-downloading the data. I agree with @sigmafelix that we should put them all in one R file in the R/ folder. However, we have to do something to bypass running the download commands. Potential options include: (1) skip specific lines of code in tests and build; (2) add a check to the R code so that, if the files already exist, the download lines are skipped (sketched below).
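A minimal sketch of option (2), where target_path and download_url are hypothetical names rather than anything from the actual functions:

# skip the download when the file is already present locally
if (!file.exists(target_path)) {
  download.file(url = download_url, destfile = target_path)
}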

mitchellmanware commented 10 months ago

Is anyone else having trouble accessing the VPN or is it just me?

Update: my wifi decided to stop for a while 😅

mitchellmanware commented 9 months ago

Follow-up, November 22, 2023. @Spatiotemporal-Exposures-and-Toxicology @sigmafelix @MAKassien

As mentioned in the previous comments, a potential way to move the download functions into the ./R/ folder is to add a "test mode" argument to each function. Using the argument to bypass the download command would leave most of the code in the function covered. For example, only two lines of code would be uncovered in the following scenario (download_geos_cf_data.R).

  # above is rest of the download function #
  #### 11. test mode
  if (test_mode == FALSE) {
    #### 12 download data
    system(command = system_command)
    #### 13 remove "..._wget_commands.txt" file
    file.remove(commands_txt)
  }

We can write tests that check that the ..._wget_commands.txt file exists, and also check the contents of the file (i.e., that the number of rows matches the expected count based on the defined date sequence), without actually executing the download commands.
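A sketch of such a check; the file path is hypothetical, and it assumes the function writes one wget command per date in the requested sequence:

testthat::test_that("wget commands file exists with one command per date", {
  # hypothetical path; mirrors the test_mode scenario above
  commands_txt <- "./data/chm_tavg_1hr_g1440x721_v1_wget_commands.txt"
  testthat::expect_true(file.exists(commands_txt))
  # one command expected per date in the defined date sequence
  dates <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")
  wget_commands <- readLines(commands_txt)
  testthat::expect_identical(length(wget_commands), length(dates))
})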

Would the test-coverage and R-CMD check read-write restrictions also prohibit creating the ..._wget_commands.txt file?

MAKassien commented 9 months ago

Thanks @mitchellmanware, I've added this to the new version of the download_tri function that I will push to branch #185 today. I think the original problem was with web downloads specifically, so creating a local file like that .txt may not be a problem? Although Kyle and Insang may have a better idea of how this works. @Spatiotemporal-Exposures-and-Toxicology @sigmafelix

sigmafelix commented 9 months ago

@mitchellmanware @MAKassien We might be able to allow writing files in the GitHub Actions runner by changing the YAML settings. Before implementing changes on the GitHub Actions side, though, how about generating a character vector internally and sinking it to a file only when an argument is set to TRUE? In the corresponding test scripts, we could then take just the character vector of URLs and do a "presence check" with a function from the httr package (StackOverflow post). For example:

testthat::test_that("Check if files on Web addresses are present", {
    withr::local_package("httr")
    # provided that download_foo includes an argument to return character object
    urllist <- download_foo(..., return_urls = TRUE)

    urlFileExist <- function(url){
        HTTP_STATUS_OK <- 200
        hd <- httr::HEAD(url)
        status <- hd$all_headers[[1]]$status
        return(status == HTTP_STATUS_OK)
    }

    # randomly pick urls and check
    url_pick <- sample(urllist, 30L, replace = FALSE)
    url_status <- sapply(url_pick, urlFileExist)

    testthat::expect_true(is.character(urllist))
    testthat::expect_true(all(url_status))

})

Of course, we would need to modify the DESCRIPTION file so that httr is available in the GitHub Actions runners. This is another discussion topic for next Monday.

mitchellmanware commented 9 months ago

@MAKassien @sigmafelix The test_mode argument in this example does what Insang suggested: it sinks a character vector to a .txt file with all of the commands but does not execute the download command.

> ################################################################################
> testthat::test_that("Check if files on Web addresses are present", {
+   withr::local_package("httr")
+   # provided that download_foo includes an argument to return character object
+   
+   # download arguments
+   directory_to_save <- "/Volumes/manwareme/function_copies/tests/data/"
+   collection <- "chm_tavg_1hr_g1440x721_v1"
+   date_start <- "2018-01-01"
+   date_end <- "2018-12-31"
+   
+   # download
+   download_geos_cf_data(date_start = date_start,
+                         date_end = date_end,
+                         collection = collection,
+                         directory_to_save = directory_to_save,
+                         data_download_acknowledgement = TRUE,
+                         test_mode = TRUE)
+   
+   # read commands
+   wget_path <- paste0(directory_to_save,
+                       collection,
+                       "_wget_commands.txt")
+   wget_commands <- read.csv(wget_path, header = FALSE)
+   
+   # convert to character
+   wget_commands <- wget_commands[1:nrow(wget_commands),]
+   
+   # extract URLs from command
+   urllist <- NULL
+   for (w in seq_along(wget_commands)) {
+     string <- wget_commands[w]
+     url <- stringr::str_split_i(string, " ", 2)
+     urllist <- c(urllist, url)
+   }
+ 
+   urlFileExist <- function(url){
+     HTTP_STATUS_OK <- 200
+     hd <- httr::HEAD(url)
+     status <- hd$all_headers[[1]]$status
+     return(status == HTTP_STATUS_OK)
+   }
+   
+   # randomly pick urls and check
+   url_pick <- sample(urllist, 30L, replace = FALSE)
+   url_status <- sapply(url_pick, urlFileExist)
+   
+   testthat::expect_true(is.character(urllist))
+   testthat::expect_true(all(url_status))
+   
+ })
Test passed 🎉

Similar to other tests, the urlFileExist function should perhaps live in the R/ folder, as it will be used in all of the download function tests.
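If it does move there, a roxygen-documented version could look like this (a sketch with the same logic as the helper in the test above):

#' Check whether a file at a URL exists
#'
#' Sends a HEAD request and reports whether the resource answers
#' with HTTP status 200; same logic as the test helper above.
#' @param url A character URL.
#' @return Logical; TRUE if the resource responds with status 200.
urlFileExist <- function(url) {
  HTTP_STATUS_OK <- 200
  hd <- httr::HEAD(url)
  status <- hd$all_headers[[1]]$status
  return(status == HTTP_STATUS_OK)
}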

kyle-messier commented 9 months ago

@sigmafelix add rvest and httr to the DESCRIPTION file under Imports
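For reference, a minimal sketch of the relevant DESCRIPTION field (version constraints and other entries omitted):

Imports:
    httr,
    rvest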

kyle-messier commented 9 months ago

Current plan:

  1. Utilize a path parameter with data on ddn NRTAPmodel repo

Potential update:

  1. Utilize git-lfs

kyle-messier commented 8 months ago

@mitchellmanware I think we can close this. Please confirm and close if so.

mitchellmanware commented 8 months ago

@Spatiotemporal-Exposures-and-Toxicology Confirming complete.