OCHA-DAP / hdx-signals

HDX Signals
https://un-ocha-centre-for-humanitarian.gitbook.io/hdx-signals/
GNU General Public License v3.0

`update_az_file()` - move `gmas_test_run()` condition to top? #137

Closed zackarno closed 2 months ago

zackarno commented 2 months ago

Would it make sense to move the gmas_test_run() check to the top of the function? That way, if it's TRUE, there's no need to connect to the blob at all.

caldwellst commented 2 months ago

Yeah, it's a discussion that we were having with @hannahker as well. I worry that if we just moved the check to the top, we wouldn't be running the saving-out functionality, which may impact some testing down the line. However, if we are moving testing into #131 to be separate, and GMAS_TEST_RUN is being renamed HS_DRY_RUN, we could make this change. I think we'd still want to discuss whether we want to completely avoid loading the blob here and running switch(), because the dry run functionality can be a nice way to interactively develop, make sure nothing is breaking, and view output emails. Let's see what Hannah says when back!
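
For reference, the two orderings being weighed can be sketched roughly as follows. This is a minimal illustration using base R only, not the package's actual code: is_dry_run(), update_file_sketch(), and the upload argument are hypothetical names, and the way HS_DRY_RUN is read from the environment is an assumption.

```r
# Hypothetical helper: treat anything other than an explicit "FALSE" as a dry run.
# How HS_DRY_RUN is actually read in the package is an assumption here.
is_dry_run <- function() {
  !identical(toupper(Sys.getenv("HS_DRY_RUN", unset = "TRUE")), "FALSE")
}

# Sketch of the "check at the top" ordering: on a dry run, nothing is
# serialised and nothing is uploaded.
# `upload` is a hypothetical stand-in for the real blob upload step.
update_file_sketch <- function(df, name,
                               upload = function(src, dest) stop("no upload configured")) {
  if (is_dry_run()) {
    message("Dry run: skipping ", name)
    return(invisible(FALSE))
  }

  # Only reached on a real run: write to a temp file, then upload it.
  tf <- tempfile(fileext = paste0(".", tools::file_ext(name)))
  utils::write.csv(df, tf, row.names = FALSE, na = "")
  upload(tf, name)
  invisible(TRUE)
}
```

The trade-off mentioned above is visible here: with the early return, a dry run never exercises the write step, so a serialisation problem would only surface on a real run.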

caldwellst commented 2 months ago

So, ran through update_az_file() to investigate this:

container <- get_container(container)
fileext <- tools$file_ext(name)
tf <- tempfile(fileext = paste0(".", fileext))

switch(fileext,
  csv = readr$write_csv(x = df, file = tf, na = ""),
  parquet = arrow$write_parquet(x = df, sink = tf),
  json = jsonlite$write_json(x = df, path = tf),
  geojson = sf$st_write(obj = df, dsn = tf, quiet = TRUE)
)

The only potential need to connect to Azure would be in get_container(), which pulls in the container_...() functions. Looking through container_prod() for example:

container_endpoint_prod <- az$blob_endpoint(
  endpoint = azure_endpoint_url("blob", "prod"),
  sas = get_env("DSCI_AZ_SAS_PROD")
)
az$blob_container(
  endpoint = container_endpoint_prod,
  name = "hdx-signals"
)

And both of these, while they look complex, are simply returning list objects with a special S3 class.

AzureStor::blob_endpoint

function (endpoint, key = NULL, token = NULL, sas = NULL, api_version = getOption("azure_storage_api_version")) 
{
    obj <- list(url = endpoint, key = key, token = token, sas = sas, 
        api_version = api_version)
    class(obj) <- c("blob_endpoint", "storage_endpoint")
    obj
}
<bytecode: 0x7fbef90bfd08>
<environment: namespace:AzureStor>

AzureStor:::blob_container.blob_endpoint

function (endpoint, name, ...) 
{
    obj <- list(name = name, endpoint = endpoint)
    class(obj) <- c("blob_container", "storage_container")
    obj
}
<bytecode: 0x7fbefd30ea68>
<environment: namespace:AzureStor>

So we don't actually have to connect to the blob at all in these steps; they work entirely without internet access. The endpoint and SAS are only validated at the point of access.
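
To illustrate the point, here is a small offline sketch that mirrors the two constructors printed above with base R stand-ins. The *_sketch function names, the example URL, and the token value are made up for illustration; this does not call AzureStor itself.

```r
# Stand-ins mirroring the printed constructors: each just builds a list
# and attaches S3 classes, with no network call involved.
blob_endpoint_sketch <- function(endpoint, sas = NULL) {
  obj <- list(url = endpoint, sas = sas)
  class(obj) <- c("blob_endpoint", "storage_endpoint")
  obj
}

blob_container_sketch <- function(endpoint, name) {
  obj <- list(name = name, endpoint = endpoint)
  class(obj) <- c("blob_container", "storage_container")
  obj
}

# A deliberately unreachable URL and a fake SAS: construction still succeeds,
# because nothing is checked until the container is actually used.
ep <- blob_endpoint_sketch("https://example.invalid/", sas = "not-a-real-token")
cont <- blob_container_sketch(ep, "hdx-signals")

inherits(cont, "blob_container")
is.list(unclass(cont))
```

Any failure from a bad URL or an expired SAS would therefore show up later, at the first real read or write against the container.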