bluegreen-labs / ecmwfr

Interface to the public ECMWF API Web Services
https://bluegreen-labs.github.io/ecmwfr/

Building requests #21

Closed eliocamp closed 5 years ago

eliocamp commented 5 years ago

I was thinking about how to go about the issue of specifying requests. In my experience as a recent user, creating them by hand is really complicated because of the many options, the unmemorable dataset names, and the possibility of invalid combinations. It would be great to have a more intuitive way of specifying requests, but it seems a daunting task for those same reasons, and also because of the vast variety of datasets and users.

Of course, one can go to the web API and click through the fields. But that gets annoying fast because of the sheer amount of clicking required to get a somewhat complete dataset (37 clicks to get all levels in ERA Interim). It also has some limitations, like not being able to select more than one year for daily fields.

What I (and probably most people) do is go to the website to get a request skeleton for the type of data I want (monthly, pressure levels, etc...), paste it into R and build around it, adding dates, changing resolution, etc... I think that formalising this process could be a good start.

My idea is to have "archetypes": a combination of a list, optionally some default arguments, and a function that combines the archetype with new data to build the request. This is a rough sketch:

# attach default values for the dynamic fields as an attribute
add_defaults <- function(object, ...) {
  defaults <- list(...)
  attr(object, "defaults") <- defaults
  object
}

# combine the archetype with new values and interpolate the
# "{placeholder}" strings with glue
build_request <- function(archetype, ...) {
  new <- list(...)
  # new values take precedence over the archetype defaults
  data <- c(new, attr(archetype, "defaults"))
  data <- data[!duplicated(names(data))]
  lapply(archetype, function(x) as.character(glue::glue_data(data, x)))
}

Then one would go to the website and build a list like this:

ERAI_monthly_levs <- list(
  class = "ei",
  dataset = "interim",
  expver = "1",
  levtype = "pl",
  stream = "moda",
  type = "an",
  format = "netcdf",
  date = "{date}",
  grid = "{res}/{res}",
  levelist = "{levs}",
  param = "155.128",
  target = "output"
)

ERAI_monthly_levs <- add_defaults(ERAI_monthly_levs, res = 2.5, levs = 500)

where the strings with curly brackets are the variables that will be populated. To build a real request one would do something like:

request <- build_request(ERAI_monthly_levs, 
                         date = "20100101", 
                         res = "2.5")
str(request)
#> List of 12
#>  $ class   : chr "ei"
#>  $ dataset : chr "interim"
#>  $ expver  : chr "1"
#>  $ levtype : chr "pl"
#>  $ stream  : chr "moda"
#>  $ type    : chr "an"
#>  $ format  : chr "netcdf"
#>  $ date    : chr "20100101"
#>  $ grid    : chr "2.5/2.5"
#>  $ levelist: chr "500"
#>  $ param   : chr "155.128"
#>  $ target  : chr "output"

and then pass that to wf_request().

What do you think?

khufkens commented 5 years ago

Agreed, sounds like a very nice solution.

As you mentioned, most queries are built by hand, which is annoying.

The original purpose of the package was to serve as a back end to my phenor package to provide data. Now, after formalizing it into a separate package, things are getting a bit out of hand as it serves a far wider community (and the querying part comes into view, which is hidden and constant in phenor) :smile:.

So yes, I like this solution. Also, apologies for not contributing too much myself. I'm rather tied up in other projects which need my attention more urgently.

khufkens commented 5 years ago

With ease of use in mind I should work on this (if I find time): https://github.com/khufkens/ecmwfrExtra

The idea here is to build upon the data streams and put together a number of common functions to serve a lot of communities / cross-cutting fields (think PDSI for drought monitoring - ecology / insurance etc., generating synoptic maps, etc.). If you have any ideas coming from an atmospheric sciences background, these are always welcome there as well.

These ideas should come together in a workshop funded by the R Consortium - a hackathon - hosted at ECMWF, to make people familiar with the data and potential applications.

eliocamp commented 5 years ago

ecmwfrExtra sounds great! In my lab, one of the hurdles for using data from the ECMWF is that it is relatively hard to get the data. In comparison, you can download the NCEP reanalysis with a simple wget command or through a web interface that allows bigger requests with less clicking. So we end up using it even though it is an inferior product by many measures. 😓️ If I can help, count me in!

Do you think that this new functionality should be in this package or in a new ecmwfrExtra?

khufkens commented 5 years ago

Building requests is definitely part of the main package.

The goal would be to keep the main package for querying data and potentially very basic operations (file format conversions if required). The extras package would deal with common operations upon the data which can be more domain specific, i.e. indices etc.

eliocamp commented 5 years ago

This is one possible implementation of the idea of archetypes. In this case, the user builds an archetype as if it were a named list. The archetype object is actually a function that returns the full request when evaluated.

#
# Some helper functions to format dates and vectors into the MARS format
#
wf_format_dates <- function(dates) {
  paste0(lubridate::year(dates),
         formatC(lubridate::month(dates), width = 2, flag = "0"),
         formatC(lubridate::day(dates), width = 2, flag = "0"),
         collapse = "/")
}

wf_slash_vector <- function(vector) {
  paste0(vector, collapse = "/")
}

#
# Main function. Takes a list and optional arguments.
# Returns a function that takes arguments and return the populated request.
#
wf_archetype <- function(query, ...) {
  query_exp <- rlang::enexpr(query)
  extra_args <- match.call(expand.dots = FALSE)$`...`
  has_default <- names(extra_args) != ""

  vars <- unique(c(all.vars(query_exp),
                   names(extra_args[has_default]),
                   as.character(extra_args[!has_default])
  ))

  # one (missing) argument per variable, with defaults filled in by name
  args <- setNames(rep(list(rlang::expr()),
                       length(vars)),
                   vars)
  args[names(extra_args[has_default])] <- extra_args[has_default]

  f <- rlang::new_function(args, query_exp)
  class(f) <- c("ecmwfr_archetype", class(f))
  f
}

# Functions for pretty printing
as.list.ecmwfr_archetype <- function(x, ...) {
  as.list(body(x))[-1]
}

print.ecmwfr_archetype <- function(x, ...) {
  components <- as.list(x)
  is_dynamic <- lapply(components, class) == "call"
  max_char_name <- max(vapply(names(components), nchar, 1))
  texts <- vapply(components, deparse, "a")
  max_char_text <- max(nchar(texts))

  rpad <- function(text, width) {
    formatC(text, width = -width, flag = " ")
  }

  cat("Request archetype with values: \n")
  for (comps in seq_along(components)) {
    star <- ifelse(is_dynamic[comps], " *", "")
    cat(" ",
        rpad(names(components)[comps], max_char_name),
        "=",
        rpad(texts[comps], max_char_text), star, "\n")
  }
  cat("arguments: ")
  args <- formals(x)
  for (a in seq_along(args)) {
    cat(names(args)[a])
    if (args[[a]] != rlang::expr()) {
      cat(" =", args[[a]])
    }
    if (a != length(args)) cat(", ", sep = "")
  }
}

Usage:

ERAI <- wf_archetype(
  list(class = "ei",
       dataset = "interim",
       expver = "1",
       levtype = "pl",
       stream = "moda",
       type = "an",
       format = "netcdf",
       date = wf_format_dates(date),
       grid = paste0(res, "/", res),
       levelist = wf_slash_vector(levs),
       param = "155.128",
       target = "output"),
  res = 3                               # sets default argument
)

ERAI is now a function that takes arguments date, res and levs and returns a list with the above expressions evaluated in the context of those arguments.

str(ERAI("2010-01-01", 3, 200))
#> List of 12
#>  $ class   : chr "ei"
#>  $ dataset : chr "interim"
#>  $ expver  : chr "1"
#>  $ levtype : chr "pl"
#>  $ stream  : chr "moda"
#>  $ type    : chr "an"
#>  $ format  : chr "netcdf"
#>  $ date    : chr "20100101"
#>  $ grid    : chr "3/3"
#>  $ levelist: chr "200"
#>  $ param   : chr "155.128"
#>  $ target  : chr "output"

And with a nice printing method.

print(ERAI)
#> Request archetype with values: 
#>   class    = "ei"                   
#>   dataset  = "interim"              
#>   expver   = "1"                    
#>   levtype  = "pl"                   
#>   stream   = "moda"                 
#>   type     = "an"                   
#>   format   = "netcdf"               
#>   date     = wf_format_dates(date)  * 
#>   grid     = paste0(res, "/", res)  * 
#>   levelist = wf_slash_vector(levs)  * 
#>   param    = "155.128"              
#>   target   = "output"               
#> arguments: date, res = 3, levs

Testing that it works

ecmwfr::wf_request(ERAI("2010-01-01", 3, 200),
                   "eliocampitelli@gmail.com")
#> - staging data transfer at url endpoint or request id:
#>   https://api.ecmwf.int/v1/datasets/interim/requests/5c9543622fd81cdb95d5203e
#>   No download requests will be made, however...
#> - Your request has been submitted as a WEBAPI request.
#> 
#>   Even after exiting your request is still beeing processed!
#>   Visit https://apps.ecmwf.int/webmars/joblist/
#>   to manage (download, retry, delete) your requests
#>   or to get ID's from previous requests.
#> 
#> - Retry downloading as soon as as completed:
#>   wf_transfer(url = 'https://api.ecmwf.int/v1/datasets/interim/requests/5c9543622fd81cdb95d5203e
#> <user>,
#>  ',
#>  path = '/tmp/RtmpqsTx4e',
#>  filename = 'output',
#>  service = 'webapi')
#> 
#> - Delete the job upon completion using:
#>   wf_delete(<user>,
#>  url ='https://api.ecmwf.int/v1/datasets/interim/requests/5c9543622fd81cdb95d5203e')

Created on 2019-03-22 by the reprex package (v0.2.1)

khufkens commented 5 years ago

Your first iteration made more sense, and was more intuitive, than the last one, I feel. Here, you swap values of the list. I think I might have misunderstood your principle.

Although the solution is clean, I see issues with the complexity of it all - potential issues when sharing scripts which do not include these custom calls and, above all, user support! Wouldn't it make more sense to have a function which takes a list, iterates over all items and modifies those which match the values provided?

In this case your original (MARS) request (list) is your archetype (skipping the construction of a dedicated one with its own syntax etc.), which you modify dynamically using a universal (package) function. This skips custom user-created functions, which can become a hellhole for us, mostly because people will start putting up requests for help with these (and since they are multi-levelled there is a whole lot that can go wrong). With the swap, things are 1 to 1 and you can just point at the MARS documentation to have them check their syntax, rather than digging into their custom code when things "don't work". TBH, I don't want those emails coming our way.

# your standard MARS request (queried once) or base archetype
monthly_precip_era <- list( class = "ei", ...)

# adapting or adding to this archetype
date_range <- wf_build_request(monthly_precip_era,
 date = "date1/to/date2",
 res = 2)
multi_date_selection <- wf_build_request(monthly_precip_era,
 date = "date1/date2/date3",
 res = 0.5)

# download
wf_request(date_range, "john@example.com")
wf_request(multi_date_selection, "john@example.com")

khufkens commented 5 years ago

So this would read, without any error trapping.


wf_modify_request <- function(request, ...){

  # check the request statement
  if(missing(request) || !is.list(request)){
    stop("not a request")
  }

  # load dot arguments
  dot_args <- list(...)

  # loop over everything
  do.call("c",lapply(names(request), function(request_name){

    # get a replacement value if matching
    # a name in the original request
    replacement <- dot_args[request_name]

    # is.null() doesn't work here: single-bracket subsetting with a
    # missing name returns a length-one list (holding NULL, with an NA
    # name) rather than NULL itself, so check the name instead
    if(is.na(names(replacement))) {
      return(request[request_name])
    } else {
      return(replacement)
    }
  }))
}

base_request <- list(stream = "oper",
                   levtype = "sfc",
                   param = "165.128/166.128/167.128",
                   dataset = "interim",
                   step = "0",
                   grid = "0.75/0.75",
                   time = "00/06/12/18",
                   date = "2014-07-01/to/2014-07-31",
                   type = "an",
                   class = "ei",
                   area = "73.5/-27/33/45",
                   format = "netcdf",
                   target = "tmp.nc")

new_request <- wf_modify_request(request = base_request,
                        date = "some date",
                        area = "some lat long combo",
                        dataset = "other interim")

print(new_request)
#> $stream
#> [1] "oper"
#> 
#> $levtype
#> [1] "sfc"
#> 
#> $param
#> [1] "165.128/166.128/167.128"
#> 
#> $dataset
#> [1] "other interim"
#> 
#> $step
#> [1] "0"
#> 
#> $grid
#> [1] "0.75/0.75"
#> 
#> $time
#> [1] "00/06/12/18"
#> 
#> $date
#> [1] "some date"
#> 
#> $type
#> [1] "an"
#> 
#> $class
#> [1] "ei"
#> 
#> $area
#> [1] "some lat long combo"
#> 
#> $format
#> [1] "netcdf"
#> 
#> $target
#> [1] "tmp.nc"

Created on 2019-03-23 by the reprex package (v0.2.1)

khufkens commented 5 years ago

In the above setup no accommodation is made for adding parameters, but maybe this is for the better, as there is no way to check the call unless you really submit it. The MARS request vocabulary is not well documented (in one place), so checking the call will be hard if elements are added which can't be directly verified.

The setup is less fancy, but also easier for people to wrap their heads around, I think (as it just swaps values).

I'm not sure how well this could / would play with pipes however.
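
At first glance, though, something along these lines might work (a rough sketch, reusing the base_request list and the wf_modify_request() draft from above, plus the magrittr pipe):

library(magrittr)

base_request %>%
  wf_modify_request(date = "2015-01-01/to/2015-01-31") %>%
  wf_modify_request(area = "60/-10/50/2") %>%
  wf_request("john@example.com")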

eliocamp commented 5 years ago

I thought about it, but there are some reasons why I like the function factory approach.

It's flexible in the amount of behaviour you can put in the archetype. It can compute on the arguments, as in the example of using res = 3 instead of res = "3/3". This can be extended to fancier computations or to using arguments that are not MARS keywords.

But at the same time, it's more strict. If you just swap keywords there's no way of specifying which keywords are fixed and which are dynamic, leaving open the possibility of returning invalid requests. I like having the option of an archetype ERA_interim that will only return requests for the ERA Interim dataset, so the user cannot screw it up with wf_modify_request(request = ERA_interim, dataset = "e20").

I do share your concern about explaining the function factory to users, but I don't think it is that different in practice and, for the most part, they don't really need to understand it. In the most basic version, you just use the basic MARS request, changing only the fields you want to make dynamic:

ERAI <- wf_archetype(
  list(class = "ei",
       dataset = "interim",
       expver = "1",
       levtype = "pl",
       stream = "moda",
       type = "an",
       format = "netcdf",
       date = date,
       grid = res,
       levelist = levs,
       param = "155.128",
       target = "output")
)

I also envision that either this package or the ecmwfrExtra package could later supply a number of ready-made archetypes for the most common types of request. Then, for the most basic usage of all, the user wouldn't even need to build the archetype themselves and would have very easy access to functions that reliably return valid requests.
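
For instance (function name and fields made up for illustration), a packaged archetype could boil basic usage down to something like:

# hypothetical ready-made archetype shipped with the package
request <- erai_monthly_pl(date = "20100101", res = 2.5, levs = "200/500")
wf_request(request, "john@example.com")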

khufkens commented 5 years ago

It is true that the simple approach leaves open the chance of a badly formatted parameter. But in this case the solution is simple: check their query. The error is visible and easily interpreted. Fixing the errors will also be on the user, not on us.

Now, for example, wf_format_dates() codes for a single date format. However, MARS requests cover multiple date formats, which might or might not work depending on the dataset and time frame, all of which need to be checked for if you provide wf_format_dates() as an option to format a query. And then there is the CDS back end, which uses a different date standard altogether. Basically, we have to support all of these, and their exceptions (or none of them).
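
Just to illustrate the difference (a rough sketch, not a full request in either case), roughly the same month expressed in the two styles:

# WebAPI / MARS style date range
webapi_dates <- list(date = "2014-07-01/to/2014-07-31")

# CDS style, where year, month and day are separate fields
cds_dates <- list(year  = "2014",
                  month = "07",
                  day   = sprintf("%02d", 1:31))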

Now, only trapping these errors would require more code than I care to maintain. You need context, so you need to forward the whole query to the wf_format_dates() function, match it to the product type, and format the query accordingly. Then you have the issue of trapping errors when users input incorrect data, otherwise you might still end up with the wrong query. If you want to be sure of all this, you run unit tests against all of these combinations. Even then you might miss something, someone will file a bug or ask for help, and you fix it (rather than the user having to deal with inputting correct data!). And this is only the date format! I don't want to imagine where this rabbit hole can lead.

I can live with the setup below, without custom formatting functions, but I'm not willing to support anything else. I'm operating under the assumption that if you decide not to maintain the pieces you wrote (new job, family commitments, no interest), I need to be able to maintain the codebase (mostly in terms of time). I hope you continue to contribute, but I fear I don't have the time for what you propose should you step back. The latter is also the reason why most of my packages are really lean and tend to do one thing, and one thing really well (Unix philosophy). It saves me time as it puts some responsibility squarely with the user. From this perspective, the gains are relatively minor, while the overhead is significant (when adding custom formatting + checks).

ERAI <- wf_archetype(
  list(class = "ei",
       dataset = "interim",
       expver = "1",
       levtype = "pl",
       stream = "moda",
       type = "an",
       format = "netcdf",
       date = date,
       grid = res,
       levelist = levs,
       param = "155.128",
       target = "output")
)

khufkens commented 5 years ago

So, a short summary of the above: I fear that what you propose, although good, is not feasible in terms of long-term maintenance (and some responsibility needs to be placed with the user).

eliocamp commented 5 years ago

The formatting functions were only helpers, to test and show the use of custom code in there. I didn't mean for them to be an integral part of the request-building process (at least not before careful consideration because, as you say, there are a lot of things to consider!). I should've been clearer, sorry!

All those issues with formatting are one reason I like the idea of having a strict request builder. I would loooove to have a function that builds valid requests in a human-readable way (that doesn't involve dozens of clicks!). I spent a whole day just trying to understand how to get data from just one dataset. :sweat:

khufkens commented 5 years ago

I just wanted to put it all into perspective before you put too much effort into formatting functions. I fear it is a Sisyphean project. This complexity is probably the reason why you have to build a request on the website and not through the APIs directly: it's too hard, even for the people at ECMWF, to put time into. Given that we don't have insight into all products and their intricacies, the best way is to build from an original query and only tailor limited options - most commonly, time and space settings. Alternatively, we could check the request; this is possible for the WebAPI but not for CDS.

Would it be good to compromise like this:

ERAI <- wf_archetype(
  request = list(class = "ei",
       dataset = "interim",
       expver = "1",
       levtype = "pl",
       stream = "moda",
       type = "an",
       format = "netcdf",
       date = "1",
       grid = "3/3",
       levelist = "dal/dlad",
       param = "155.128",
       target = "output"),
      dynamic_values = c(date, res, levs)
)

khufkens commented 5 years ago

I implemented both wf_modify_request() and wf_archetype() after banging my head against formatting expressions from list elements for too long. This might change in the future, but for now both exist next to each other, as I have other fish to fry.

The wf_archetype() function comes with a very bold disclaimer that I guarantee the functionality of the function itself, not of what it generates. Basically, I will not provide support for any user-generated functions (the setup is too flexible for that).

closed per: https://github.com/khufkens/ecmwfr/pull/22

eliocamp commented 5 years ago

You beat me to it! Sorry, I was busy with some time-sensitive work. One of those projects uses data downloaded with this package, hehe :grin:.

khufkens commented 5 years ago

Good to see you have use for it already!

You seem more proficient with the expression calls; if you know of a fix for integrating both, it would still be nice. I got stuck on the fact that if the list is not "hand coded" with arguments, I can't pass the correct type to the list element. No clue what is required, but it didn't stick. Basically, the call didn't interpret the list elements set to "being a variable". Anyway, it will do for now.

khufkens commented 5 years ago

btw, I always welcome fixes to docs, vignettes and unit checks.

These are a pain, but make later work easier. I think most of the basics should be covered for the new functions, but no examples are given in the vignettes on automation, for example. If you have code you used recently that could serve for this, let me know (a small spatial extent is preferred for speed - actually, I query a small local dataset to speed up builds / cheap trick :smile:).

eliocamp commented 5 years ago

Ok! I'll check both things out and see what I can do.

khufkens commented 5 years ago

Thanks. With those, I think I might push the changes as the next release. Enough fixes have accumulated, I feel.

eliocamp commented 5 years ago

I thought about how to seamlessly integrate both, but I'm not coming up with anything solid. This would be something close: wf_modify_request() could modify archetypes:

ERA_interim <- wf_archetype(
  list(class = "ei",
       dataset = "interim",
       expver = "1",
       levtype = "pl",
       stream = "moda",
       type = "an",
       format = "netcdf",
       date = date,
       grid = paste0(res, "/", res),
       levelist = levs,
       param = "155.128",
       target = "output"),
  res = 3  # sets default argument
)

# Change param to temperature
wf_modify_request(ERA_interim,
                  param = "130.128")

#> Request archetype with values: 
#>   class    = "ei"                   
#>   dataset  = "interim"              
#>   expver   = "1"                    
#>   levtype  = "pl"                   
#>   stream   = "moda"                 
#>   type     = "an"                   
#>   format   = "netcdf"               
#>   date     = date                   * 
#>   grid     = paste0(res, "/", res)  * 
#>   levelist = levs                   * 
#>   param    = "130.128"              
#>   target   = "output"               
#> arguments: date, res = 3, levs

But I'm not sure if it makes sense. With this modification, wf_modify_request() could return very different objects depending on the input object. Did you have something else in mind?

PS: I've realised that wf_modify_request() is basically the same as using within(): wf_modify_request(base, date = new_date) is equivalent to within(base, {date <- new_date}).
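
For example, reusing the base_request list from the earlier comment (a quick sketch, with wf_modify_request() as just merged):

modified <- wf_modify_request(base_request,
                              date = "2015-01-01/to/2015-01-31")
by_hand  <- within(base_request, {date <- "2015-01-01/to/2015-01-31"})
identical(modified, by_hand)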

khufkens commented 5 years ago

What I want to get to is this:

ERAI <- wf_archetype(
  request = list(stream = "oper",
                   levtype = "sfc",
                   param = "165.128/166.128/167.128",
                   dataset = "interim",
                   step = "0",
                   grid = "0.75/0.75",
                   time = "00/06/12/18",
                   date = "2014-07-01/to/2014-07-31",
                   type = "an",
                   class = "ei",
                   area = "73.5/-27/33/45",
                   format = "netcdf",
                   target = "tmp.nc"),
      dynamic_values = c("date", "time")
)

So similar, but the other way around. The reason is that this is more transparent in how the query was built. In particular, one retains the original MARS/CDS query while building the function. This will help a great deal in debugging. If the original query is not retained, it is hard to know why the code would break, for example - there is no way of checking whether the new query builds the arguments into the correct form.

When you specify which values should be dynamic you can check that they exist (sanity check), and their names are kept consistent with the values used in the original query. The latter limits confusion from accidental dual use of variables and MARS query parameters, etc.

It does limit custom functions in the archetype itself, but I feel they should be avoided as they do not scale. They are still possible, but they would be visible (i.e. easier to debug). For example, you could build a query automatically as below, where the paste() bit isn't hidden in the archetype function itself:

ERAI(date = paste0(date1,"/to/",date2), time = "00")

An additional feature of this approach is that you can retain the original query, i.e. you can retain sensible default values. So basically you build a function with the dynamic variables as arguments, but also assign them the original values. This way you can specify the most common parameters to change, without needing a function for every parameter combination. The latter would do away with setting default arguments as in the current implementation.

For this to work, I "just" need to replace the current value in, say, date with an unassigned variable called date, which can be substituted by the argument of the built function. I failed on that last bit: basically, assigning the field date a date variable always resulted in it being seen as a string or some other non-interpretable type. Maybe within() allows for this.
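
Perhaps something along these lines could do that substitution (an untested sketch, assuming rlang; wf_archetype_sketch() and its arguments are made-up names, reusing base_request from above):

wf_archetype_sketch <- function(request, dynamic_values) {
  # sanity check: dynamic fields must exist in the original request
  stopifnot(all(dynamic_values %in% names(request)))

  # the original values become the default arguments
  args <- request[dynamic_values]

  # replace each dynamic value with a bare symbol of the same name, so it
  # picks up the matching function argument when the body is evaluated
  body_list <- request
  body_list[dynamic_values] <- lapply(dynamic_values, as.name)

  rlang::new_function(args, rlang::expr(list(!!!body_list)))
}

ERAI <- wf_archetype_sketch(base_request, c("date", "time"))
str(ERAI(date = "2015-01-01/to/2015-01-31"))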

Anyway, the above would allow for the flexibility of the archetypes, with the visibility / transparency of a simple modify or within statement. Thoughts?

eliocamp commented 5 years ago

I like it. It has the benefit of always working with a valid request (or, more exactly, a request as valid as the original one) and the strictness of only changing some values, so the result always matches the name of the variable. I see that it's better to leave custom formatting or other computations as arguments for the function when called. A more experienced user can create their own custom functions, anyway.

I'm sending a PR now with the changes.

khufkens commented 5 years ago

Thanks for all the changes, this makes a big difference I think. We could remove the wf_modify_request() option, but it might still serve some users (maybe in custom functions etc.). Not sure what to do with it; I guess it doesn't hurt to keep it for now.

I'll probably see to it that code coverage is good and then will push this new version to CRAN. More than enough changes have accumulated.