Daily averages using CDS API

ainsliej commented 1 year ago

Hi there,

I was wondering whether it was possible to access the Daily Averages ERA-5 data from the CDS API using the package?

I have tried adding the 'statistic' parameter to the request, but it returns an error message "Ambiguous parameter: statistic could be STEP or STYLE".

Thanks for making such a great package, its been so useful to me!

khufkens commented 1 year ago

Sorry for the late response, holidays and all.

This isn't really a true data product (reading the linked document), this is a workflow on the CDS machines (using hourly data) for which the daily summary values are then returned for download. Note the different API call they use c.service() instead of c.retrieve() as for normal data.

I've not considered the option of using these workflows in addition to basic requests for data. I'm not sure this will work, but I'll see if I can make this work. In any case this might remain an niche feature as pretty advanced.

I'll look into it.

khufkens commented 1 year ago

@ainsliej It seems the functionality is exposed via the normal API, so this could be made functional.

The below code (for personal reference mostly) shows the retrieve call, currently used. Below is the service call and API endpoints used. I'm unsure what is returned as 'result', but I guess this will be a URL for downloading data. Anyway, supporting workflows should be possible.

    def retrieve(self, name, request, target=None):
        result = self._api("%s/resources/%s" % (self.url, name), request, "POST")
        if target is not None:
            result.download(target)
        return result

    def service(self, name, *args, **kwargs):
        self.delete = False  # Don't delete results
        name = "/".join(name.split("."))
        mimic_ui = kwargs.pop('mimic_ui', False)
        # To mimic the CDS ui the request should be populated directly with the kwargs
        if mimic_ui:
            request = kwargs
        else:
            request = dict(args=args, kwargs=kwargs)

        if self.metadata:
            request["_cds_metadata"] = self.metadata
        request = toJSON(request)
        result = self._api(
            "%s/tasks/services/%s/clientid-%s" % (self.url, name, uuid.uuid4().hex),
            request,
            "PUT",
        )
        return result

I'll first try to get a mock example to work, and will proceed from there. Will keep you posted on progress here.

@eliocamp read along if interested, I might need your input in implementing the workflow method in the R6 classes.

eliocamp commented 1 year ago

I am interested.

So is this another API endpoint that the python library exposes via its service() method? Does it support any arbitrary computation or only the limited number of statistics listed in the documentation?

khufkens commented 1 year ago

The below example works and generates daily mean temperatures for the January 2020. Note that you have to cycle through months as per example (pdf you linked to). Querying multiple months, or a single day by using an additional "day" parameter does not work.

library(ecmwfr)
library(terra) # to visualize things

# set your user ID
user <- "YOUR_CDS_ID"

# get key (as normally - hidden from users)
key <- wf_get_key(user = user, service = "cds")

# This query works
# single month of mean daily temperature
request = list(
  "args" = "",
  kwargs = list(
    params = list(
      kwargs = list(
        "dataset" = "reanalysis-era5-single-levels",
        "product_type" = "reanalysis",
        "variable" = "2m_temperature",
        "statistic" = "daily_mean",
        "year" = "2020",
        "month" = "01",
        "time_zone" = "UTC+00:0",
        "frequency" = "1-hourly",
        "grid" = "0.25/0.25",
        "area" = list(
          lat = list(38, 60),
          lon = list(-20,20)
        )
      ),
      "workflow_name" = "application",
      "realm" = "c3s",
      "project" = "app-c3s-daily-era5-statistics",
      "version" = "master"
    )
  )
)

# Get the response for the query provided
# from the correct endpoint
response <- httr::PUT(
  sprintf(
    "%s/tasks/services/%s/clientid-%s",
    ecmwfr:::wf_server(service = "cds"),
    gsub("\\.", "/","tool.toolbox.orchestrator.workflow"),
    uuid::UUIDgenerate(output = "string")
  ),
  httr::authenticate(user, key),
  httr::add_headers("Accept" = "application/json",
                    "Content-Type" = "application/json"),
  body = request,
  encode = "json"
)

# The returned content, this is not the data
# only where to get the data if successful
ct <- httr::content(response)
print(ct)

# You can reconstitute the URL where to download things
# as such
url <- ecmwfr:::wf_server(id = ct$request_id, service = "cds")

# update the response to see if the download
# is available (you can check this on the CDS page)
# but in the package for normal requests this query
# is cycled until "completed", see below
response <- httr::GET(
  url,
  httr::authenticate(user, key),
  httr::add_headers("Accept" = "application/json",
                    "Content-Type" = "application/json"),
  encode = "json"
)

# grab content from response
ct <- httr::content(response)

# if the response states that it is completed
# grab the location from the response element and
if (ct$state == "completed") {
  download.file(
    ct$result[[1]]$location,
    "./test.nc"
  )
}

r <- terra::rast("./test.nc")
terra::plot(r)

khufkens commented 1 year ago

@eliocamp see above, this solution works and should be relatively easily integrated into the existing R6 framework as it follows the workflow almost to the letter. Only the file location is embedded into a deeper list level as per ct$result[[1]]$location above.

Took some data sleuthing in the python cdsapi but generally an easy fix. Thanks @ainsliej for pointing me to this.

khufkens commented 1 year ago

@eliocamp I only think the stuff listed in the document will be supported, this is not a general purpose endpoint. But it provides a fair number of products to summarize.

As I understand it the endpoint really taps into some backend runner, which does the work (given forwarded application parameters) and then dumps data into the CDS requests framework for managing data egress (downloads).

Technically you should be able to tap into other applications (on the CDS website) and grab output data as well (if you know how to format the parameter correctly, and how the app is named).

khufkens commented 1 year ago

To be specific, the below settings (as included in the query) specify it is an application, which needs to be downloaded from the realm (gitlab account as far as I could tell from debugging) and grab the repository app-c3s-daily-era5-statistics and the master branch of it.

      "workflow_name" = "application"
      "realm" = "c3s"
      "project" = "app-c3s-daily-era5-statistics"
      "version" = "master"

eliocamp commented 1 year ago

Ah. So someone creates an application in the CDS thingy and then you can query it with this code? So this code uses a specific application that computes averages?

khufkens commented 1 year ago

Exactly! Technically if you create an App online in the CDS toolbox editor you should be able to query the output via these means as well if you can figure out the bits and bops of interfacing. But, the source code for the above example is online so this shouldn't be too hard.

https://cds.climate.copernicus.eu/cdsapp#!/software/app-c3s-daily-era5-statistics?tab=appcode

This opens top opportunity to implement point wise data extraction! Which currently isn't easily done using the default API.

eliocamp commented 1 year ago

Very interesting! I'm now thinking about creating specific R packages to get preprocessed data using idiomatic code with ecmwfr as the backend... but I digress.

So would be a new wf_application() or something else? Is there a GUI to select all the parameters and get code that we can then automatically translate to R code like we do with the requests? Or will the user need to understand a bit more about each specific application to build a valid request?

khufkens commented 1 year ago

There is a GUI but only that, it doesn't return an API query explicitly. I guess the closest thing would be to copy the URL once all parameters are set. Obviously if you only need one dataset you can then just punch download, but the point is to automate these things (if not using constructor functions).

This is the GUI: https://cds.climate.copernicus.eu/apps/user-apps/app-c3s-daily-era5-statistics

The url will set the defaults, or settings and can be parsed I guess - this would be the easiest way to avoid user input errors (which are the most common issues listed here).

We can fold this into wf_request() or keep it separate for now, and see where it goes or how far we can stretch this. I think it has potential (as you can move some compute to ECMWF servers).

khufkens commented 1 year ago

Alternatively, you mimick the GUI and check for stuff on the package side of things (arguably this is more maintenance and might not be sustainable if more such applications come online).

Regardless of the approach taken, I fear this is a rather advanced functionality, but potentially one that will see high demand (from the ecology community I know the demand for daily data will be high).

eliocamp commented 1 year ago

Yeah, no, this needs to be a more generic support for any application, I think.

I guess the closest thing would be to copy the URL once all parameters are set.

That might do it, but is this generic enough for any application or is it specific for this one?

khufkens commented 1 year ago

@eliocamp Plot twist !!!

The cds toolbox editor lets you dump a whole serialized piece of code into the API and execute it - I don't think it needs to be in your editor workspace at all!

See example below, which is the basic point extraction and plot data setup (example 03) but instead I return the data element not the figure.

This works! Data needs to be read with ncdf4 as not geospatial. I hope they lock down their system well because this is like SQL injection galore if not. :fearful:

library(ecmwfr)

# set your user ID
user <- "2088"

# get key (as normally - hidden from users)
key <- wf_get_key(user = user, service = "cds")

# CDS Toolbox editor python code
code <- "import cdstoolbox as ct\n\nlayout = {\n    'input_ncols': 3,\n}\n\nvariables = {\n    'Near-Surface Air Temperature': '2m_temperature',\n    'Eastward Near-Surface Wind': '10m_u_component_of_wind',\n    'Northward Near-Surface Wind': '10m_v_component_of_wind',\n    'Sea Level Pressure': 'mean_sea_level_pressure',\n    'Sea Surface Temperature': 'sea_surface_temperature',\n}\n\n\n@ct.application(title='Extract a time series and plot graph', layout=layout)\n@ct.input.dropdown('var', label='Variable', values=variables.keys(), description='Sample variables')\n@ct.input.text('lon', label='Longitude', type=float, default=75., description='Decimal degrees')\n@ct.input.text('lat', label='Latitude', type=float, default=43., description='Decimal degrees')\n@ct.output.livefigure()\ndef plot_time_series(var, lon, lat):\n    \"\"\"\n    Application main steps:\n\n    - set the application layout with 3 columns for the input and output at the bottom\n    - retrieve a variable over a defined time range\n    - select a location, defined by longitude and latitude coordinates\n    - compute the daily average\n    - show the result as a timeseries on an interactive chart\n\n    \"\"\"\n\n    # Time range\n    data = ct.catalogue.retrieve(\n        'reanalysis-era5-single-levels',\n        {\n            'variable': variables[var],\n            'grid': ['3', '3'],\n            'product_type': 'reanalysis',\n            'year': [\n                '2008', '2009', '2010',\n                '2011', '2012', '2013',\n                '2014', '2015', '2016',\n                '2017'\n            ],\n            'month': [\n                '01', '02', '03', '04', '05', '06',\n                '07', '08', '09', '10', '11', '12'\n            ],\n            'day': [\n                '01', '02', '03', '04', '05', '06',\n                '07', '08', '09', '10', '11', '12',\n                '13', '14', '15', '16', '17', '18',\n                '19', '20', '21', '22', '23', '24',\n                '25', '26', '27', '28', '29', '30',\n                '31'\n            ],\n            'time': ['00:00', '06:00', '12:00', '18:00'],\n        }\n    )\n\n    # Location selection\n\n    # Extract the closest point to selected lon/lat (no interpolation).\n    # If wrong number is set for latitude, the closest available one is chosen:\n    # e.g. if lat = 4000 -> lat = 90.\n    # If wrong number is set for longitude, first a wrap in [-180, 180] is made,\n    # then the closest one present is chosen:\n    # e.g. if lon = 200 -> lon = -160.\n    data_sel = ct.geo.extract_point(data, lon=lon, lat=lat)\n\n    # Daily mean on selection\n    data_daily = ct.climate.daily_mean(data_sel)\n\n    fig = ct.chart.line(data_daily)\n\n    return data_daily\n"

# This query works
# single month of mean daily temperature
request = list(
      code = code,
      kwargs = list(
        "lat" = 43,
        "lon" = 75,
        "var" = "Near-Surface Air Temperature"
      ),
      "workflow_name" = "plot_time_series"
    )

# Get the response for the query provided
# from the correct endpoint
response <- httr::PUT(
  sprintf(
    "%s/tasks/services/%s/clientid-%s",
    ecmwfr:::wf_server(service = "cds"),
    # NOTE THE DIFFERENENT ENDPOINT FOR TOOLBOX EDITOR APPS ending in run_workflow
    gsub("\\.", "/","tool.toolbox.orchestrator.run_workflow"),
    uuid::UUIDgenerate(output = "string")
  ),
  httr::authenticate(user, key),
  httr::add_headers("Accept" = "application/json",
                    "Content-Type" = "application/json"),
  body = request,
  encode = "json"
)

# The returned content, this is not the data
# only where to get the data if successful
ct <- httr::content(response)
print(ct)

# You can reconstitute the URL where to download things
# as such
url <- ecmwfr:::wf_server(id = ct$request_id, service = "cds")

# update the response to see if the download
# is available (you can check this on the CDS page)
# but in the package for normal requests this query
# is cycled until "completed", see below
response <- httr::GET(
  url,
  httr::authenticate(user, key),
  httr::add_headers("Accept" = "application/json",
                    "Content-Type" = "application/json"),
  encode = "json"
)

# grab content from response
ct <- httr::content(response)

print(ct)

# if the response states that it is completed
# grab the location from the response element and
if (ct$state == "completed") {
  download.file(
    ct$result[[1]]$location,
    "./test.nc"
  )
}

khufkens commented 1 year ago

Ok, can confirm. You can just throw code at a run_workflow call, no need to have it available in your workspace.

eliocamp commented 1 year ago

This would be like creating a new anonymous application on the fly, wouldn't it? This is even more advanced.

I think this could be supported in a new function wf_cds_application() or something like that. One of the arguments would be the application, which could be an application name or the code (or, even better, a file that holds the code, since having the code inside a single string is very unwieldy).

I'm a bit confused on the difference between this:

request = list(
  "args" = "",
  kwargs = list(
    params = list(
      kwargs = list(
        "dataset" = "reanalysis-era5-single-levels",
        "product_type" = "reanalysis",
        "variable" = "2m_temperature",
        "statistic" = "daily_mean",
        "year" = "2020",
        "month" = "01",
        "time_zone" = "UTC+00:0",
        "frequency" = "1-hourly",
        "grid" = "0.25/0.25",
        "area" = list(
          lat = list(38, 60),
          lon = list(-20,20)
        )
      ),
      "workflow_name" = "application",
      "realm" = "c3s",
      "project" = "app-c3s-daily-era5-statistics",
      "version" = "master"
    )
  )
)

which seems to have all the parameters inside the first kwargs element.

And this

request = list(
      code = code,
      kwargs = list(
        "lat" = 43,
        "lon" = 75,
        "var" = "Near-Surface Air Temperature"
      ),
      "workflow_name" = "plot_time_series"
    )

which has the parameters as the top-level elements of the list. Does the format of the request depend on the application?

khufkens commented 1 year ago

Different endpoints!

The first one runs on: tool.toolbox.orchestrator.workflow

and side loads a git repo as far as I could tell from debugging

the second one runs the code in code clean at: tool.toolbox.orchestrator.run_workflow # note RUN there

And yes, you could create anonymous functions on the fly like this.

khufkens commented 1 year ago

And to answer your question last question, the format between the fully exposed workflows and the ones you forward to the cds toolbox editor are a bit different. I'm also not aware of how many "normal" workflows there are (formally released by ECMWF). Obviously you can code infinite "run_workflows".

khufkens commented 1 year ago

For reference, the formal python cdsapi parses these workflows and depending on the input queries the correct input. I guess it would be best to aim for something similar. In case of forwarding code, where the code comes from should be flexible indeed.

eliocamp commented 1 year ago

I'm also not aware of how many "normal" workflows there are (formally released by ECMWF).

There is a bunch.

Do they need to be formally released or could you run any workflow that you make yourself in the toolbox editor?

khufkens commented 1 year ago

The ones you link with docs are formal and run on the normal workflow. Question is if they all document their parameters well and if they return data, rather than figures and maps (seems to be a lot of the latter). Not sure if you can grab figures as well as data.

You can write whatever for the toolbox editor I think, if it runs there it runs from the API I think on run_workflow.

khufkens commented 1 year ago

I pushed my demo code to an analysis folder:

https://github.com/bluegreen-labs/ecmwfr/tree/master/analysis

I would suggest, have a look. This should explain a lot. And you can mess with things.

Seems like some of the workflow applications are embedded and out of reach :disappointed:

eliocamp commented 1 year ago

I noticed that you can copy the source code, put it in the code variable and use the run_workflow endpoint.

I grabbed this code, saved it in a file era5.py and then used this code in the test_run_workflow.R script.


code <- readLines("analysis/era5.py") |>
  paste0(collapse = "\n")

# This query works
# single month of mean daily temperature
request = list(
      code = code,
        kwargs = list(
          "dataset" = "reanalysis-era5-single-levels",
          "product_type" = "reanalysis",
          "variable" = "2m_temperature",
          "statistic" = "daily_mean",
          "year" = "2020",
          "month" = "01",
          "time_zone" = "UTC+00:0",
          "frequency" = "1-hourly",
          "grid" = "0.25/0.25",
          "area" = list(
            lat = list(38, 60),
            lon = list(-20,20)
          )
      ),
      "workflow_name" = "application"
    )

And it seemed to work.

I suspect that might make the code more reproducible since it doesn't depend on the remote application code. It might also make it a bit easier to create the request format, without all the nested list. Just the code and the arguments.

Then, it sees like the only difference with the basic request is the http verb and the URL. Right?

So to support the run_workflow thing we would only need to implement this part:

response <- httr::PUT(
  sprintf(
    "%s/tasks/services/%s/clientid-%s",
    ecmwfr:::wf_server(service = "cds"),
    # NOTE THE DIFFERENENT ENDPOINT FOR TOOLBOX EDITOR APPS
    gsub("\\.", "/","tool.toolbox.orchestrator.run_workflow"),
    uuid::UUIDgenerate(output = "string")
  ),
  httr::authenticate(user, key),
  httr::add_headers("Accept" = "application/json",
                    "Content-Type" = "application/json"),
  body = request,
  encode = "json"
)

Oh, and also remove a lot of the logic used to validate the datasets and search for the correct username and stuff.

khufkens commented 1 year ago

Yep, this makes sense. The app they put online is running the CDS Toolbox code, only exposed differently.

This could indeed save us the hassle of dealing with a gazillion workflows etc. Just implementing the run_workflow setup would cover most use cases then. Cool stuff.

eliocamp commented 1 year ago

And it's true that almost all applications return some kind of interactive figure (or "livefigure"), so it's not really applicable to what this package does.

I can try to add support for this.

I can create a cds_workflow object that inherits from cds_service and only changes the http verb and the URL.

khufkens commented 1 year ago

I'm calling it a day at my end. See where you get, but no rush. This has been sort of a happy accident stumbling upon this.

Drop me a line here if I need to pick something up tomorrow morning (I've got an hour long commute to fill), if not I'll have a go at it in a new branch (will let you know).

eliocamp commented 1 year ago

I'll work on it since I can use it to procrastinate on reworking a manuscript :P

khufkens commented 1 year ago

Ah, a true scholar I see :)

bluegreen-labs / ecmwfr

Daily averages using CDS API #107