Permian-Global-Research / rsi

Code for Retrieving STAC Information, addressing Repeated Spatial Infelicities and interfacing with Rsome Spectral Indices
https://permian-global-research.github.io/rsi/
Apache License 2.0

sign function for Copernicus Data Space STAC #9

Open mateuszrydzik opened 9 months ago

mateuszrydzik commented 9 months ago

I tried to access data from Copernicus Data Space Ecosystem STAC using get_stac_data(), but it ends up returning 401 responses:

copernicus <- get_stac_data(
  aoi,
  start_date = "2023-01-01",
  end_date = "2023-01-31",
  stac_source = "https://catalogue.dataspace.copernicus.eu/stac/",
  collection = "SENTINEL-2",
  output_filename = tempfile(fileext = ".tif"),
  query_function = query_planetary_computer
)

If I understand correctly, both query_planetary_computer() and sign_planetary_computer() are based on rstac, which provides sign functions only for Planetary Computer and Brazil Data Cube. Are you planning to add sign functions for other STAC APIs as part of the rsi library, or will that depend more on rstac development?

mikemahoney218 commented 9 months ago

I'd recommend opening an issue on rstac, which would be a more natural place for this function to live. If they don't respond or don't want to add a signing function, then maybe it could live in rsi, but I think rstac would make more sense.

As for getting that to work with rsi: the actual sign_planetary_computer() function is extremely straightforward, and basically just handles automatically using your PC credentials if they exist:

> rsi::sign_planetary_computer
function(items, subscription_key = Sys.getenv("rsi_pc_key")) {
  if (subscription_key == "") {
    rstac::items_sign(items, rstac::sign_planetary_computer())
  } else {
    rstac::items_sign(
      items,
      rstac::sign_planetary_computer(
        headers = c("Ocp-Apim-Subscription-Key" = subscription_key)
      )
    )
  }
}

And query_planetary_computer() is even simpler -- the main reason it exists is that some versions of AWS' STAC endpoints require post_request() rather than get_request(), and it feels nicer to be able to name the data source rather than needing to just know what HTTP method you need:

> rsi::query_planetary_computer
function(q, subscription_key = Sys.getenv("rsi_pc_key")) {
  rstac::get_request(q)
}

So if you know how to sign these items, and you know what HTTP method they need, it should be pretty straightforward to make rsi work with this endpoint.

Can you share the link to the STAC API you're using here? I'll confess I get pretty mixed up with all the different ESA endpoints.

mateuszrydzik commented 9 months ago

Thanks for the reply.

I checked rstac::get_request() and found that you can pass in httr::add_headers() with the required tokens (e.g. get_request(add_headers("x-api-key" = "MY-TOKEN"))). I will test whether I can get any use out of it. If not, as you recommended, I will move this issue to rstac.
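A minimal sketch of that idea, written in the same shape as query_planetary_computer() above. The names bearer_header() and query_with_token() are hypothetical, as is reading the token from an rsi_cdse_key environment variable, and the "Authorization: Bearer" scheme is an assumption about what the endpoint expects. It is also not established here whether the search request itself needs the token at all.

```r
# Build the value of an Authorization header from an access token.
# Assumption: the endpoint accepts "Bearer <token>" authorization.
bearer_header <- function(token) {
  stopifnot(is.character(token), length(token) == 1, nzchar(token))
  paste("Bearer", token)
}

# Hypothetical query function mirroring query_planetary_computer():
# it forwards the query to rstac::get_request() and adds the token
# as an HTTP header via httr::add_headers().
query_with_token <- function(q, token = Sys.getenv("rsi_cdse_key")) {
  rstac::get_request(
    q,
    httr::add_headers(Authorization = bearer_header(token))
  )
}
```

If the function behaves as intended, it could be passed as query_function = query_with_token in the get_stac_data() call from the first comment.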

As for the Dataspace API, here is the link to the documentation, which includes some examples: https://documentation.dataspace.copernicus.eu/APIs/STAC.html

The main catalog can be accessed at https://catalogue.dataspace.copernicus.eu/stac/. As an example, Sentinel-2 items are stored in https://catalogue.dataspace.copernicus.eu/stac/collections/SENTINEL-2/items

mikemahoney218 commented 9 months ago

Thank you! I'm not saying I'll add support for the Dataspace API soon, but I think it would make sense for there to at least be support in sentinel2_band_mapping for that endpoint, and will do that at some point (with a soft target of "before the first CRAN release in early 2024").

mikemahoney218 commented 8 months ago

Ah, I remember the issue now: the Dataspace STAC API returns assets that link to its OData service, which would require a different approach to downloading than the other STAC APIs that rsi currently works with.

The core issues are:

  1. the API requires passing tokens as headers, rather than signing URLs;
  2. data is returned as zip files;
  3. users are limited to only four downloads at a time;
  4. the API is so slow;
  5. links to assets time out for non-obvious reasons;
  6. items are composed of a preview image, and then a zipped tile of (presumably) all other relevant data.

Dealing with issue 1 should be possible; GDAL can use a file of key: value headers when downloading via curl.
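A rough sketch of what that could look like from R, using GDAL's GDAL_HTTP_HEADER_FILE configuration option. The helper name write_gdal_header_file() is hypothetical, and the "Authorization: Bearer" header is an assumption about what the API expects.

```r
# Write a "key: value" header file and point GDAL at it.
# Assumption: the API accepts "Authorization: Bearer <token>".
write_gdal_header_file <- function(token, path = tempfile(fileext = ".txt")) {
  writeLines(paste0("Authorization: Bearer ", token), path)
  # GDAL reads configuration options from environment variables,
  # so this should apply to GDAL calls made later in the session.
  Sys.setenv(GDAL_HTTP_HEADER_FILE = path)
  invisible(path)
}
```

The file holds a secret, so writing it to a temporary location and deleting it after the download seems prudent.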

Dealing with issue 2 might be possible by using /vsizip/ when downloading from the Copernicus API. I'm still waiting for my trial download to finish to inspect what's actually in the downloaded file.
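For reference, a sketch of how such a path could be assembled. The helper name vsizip_remote_path() is hypothetical. The braces form of /vsizip/ is GDAL's syntax for wrapping another virtual path whose name does not end in ".zip". Whether this works against the Copernicus service is untested here; /vsicurl/ in particular relies on the server supporting range requests.

```r
# Build a GDAL virtual file system path that reads inside a remote zip.
# The {braces} form lets /vsizip/ wrap another virtual path even when
# the URL does not end in ".zip".
vsizip_remote_path <- function(url, inner_path = "") {
  archive <- paste0("/vsizip/{/vsicurl/", url, "}")
  if (nzchar(inner_path)) {
    paste0(archive, "/", inner_path)
  } else {
    archive
  }
}

# Example with a placeholder URL (not a real asset):
vsizip_remote_path("https://example.com/tile.zip", "some/band.jp2")
```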

Dealing with issue 3 gets tricky, and will require not using (or at least limiting) parallelism when downloading from this API.
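One simple way to respect that limit is to split the assets into batches of at most four and finish each batch before starting the next. A minimal sketch, with a hypothetical helper name:

```r
# Split a vector of download URLs into batches of at most `max_concurrent`.
batch_downloads <- function(urls, max_concurrent = 4L) {
  stopifnot(max_concurrent >= 1)
  if (length(urls) == 0) {
    return(list())
  }
  # Consecutive URLs share a batch index: 1, 1, 1, 1, 2, 2, ...
  unname(split(urls, ceiling(seq_along(urls) / max_concurrent)))
}

# Ten placeholder URLs end up in batches of 4, 4 and 2.
batches <- batch_downloads(paste0("https://example.com/asset-", 1:10))
lengths(batches)
```

Each batch could then be handed to whatever parallel download mechanism is in use, or the batches could simply be processed one URL at a time.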

Issue 4 seems intractable :laughing:

Issue 5 might be due to trying a URL and getting a 401, then trying again; it might also be due to a time-out. The first of these is easy to deal with (don't retry failed downloads); the second would be harder (it would need a way to re-query the API once downloads started failing).

Issue 6 just makes this endpoint unappealing, since users can't filter their downloads to only the relevant bands. Maybe if the vsizip trick works, this is something that can be controlled via the -b flag to gdalwarp?
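If that route works, the band selection could be expressed as repeated -b arguments, which gdalwarp accepts from GDAL 3.7 onwards. The helper name warp_band_args() is hypothetical, and which band numbers correspond to which Sentinel-2 bands in the zipped product is not known here.

```r
# Turn a vector of band numbers into repeated "-b <n>" arguments.
warp_band_args <- function(bands) {
  stopifnot(is.numeric(bands), length(bands) > 0, all(bands >= 1))
  # rbind() interleaves the flag with each band number.
  as.vector(rbind("-b", as.character(as.integer(bands))))
}

warp_band_args(c(2, 3))

# Possible use with sf (placeholder source and destination, not run):
# sf::gdal_utils(
#   "warp",
#   source = "input.tif",
#   destination = "output.tif",
#   options = warp_band_args(c(2, 3))
# )
```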

All this said: I think this would take a bigger rewrite than I had expected, and wouldn't be super useful due to how slow the API is and the fact it returns zipped versions of entire tiles. I'm going to move this off the 0.1.0 milestone but leave the issue open, in case the API changes to use a more... normal way of sharing assets, or someone finds an easy way to work with this API in the same way as other STAC APIs.

Pure rstac example of downloading from this API (assuming the rsi_cdse_key envvar is an access token):

nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
ashe <- nc[1, ] |> 
  sf::st_transform(4326) |> 
  sf::st_bbox()

items <- rstac::stac("https://catalogue.dataspace.copernicus.eu/stac") |> 
  rstac::stac_search(
    collections = "SENTINEL-2",
    datetime = "2021-01-01/2021-12-31",
    bbox = ashe
  ) |> 
  rstac::get_request()

items$features <- items$features[1]

items |> 
  rstac::assets_download(
    config = httr::add_headers(
      Authorization = paste("Bearer", Sys.getenv("rsi_cdse_key"))
    )
  )

mikemahoney218 commented 8 months ago

A bit more context now that my download finished -- it seems like it took about 40 minutes on my residential connection to download just under 1GB of data.

I'm seeing now that GDAL's Sentinel-2 driver understands how to process these zip files directly, so it might not actually be that painful to rework the download method. Doing band name reassignments (and providing a friendly method for selecting specific bands) might be trickier.
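A small sketch for inspecting what the driver exposes in a downloaded archive, assuming the zip has already been saved locally and that the installed GDAL build includes the Sentinel-2 driver. The function name list_s2_subdatasets() is hypothetical.

```r
# List the subdatasets GDAL reports for a downloaded Sentinel-2 zip.
# Assumes a local file and a GDAL build with the Sentinel-2 driver.
list_s2_subdatasets <- function(zip_path) {
  stopifnot(file.exists(zip_path))
  sf::gdal_subdatasets(zip_path)
}

# Usage (placeholder path, not run):
# list_s2_subdatasets("path/to/downloaded-tile.zip")
```

The returned subdataset names could then be passed to GDAL-based readers to open a specific resolution group.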