Open-EO / openeo-python-client

Python client API for OpenEO
https://open-eo.github.io/openeo-python-client/
Apache License 2.0

Downloading STAC files in download_files #184

Open m-mohr opened 3 years ago

m-mohr commented 3 years ago

For CARD4L compliance of downloaded jobs, the client needs to also store the STAC files. As far as I can see that is not the case yet.

Also, it doesn't seem to follow sub-catalogs or item links. That is not required by API 1.0, but it is needed for the ARD download (and it would already be future-proof).

Maybe we can write something universal (or use PySTAC - that will probably handle more edge cases?) so that I could also re-use them for the download implementation in #179? Something that crawls through the STAC catalog and downloads all assets including the STAC files it reads on its way.

m-mohr commented 3 years ago

For CARD4L for each data file (e.g. a scene) you need to also store at least a STAC file with the CARD4L metadata.

So, for example, you get a STAC collection that links to three STAC items.

GET /jobs/:id/results

{
  id: "pre-computed-card4l",
  ...
  assets: [
    ... contains all downloadable assets for compatibility reasons
  ],
  links: [
   { href: 'item1.json', rel: 'item' },
   { href: 'item2.json', rel: 'item' },
   { href: 'item3.json', rel: 'item' }
  ]
}

item1/2/3.json:

{
  id: "item-1/2/3",
  ...
  card4l:specification: ...,
  eo:cloud_cover: ...,
  ... all card4l metadata
  assets: [
    ... contains all assets for this item
  ],
  links: []
}

So in the end you need to follow the child/item links in the collection and store all those files too, together with the result from GET /jobs/:id/results. Every file you pass through needs to be stored, as it contains the metadata.
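As a rough illustration of the "follow the child/item links" step, here is a minimal sketch that resolves the linked STAC documents to absolute URLs. The function name `linked_stac_urls` and the `backend.example` URLs are hypothetical and not part of the openEO client; the sketch only assumes that relative hrefs are relative to the URL of the document that contains them.

```python
from urllib.parse import urljoin


def linked_stac_urls(doc, base_url, rels=("child", "item")):
    """Return absolute URLs of the STAC documents linked from `doc`."""
    urls = []
    for link in doc.get("links", []):
        if link.get("rel") in rels and "href" in link:
            # Relative hrefs such as "item1.json" are resolved against the
            # URL the current document was fetched from.
            urls.append(urljoin(base_url, link["href"]))
    return urls


# Hypothetical job result document, modelled on the example above.
results = {
    "id": "pre-computed-card4l",
    "links": [
        {"href": "item1.json", "rel": "item"},
        {"href": "item2.json", "rel": "item"},
        {"href": "https://example.com/spec.pdf", "rel": "card4l-document"},
    ],
}
urls = linked_stac_urls(results, "https://backend.example/jobs/123/results")
```

Links with other relation types (here the `card4l-document` link) are ignored, so only STAC metadata files end up in the list.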

m-mohr commented 3 years ago

You may also need to adjust the links in the stored files, because the location changes and the references in links and assets could break. Overall, this is a bit of an effort, so I'm wondering whether it would make sense to check if PySTAC could do the work for us.
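To make the link adjustment concrete, here is a minimal sketch that rewrites asset hrefs so they point to files stored next to the metadata file. The helper name `localize_asset_hrefs` and the example URL are hypothetical, and the sketch assumes that `assets` is a dictionary (as in STAC) and that asset file names are unique within one job.

```python
import copy
import posixpath
from urllib.parse import urlparse


def localize_asset_hrefs(doc):
    """Return a copy of `doc` whose asset hrefs point to local files."""
    local = copy.deepcopy(doc)
    for asset in local.get("assets", {}).values():
        # Keep only the file name of the remote URL (query string dropped),
        # e.g. ".../results/tile1.tif?sig=abc" becomes "./tile1.tif".
        name = posixpath.basename(urlparse(asset["href"]).path)
        asset["href"] = "./" + name
    return local


# Hypothetical item with one remote asset.
item = {
    "id": "item-1",
    "assets": {
        "tile1.tif": {
            "href": "https://backend.example/jobs/123/results/tile1.tif?sig=abc"
        }
    },
}
local_item = localize_asset_hrefs(item)
```

Working on a deep copy keeps the original server response intact, which helps if it should also be stored unmodified.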

soxofaan commented 3 years ago

To tackle this properly, we have to include this in our internal sprint planning, I think. What do you think @jdries ?

jdries commented 3 years ago

So to list the current state in client and backend. The backend now returns this when requesting job metadata:

results = my_job.get_results()
results.get_metadata()

Returns:

{
   "assets":{
      "s1_rtc_02F97C_N46E010_2021_01_03_MULTIBAND.tif":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_02F97C_N46E010_2021_01_03_MULTIBAND.tif",
         "type":"image/tiff; application=geotiff"
      },
      "s1_rtc_02F97C_N46E010_2021_01_03_metadata.json":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_02F97C_N46E010_2021_01_03_metadata.json",
         "type":"application/json"
      },
      "s1_rtc_02F97C_N46E011_2021_01_03_MULTIBAND.tif":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_02F97C_N46E011_2021_01_03_MULTIBAND.tif",
         "type":"image/tiff; application=geotiff"
      },
      "s1_rtc_02F97C_N46E011_2021_01_03_metadata.json":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_02F97C_N46E011_2021_01_03_metadata.json",
         "type":"application/json"
      },
      "s1_rtc_02FB1E_N46E010_2021_01_07_MULTIBAND.tif":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_02FB1E_N46E010_2021_01_07_MULTIBAND.tif",
         "type":"image/tiff; application=geotiff"
      },
      "s1_rtc_02FB1E_N46E010_2021_01_07_metadata.json":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_02FB1E_N46E010_2021_01_07_metadata.json",
         "type":"application/json"
      },
      "s1_rtc_02FB1E_N46E011_2021_01_07_MULTIBAND.tif":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_02FB1E_N46E011_2021_01_07_MULTIBAND.tif",
         "type":"image/tiff; application=geotiff"
      },
      "s1_rtc_02FB1E_N46E011_2021_01_07_metadata.json":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_02FB1E_N46E011_2021_01_07_metadata.json",
         "type":"application/json"
      },
      "s1_rtc_04393F_N46E010_2021_01_08_MULTIBAND.tif":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_04393F_N46E010_2021_01_08_MULTIBAND.tif",
         "type":"image/tiff; application=geotiff"
      },
      "s1_rtc_04393F_N46E010_2021_01_08_metadata.json":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_04393F_N46E010_2021_01_08_metadata.json",
         "type":"application/json"
      },
      "s1_rtc_04393F_N46E011_2021_01_08_MULTIBAND.tif":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_04393F_N46E011_2021_01_08_MULTIBAND.tif",
         "type":"image/tiff; application=geotiff"
      },
      "s1_rtc_04393F_N46E011_2021_01_08_metadata.json":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_04393F_N46E011_2021_01_08_metadata.json",
         "type":"application/json"
      },
      "s1_rtc_043A0E_N46E010_2021_01_09_MULTIBAND.tif":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_043A0E_N46E010_2021_01_09_MULTIBAND.tif",
         "type":"image/tiff; application=geotiff"
      },
      "s1_rtc_043A0E_N46E010_2021_01_09_metadata.json":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_043A0E_N46E010_2021_01_09_metadata.json",
         "type":"application/json"
      },
      "s1_rtc_043A0E_N46E011_2021_01_09_MULTIBAND.tif":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_043A0E_N46E011_2021_01_09_MULTIBAND.tif",
         "type":"image/tiff; application=geotiff"
      },
      "s1_rtc_043A0E_N46E011_2021_01_09_metadata.json":{
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results/s1_rtc_043A0E_N46E011_2021_01_09_metadata.json",
         "type":"application/json"
      }
   },
   "bbox":[
      10.993537902832033,
      46.44258864468262,
      11.706619262695314,
      46.53713734839792
   ],
   "geometry":{
      "coordinates":[
         [
            [
               10.993537902832033,
               46.44258864468262
            ],
            [
               10.993537902832033,
               46.53713734839792
            ],
            [
               11.706619262695314,
               46.53713734839792
            ],
            [
               11.706619262695314,
               46.44258864468262
            ],
            [
               10.993537902832033,
               46.44258864468262
            ]
         ]
      ],
      "type":"Polygon"
   },
   "id":"10769e53-956b-480b-b977-2d71977fc4f6",
   "links":[
      {
         "href":"https://openeo-dev.vito.be/openeo/1.0/jobs/10769e53-956b-480b-b977-2d71977fc4f6/results",
         "rel":"self",
         "type":"application/json"
      },
      {
         "href":"http://ceos.org/ard/files/PFS/SR/v5.0/CARD4L_Product_Family_Specification_Surface_Reflectance-v5.0.pdf",
         "rel":"card4l-document",
         "type":"application/pdf"
      }
   ],
   "properties":{
      "created":"2021-03-05T10:28:22Z",
      "datetime":"None",
      "end_datetime":"2021-01-10T00:00:00Z",
      "processing:lineage":{
         "process_graph":{
            "ardnormalizedradarbackscatter1":{
               "arguments":{
                  "data":{
                     "from_node":"loadcollection1"
                  },
                  "elevation_model":"None",
                  "ellipsoid_incidence_angle":false,
                  "noise_removal":true
               },
               "process_id":"ard_normalized_radar_backscatter"
            },
            "discardresult1":{
               "arguments":{
                  "data":{
                     "from_node":"ardnormalizedradarbackscatter1"
                  }
               },
               "process_id":"discard_result"
            },
            "loadcollection1":{
               "arguments":{
                  "bands":[
                     "VV",
                     "VH"
                  ],
                  "id":"SENTINEL1_GRD",
                  "spatial_extent":{
                     "crs":"EPSG:4326",
                     "east":11.706619262695314,
                     "north":46.53713734839792,
                     "south":46.44258864468262,
                     "west":10.993537902832033
                  },
                  "temporal_extent":[
                     "2021-01-01",
                     "2021-01-10"
                  ]
               },
               "process_id":"load_collection"
            },
            "saveresult1":{
               "arguments":{
                  "data":{
                     "from_node":"discardresult1"
                  },
                  "format":"NetCDF",
                  "options":{

                  }
               },
               "process_id":"save_result",
               "result":true
            }
         }
      },
      "start_datetime":"2021-01-01T00:00:00Z",
      "updated":"2021-03-05T10:59:39Z"
   },
   "stac_extensions":[
      "processing"
   ],
   "stac_version":"0.9.0",
   "type":"Feature"
}

The user can then download all files with results.download_files(). That downloads both the JSON metadata and the tif files. So the question is: what is still missing here? We don't seem to have the links to the JSON items, is that it?

soxofaan commented 3 years ago

currently the client downloads the files under "assets" field of /jobs/{}/results

what is requested here is to also walk the "item" links under the "links" field, parse them and download the "assets" from these as well.

soxofaan commented 3 years ago

This seems a bit in contradiction with the "links" documentation (https://github.com/Open-EO/openeo-api/blob/81dd248b9e8b8caeedbdd96c6a9dd8f3f56d0eb7/openapi.yaml#L3613-L3622):

... The links MUST NOT contain links to the processed and downloadable data. Instead specify these in the assets property.

m-mohr commented 3 years ago

The current API assumes just one metadata file, which contains all assets. This will still be true until API 2.0 (see https://github.com/Open-EO/openeo-api/pull/359), but it won't be enough to get CARD4L compliant data to the user, as you'll need multiple metadata files there, e.g. for SAR. Those files need to be stored as well, and that's not the case yet. Strictly speaking, we would not need to store the assets from those files, as they are already listed in the first file for backward compatibility, but long-term it would still be a good idea.

By the way, the example above is far from being CARD4L compliant. It's a single file for multiple "scenes" it seems and doesn't have any details about the source data.

Here's another example, which hopefully clarifies what we'd need for 2 SAR tiles:

-> collection (as returned by GET /jobs/:id/result, contains all assets (e.g. tile1.tif, tile2.tif) for compatibility reasons and minimal metadata, links to child items)
---> item1.json (contains assets for tile 1, e.g. tile1.tif, links to STAC item for source data with different rel type)
-----> item1_source.json (contains source assets + metadata)
---> item2.json (contains assets for tile 2, e.g. tile2.tif, links to STAC item for source data with different rel type)
-----> item1_source.json (contains source assets + metadata)

What we need to store is at least all JSON files above and tile1/2.tif. Storing the collection would also be good. We don't need to store the source assets.
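The storage rule above (all JSON files plus the tile assets, but not the source assets) could be sketched as a small walker. Everything here is an assumption for illustration: the name `plan_downloads`, the injected `fetch` callable, the `b.example` URLs, and the use of `derived_from` as the relation type for source metadata, since the thread only says "different rel type".

```python
from urllib.parse import urljoin


def plan_downloads(root_url, fetch, asset_rels=("child", "item"),
                   metadata_only_rels=("derived_from",)):
    """Walk a STAC tree; return (metadata_urls, asset_urls) to store."""
    metadata_urls, asset_urls, seen = [], [], set()
    queue = [(root_url, True)]
    while queue:
        url, with_assets = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        doc = fetch(url)  # `fetch` returns the parsed JSON document
        metadata_urls.append(url)
        if with_assets:
            for asset in doc.get("assets", {}).values():
                href = urljoin(url, asset["href"])
                if href not in asset_urls:  # assets are listed twice
                    asset_urls.append(href)
        for link in doc.get("links", []):
            target = urljoin(url, link.get("href", ""))
            if link.get("rel") in asset_rels and with_assets:
                queue.append((target, True))
            elif link.get("rel") in metadata_only_rels:
                # Source metadata is stored, its assets are not.
                queue.append((target, False))
    return metadata_urls, asset_urls


# Hypothetical in-memory "server" for two tiles.
base = "https://b.example/jobs/1/"
docs = {base + "results": {
    "assets": {"t1": {"href": base + "tile1.tif"},
               "t2": {"href": base + "tile2.tif"}},
    "links": [{"rel": "item", "href": "item1.json"},
              {"rel": "item", "href": "item2.json"}]}}
for n in ("1", "2"):
    docs[base + f"item{n}.json"] = {
        "assets": {"t": {"href": base + f"tile{n}.tif"}},
        "links": [{"rel": "derived_from", "href": f"item{n}_source.json"}]}
    docs[base + f"item{n}_source.json"] = {
        "assets": {"src": {"href": f"https://src.example/scene{n}.zip"}}}

metadata_urls, asset_urls = plan_downloads(base + "results", docs.__getitem__)
```

Passing `fetch` in as a parameter keeps the traversal independent of the HTTP layer, so the same function could be fed from `requests`, the openEO connection, or a test fixture as above.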

m-mohr commented 3 years ago

We should either try to use pystac or just write a generic "STAC walker" that does the work for us, which we could pass a STAC to from e.g. collection_items or get_results.

soxofaan commented 3 years ago

I don't know the current or target feature set of pystac yet, but I'm wondering if it is necessary to replicate/wrap STAC functionality in openeo python client library if pystac already provides this.

Like the openeo python client, pystac is still in a pre-1.0 development phase, so adding pystac as a dependency might cause some annoyances at this point. But even in the longer term it might make sense to avoid a hard dependency between the two projects. The majority of users will be familiar with mixing projects anyway (e.g. like they already do with numpy/scipy/pandas, matplotlib, shapely, ...).

But anyway: to further investigate pystac and experiment we should probably plan some time at VITO

m-mohr commented 3 years ago

I'll give it a try today and see how far I get. It seems it's more aimed towards static catalogs, not APIs.

But I think downloading data is a core part of the workflow, so I'd expect that the Python client provides me with all data. Otherwise, we should not provide any downloading functionality at all in the client and leave it always up to the user to use external libraries for it. Because if there's a download function I'd expect that it gives me all the data...

jdries commented 3 years ago

ESA had a similar request in the RIDs: STAC json should be generated and stored next to downloaded files.

m-mohr commented 3 years ago

By the way, PySTAC will release a 1.0 soonish (another month or so). So that would probably be the best option for Python right now. There's also PySTAC client (more API focus) in the pipeline, so I wouldn't put too much effort on our side into a custom implementation and just wait for those...

soxofaan commented 2 years ago

As proof of concept I just added a simple addition to JobResults.download_files() to (by default) download the GET /jobs/{job_id}/results response as a file "job-results.json".

Some questions for the next iterations on this feature:

m-mohr commented 2 years ago

There is only a STAC best practice: https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#catalog-layout

Essentially, use catalog.json / collection.json for Catalogs and Collections. Use <Item ID>.json for Items.

There's also some wording regarding the links/urls/paths, but that doesn't take away the decision. I assume you'd want to point to the downloaded files with relative links so that you can upload it e.g. to S3 without (larger) modifications. The self link is a bit tricky in this case and may need to be removed. The canonical link should point back to the original STAC metadata on the server.
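The naming and link suggestions above could look roughly like the following sketch. The helper names `local_filename` and `prepare_for_local_copy` and the example URL are hypothetical. The sketch assumes the document carries a `type` field (as in STAC 1.0); documents without it are treated like Items here, which is a simplification.

```python
import copy


def local_filename(doc):
    """File name following the STAC best-practice catalog layout."""
    kind = doc.get("type")
    if kind == "Catalog":
        return "catalog.json"
    if kind == "Collection":
        return "collection.json"
    # Items (type "Feature") and, as an assumption, anything untyped.
    return f"{doc['id']}.json"


def prepare_for_local_copy(doc, original_url):
    """Drop the self link and point a canonical link at the server copy."""
    local = copy.deepcopy(doc)
    links = [
        link for link in local.get("links", [])
        if link.get("rel") not in ("self", "canonical")
    ]
    links.append(
        {"rel": "canonical", "href": original_url, "type": "application/json"}
    )
    local["links"] = links
    return local


# Hypothetical item as returned by a backend.
url = "https://backend.example/jobs/123/results/item-1.json"
item = {
    "type": "Feature",
    "id": "item-1",
    "links": [{"rel": "self", "href": url}],
}
name = local_filename(item)
local_item = prepare_for_local_copy(item, url)
```

Removing `self` avoids a stale absolute reference once the files are moved (for example to S3), while `canonical` still records where the metadata originally came from.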

soxofaan commented 2 years ago

Essentially, use catalog.json / collection.json for Catalogs and Collections. Use .json for Items.

In the classic case the job result is a STAC item, so it's the latter. I assume you meant <id>.json instead of .json.

But does that mean that the filename would be something like 1b77ba83-c64e-4acc-ae5b-872743a781f4.json? I wonder how user friendly such a cryptic filename is. In a URL those details don't really matter, but for a file on a user's file system, it's not very user friendly.

m-mohr commented 2 years ago

Yeah, GitHub thought the <Item ID> is HTML and removed it.

In STAC that is usually not an issue as it is linked to from a catalog or collection. I think you could also simply name it item.json if there's only one item, otherwise, the Item ID makes sense with a catalog on top, I think.

soxofaan commented 2 years ago

ok thanks for the feedback

some remaining todo's for this ticket

jdries commented 3 months ago

Working on a script to download a full stac collection from openEO job result, trying to fix links. It's not super long yet, but I was expecting to find a tool for this.

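import pystac

# collection_url is assumed to hold the URL of the STAC collection
# of the openEO job result (not defined in this snippet)
catalog = pystac.read_file(collection_url)

# Remove the collection and canonical links before saving locally
catalog.remove_links(rel="collection")
catalog.remove_links(rel="canonical")
items = list(catalog.get_stac_objects(rel=pystac.RelType.ITEM))
for i in items:
    i.remove_links(rel="collection")
    i.remove_links(rel="canonical")

# Re-root the catalog under /tmp/
catalog.set_self_href("/tmp/agera.json")
catalog.normalize_hrefs("/tmp/", skip_unresolved=True)

def asset_transform(name, a):
    # Point each asset href to a local file named after its asset key
    print(a)
    a.href = "/tmp/" + name
    return a

# map_assets returns a modified copy of the catalog
c2 = catalog.map_assets(asset_transform)

items2 = list(c2.get_stac_objects(rel=pystac.RelType.ITEM))
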
c2.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)

m-mohr commented 3 months ago

Fyi: For asset download you could try https://github.com/stac-utils/stac-asset , @jdries

jdries commented 2 months ago

Thanks a lot! This is mostly what I was looking for. I gave it a try, unfortunately it seems to support download of ItemCollection but not the type of collection we have in our openEO backend.

m-mohr commented 2 months ago

@jdries Opening an issue in their repo may help to get it implemented.