Element84 / earth-search

Earth Search information and issue tracking
https://earth-search.aws.element84.com/v1

Proposal: remove jpeg2000 assets from Sentinel2 L2A items #21

Closed: jkeifer closed this issue 8 months ago

jkeifer commented 9 months ago

The Sentinel 2 L2A items currently include a set of *_jp2 assets pointing to the "upstream" jpeg2000 files in the sentinel-s2-l2a bucket, which are used as the inputs for creating the COG assets.

Because the COG assets are simply the same data in a different format, we end up with a lot of duplication within the metadata. For example, here is a COG asset (from a development version of the processor):

        "red": {
          "href": "/roda-sentinel2/workflow-sentinel-2-c1-l2a-to-stac/tiles-1-L-AC-2023-4-18-0/B04.tif",
          "type": "image/tiff; application=geotiff; profile=cloud-optimized",
          "title": "Red (band 4) - 10m",
          "view:azimuth": 141.273664333548,
          "view:incidence_angle": 2.94395846985518,
          "eo:bands": [
            {
              "name": "red",
              "common_name": "red",
              "description": "Red (band 4)",
              "center_wavelength": 0.665,
              "full_width_half_max": 0.038
            }
          ],
          "gsd": 10,
          "proj:shape": [
            10980,
            10980
          ],
          "proj:transform": [
            10.0,
            0.0,
            99960.0,
            0.0,
            -10.0,
            8300020.0
          ],
          "raster:bands": [
            {
              "nodata": 0,
              "data_type": "uint16",
              "bits_per_sample": 15,
              "spatial_resolution": 10,
              "scale": 0.0001,
              "offset": -0.1
            }
          ],
          "file:checksum": "12201fd4161e16dd3ead6f18f7ea884228c6b3769b07d11137a2c42a175588461a71",
          "file:size": 150650610,
          "roles": [
            "data",
            "reflectance"
          ]
        },

And here is the _jp2 version:

        "red_jp2": {
          "href": "s3://sentinel-s2-l2a/tiles/1/L/AC/2023/4/18/0/R10m/B04.jp2",
          "type": "image/jp2",
          "title": "Red (band 4) - 10m",
          "view:azimuth": 141.273664333548,
          "view:incidence_angle": 2.94395846985518,
          "eo:bands": [
            {
              "name": "red",
              "common_name": "red",
              "description": "Red (band 4)",
              "center_wavelength": 0.665,
              "full_width_half_max": 0.038
            }
          ],
          "gsd": 10,
          "proj:shape": [
            10980,
            10980
          ],
          "proj:transform": [
            10.0,
            0.0,
            99960.0,
            0.0,
            -10.0,
            8300020.0
          ],
          "raster:bands": [
            {
              "nodata": 0,
              "data_type": "uint16",
              "bits_per_sample": 15,
              "spatial_resolution": 10,
              "scale": 0.0001,
              "offset": -0.1
            }
          ],
          "roles": [
            "data",
            "reflectance"
          ]
        },

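Apart from the asset key itself, the only differences in the above are the href and type values, plus the file:checksum and file:size fields, which appear only on the COG asset.

The red_jp2 asset with no white space is 579 bytes, of which the differing href and type values account for only 67. One test item example was 22446 bytes with the _jp2 assets versus 13267 bytes without, a reduction of roughly 41%.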
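The comparison above can be reproduced with a few lines of Python. This is only a sketch: the two dicts are abbreviated copies of the assets shown above (most shared fields are left out), and `differing_keys` and `minified_size` are hypothetical helpers, not part of any STAC library.

```python
import json

# Abbreviated copies of the two assets above; most shared fields are omitted.
red = {
    "href": "/roda-sentinel2/workflow-sentinel-2-c1-l2a-to-stac/tiles-1-L-AC-2023-4-18-0/B04.tif",
    "type": "image/tiff; application=geotiff; profile=cloud-optimized",
    "title": "Red (band 4) - 10m",
    "gsd": 10,
    "file:checksum": "12201fd4161e16dd3ead6f18f7ea884228c6b3769b07d11137a2c42a175588461a71",
    "file:size": 150650610,
    "roles": ["data", "reflectance"],
}
red_jp2 = {
    "href": "s3://sentinel-s2-l2a/tiles/1/L/AC/2023/4/18/0/R10m/B04.jp2",
    "type": "image/jp2",
    "title": "Red (band 4) - 10m",
    "gsd": 10,
    "roles": ["data", "reflectance"],
}


def differing_keys(a: dict, b: dict) -> list[str]:
    """Return the sorted keys whose values differ or that exist in only one dict."""
    missing = object()  # sentinel so that a missing key never equals a real value
    return sorted(k for k in a.keys() | b.keys() if a.get(k, missing) != b.get(k, missing))


def minified_size(obj: dict) -> int:
    """Size in bytes of the JSON encoding with no optional white space."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


print(differing_keys(red, red_jp2))
print(minified_size(red_jp2))
```

Because the dicts are abbreviated, the printed size will not match the 579 bytes quoted for the full asset; the point is only how the comparison can be made.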

In the case of gzipped transfers or storage on disk, that same test item with the default level 6 gzip compression totaled 3057 bytes with the _jp2 assets and 2563 bytes without, a reduction of roughly 16%.
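For anyone who wants to repeat this kind of measurement on their own items, something like the following should work. It is a minimal sketch using only the standard library; the toy item is made up for illustration, the `_jp2` key suffix is taken from the assets above, and `without_jp2` and `sizes` are hypothetical helper names.

```python
import gzip
import json


def without_jp2(item: dict) -> dict:
    """Return a shallow copy of the item with every asset whose key ends in '_jp2' dropped."""
    assets = {k: v for k, v in item.get("assets", {}).items() if not k.endswith("_jp2")}
    return {**item, "assets": assets}


def sizes(item: dict, level: int = 6) -> tuple[int, int]:
    """Return (minified bytes, gzipped bytes) for the item's JSON encoding."""
    raw = json.dumps(item, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw, compresslevel=level))


# Toy item for illustration only; a real Sentinel-2 L2A item has many more assets and fields.
shared = {"title": "Red (band 4) - 10m", "gsd": 10, "roles": ["data", "reflectance"]}
item = {
    "type": "Feature",
    "id": "example",
    "assets": {
        "red": {
            "href": "B04.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            **shared,
        },
        "red_jp2": {"href": "B04.jp2", "type": "image/jp2", **shared},
    },
}

print("with _jp2:   ", sizes(item))
print("without _jp2:", sizes(without_jp2(item)))
```

Gzip removes much of the duplication on its own, which is why the compressed difference is proportionally smaller than the uncompressed one.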

Retaining the _jp2 assets therefore has a meaningful cost in item size, both uncompressed and gzipped.

That said, keeping the _jp2 assets does provide something of an audit trail that allows users to easily check our work. It is important that we don't make that too difficult.

To that end, can we find a compromise that provides a way to access the _jp2 assets relatively easily for those that want them without having to bloat the items so significantly?

Some ideas that have been proposed:

  1. add links to the metadata files in the sentinel-s2-l2a bucket alongside the jpeg2000 files (reltype via)
  2. keep the _jp2 assets, but remove all duplicate metadata (retains href and type, and potentially file info fields if we add any)
  3. add a links field to the assets, and add a link to each asset with reltype derived_from pointing to the input jpeg2000 href (this option deviates from core STAC, but could be encapsulated in an extension)
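To make options 2 and 3 a bit more concrete, here is a rough sketch of what each would do to an item's assets. Everything here is an assumption for illustration: the helper names are hypothetical, the pairing of `red_jp2` with `red` by key suffix is a guess at how assets would be matched, and the asset-level `links` field is not part of core STAC (it would need an extension, as noted in option 3).

```python
import copy

JP2_SUFFIX = "_jp2"
KEEP_FIELDS = ("href", "type")  # option 2 keeps only these on the _jp2 assets


def option_2(item: dict) -> dict:
    """Keep the _jp2 assets but strip everything except href and type."""
    out = copy.deepcopy(item)
    out["assets"] = {
        key: {f: asset[f] for f in KEEP_FIELDS if f in asset}
        if key.endswith(JP2_SUFFIX)
        else asset
        for key, asset in out["assets"].items()
    }
    return out


def option_3(item: dict) -> dict:
    """Replace each _jp2 asset with a derived_from link on its COG counterpart.

    The asset-level "links" field is hypothetical and not part of core STAC.
    """
    out = copy.deepcopy(item)
    assets = out["assets"]
    for key in [k for k in assets if k.endswith(JP2_SUFFIX)]:
        cog = assets.get(key[: -len(JP2_SUFFIX)])
        if cog is None:
            continue  # no COG counterpart: leave the _jp2 asset untouched
        jp2 = assets.pop(key)
        link = {"rel": "derived_from", "href": jp2["href"]}
        if "type" in jp2:
            link["type"] = jp2["type"]
        cog.setdefault("links", []).append(link)
    return out


# Toy item for illustration; the hrefs are shortened or copied from the example above.
item = {
    "id": "example",
    "assets": {
        "red": {
            "href": "B04.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "gsd": 10,
        },
        "red_jp2": {
            "href": "s3://sentinel-s2-l2a/tiles/1/L/AC/2023/4/18/0/R10m/B04.jp2",
            "type": "image/jp2",
            "gsd": 10,
        },
    },
}

print(option_2(item)["assets"]["red_jp2"])
print(option_3(item)["assets"]["red"]["links"])
```

Both functions return a modified copy and leave the input item unchanged.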

Thoughts? Is a change here a dealbreaker for anyone? Any of the proposed ideas stand out as better or worse? Any other ideas? Any points I've missed?

Any decision made for this issue will be documented in an ADR in this repo.

jkeifer commented 9 months ago

One argument for retaining the _jp2 assets (in some form) is that they can be downloaded easily, such as by running stac-asset download item.json, which will fetch all assets by default.

An argument against that point is that users would then, by default, get both the jpeg2000 and COG files, which are the same data in different formats. I believe it is more reasonable for an end user to get only one copy of the data by default, and for that copy to be in the COG format.

Thus, I would argue that this points to a need to remove the _jp2 assets. I would at this point lean towards the third option of adding a links field to the assets as a compromise between bloat and specificity.

gadomski commented 9 months ago

Would it be an abuse of https://github.com/stac-extensions/alternate-assets to use it here? E.g.:

"B01": {
  "href": "s3://bucket/path/B01.tif",
  "alternate": {
    "jp2k": {
      "href": "s3://bucket/path/B01.jp2"
    }
  }
}

philvarner commented 9 months ago

> Would it be an abuse of https://github.com/stac-extensions/alternate-assets to use it here?

I think it would be -- the intention with that is that it's the same asset (e.g., same bytes), just with a different URI, whereas in this case it's a different set of bytes representing the same data.

philvarner commented 9 months ago

I'm in favor of doing (1) and (3) and removing the jp2 links.

jp2 is not an inherently cloud-native format, so it doesn't really fit in with this ecosystem. Maybe there's some case where someone needs a jp2 for their existing workflow, but, sorry, you'll have to write some extra code to find and process those assets directly.

Currently, to get to the original granule_metadata path (without just stripping the filename from one of the jp2 asset hrefs), a user would have to follow the derived_from link to the sentinel-2-l1c item, and then the granule_metadata asset href.
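That two-step lookup could be sketched roughly as below. This is only an illustration under assumptions: `granule_metadata_href` is a hypothetical helper, `fetch_json` is a caller-supplied function (for example a thin wrapper around an HTTP client), and the example.com URLs are placeholders rather than real Earth Search paths.

```python
from typing import Callable, Optional


def granule_metadata_href(item: dict, fetch_json: Callable[[str], dict]) -> Optional[str]:
    """Follow the item's derived_from link and return the source item's granule_metadata href.

    fetch_json is a caller-supplied function (hypothetical here) that takes a URL
    and returns the parsed JSON document found there.
    """
    for link in item.get("links", []):
        if link.get("rel") == "derived_from":
            source_item = fetch_json(link["href"])
            asset = source_item.get("assets", {}).get("granule_metadata")
            if asset is not None:
                return asset["href"]
    return None


# Stand-in for real HTTP requests so the sketch runs offline; URLs are placeholders.
fake_documents = {
    "https://example.com/source-item.json": {
        "assets": {"granule_metadata": {"href": "https://example.com/metadata.xml"}}
    }
}
item = {"links": [{"rel": "derived_from", "href": "https://example.com/source-item.json"}]}

print(granule_metadata_href(item, fake_documents.__getitem__))
```

The extra request to fetch the source item is the cost that a direct link or asset on the L2A item would avoid.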

Regarding (1): I'm not sure via is the right reltype here, and links are usually only used for relationships among the STAC entities rather than to assets (even when the asset is metadata). We could also do one of these: (a) add an alternate-assets reference to the original metadata in the granule_metadata asset, (b) add a new asset with a link to the original granule metadata or the S3 path prefix of the scene (kinda bad because that's technically not a URL), or (c) add a property with the S3 path prefix of the scene.

Regarding (3): I think this aligns with the existing use of derived_from, and it would be easy to create an extension for it.

The derived_from documentation says "URL to a STAC Item that was used as input data in the creation of this Item.", which would align -- the jp2 was used as the input data to create the COG.

It also says: "Note regarding the type derived_from: A full provenance model is far beyond the scope of STAC, and the goal is to align with any good independent spec that comes along for that. But the derived_from field is seen as a way to encourage fuller specs and at least start a linking structure that can be used as a jumping off point for more experiments in provenance tracking"

I don't think we want to go any further here with provenance tracking, but derived_from seems sufficient.

philvarner commented 8 months ago

Decision made to remove jp2 links.