Element84 / earth-search

Earth Search information and issue tracking
https://earth-search.aws.element84.com/v1

Options for handling offsets in Sentinel-2 COGs #23

Closed matthewhanson closed 8 months ago

matthewhanson commented 9 months ago

E84 is looking for feedback on how to properly handle the offsets in the Sentinel-2 JP2 data when it is converted to COGs. This applies to all new Sentinel-2 data with a processing baseline of 04.00 or higher.

This processing baseline introduced an offset that must be applied to the data and is explained here. For the Sentinel-2 COGs produced since Jan 2022, we currently apply the offset to the data. The offset is specified in the metadata, but so far it has always been -0.1. The scale is 0.0001.

reflectance = scale*DN + offset
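As a minimal sketch of the formula above, using the scale and offset values quoted in the post (the function name and sample DN values are illustrative, not taken from the Earth Search code):

```python
import numpy as np

SCALE = 0.0001  # scale quoted in the post
OFFSET = -0.1   # offset quoted in the post (reflectance units)

def dn_to_reflectance(dn, scale=SCALE, offset=OFFSET):
    """Convert raw digital numbers (DN) to reflectance: scale * DN + offset."""
    # Cast to float first so the arithmetic is not done in uint16.
    return scale * np.asarray(dn, dtype=np.float64) + offset

# Illustrative DN values, not taken from a real scene.
dn = np.array([1000, 1500, 11000], dtype=np.uint16)
refl = dn_to_reflectance(dn)  # approximately [0.0, 0.05, 1.0]
```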

Going forward there are 4 options, ordered from worst to best, in my opinion.

1 - Leave the data as is

The COGs keep the same values as the JP2 files, leaving the user responsible for applying the offset. A serious issue with this is that when calculating simple band indices the scale of 0.0001 cancels out, e.g.,

NDVI = (nir - red)/(nir + red)

This is how a great number of notebooks and code out in the wild do this. If the offset is not applied to the data, the correct NDVI calculation on the raw values becomes:

NDVI = (nir - red)/(nir + red - 2000)

which would break a large number of implementations. I would not recommend this.
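To make the size of the effect concrete, here is a small sketch comparing the naive NDVI on raw values with the offset-corrected one. It assumes the offset of -0.1 corresponds to 1000 DN at a scale of 0.0001; the pixel values are made up for illustration.

```python
import numpy as np

DN_OFFSET = 1000  # 0.1 reflectance / 0.0001 scale, expressed in DN

def ndvi(nir, red):
    """Plain NDVI as commonly written in notebooks."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red)

# Made-up raw DN values for one pixel (offset not yet applied).
nir_dn = np.array([4000.0])
red_dn = np.array([2000.0])

naive = ndvi(nir_dn, red_dn)                              # ignores the offset
corrected = ndvi(nir_dn - DN_OFFSET, red_dn - DN_OFFSET)  # offset applied first
by_formula = (nir_dn - red_dn) / (nir_dn + red_dn - 2 * DN_OFFSET)
```

For this pixel the naive result is about 0.33 while the corrected result is 0.5, and the corrected result matches the formula with the 2000 term in the denominator.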

2 - Apply offset as per the S2 Technical Guide

The recommended way of applying the offset is provided by ESA; however, this has serious consequences.

With this method, any pixels 0 or less after applying the offset become equivalent to NODATA. While this seems reasonable at first because the pixel is invalid, the absence of collecting any data is not the same as measuring the surface and getting no meaningful signal. These are dark regions where the atmosphere has been overcompensated for. If they become 0 that tells the user there is no data rather than it being a dark object. This causes issues when visualizing as these dark regions would become transparent. For analysis this can cause gaps where there should not be. I would not recommend this approach.
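A rough sketch of the behaviour described above, written from the description in this post rather than from ESA's guide: the offset is subtracted in DN space and anything at or below 0 ends up as 0, which is also the nodata value. The function name and the 1000 DN offset are assumptions for illustration.

```python
import numpy as np

DN_OFFSET = 1000  # assumed: -0.1 reflectance at a scale of 0.0001
NODATA = 0

def apply_offset_option2(dn):
    """Subtract the offset; values <= 0 collapse to 0 (same as nodata)."""
    # Work in int32 so the subtraction cannot wrap around in uint16.
    shifted = dn.astype(np.int32) - DN_OFFSET
    return np.clip(shifted, 0, np.iinfo(np.uint16).max).astype(np.uint16)

# First pixel is true nodata; the next two are dark but measured pixels.
dn = np.array([0, 900, 1000, 1500], dtype=np.uint16)
out = apply_offset_option2(dn)  # dark pixels become indistinguishable from nodata
```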

3 - Apply offset, keep nodata=0

The current method used for the Sentinel-2 COGs maintains NODATA=0, but any values <= 0 after applying the offset are set to a value of 1, which corresponds to a reflectance of 0.0001. The original nodata value locations are preserved. This indicates that data was there, but with a small value well below the noise level of the reflectance measurement.

This method would have virtually zero impact on any analysis of the data, however it is a change in the values, as is setting all negative data values to 1. It has the benefit of using the very common nodata value of 0. I find this to be an acceptable change.
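The method described above might look roughly like the following. This is a sketch based on the description in this post, not the actual conversion code; the function name and the 1000 DN offset are assumptions.

```python
import numpy as np

DN_OFFSET = 1000  # assumed: -0.1 reflectance at a scale of 0.0001
NODATA = 0

def apply_offset_option3(dn):
    """Subtract the offset, keep nodata=0, and raise valid values <= 0 to 1."""
    valid = dn != NODATA
    # Work in int32 so the subtraction cannot wrap around in uint16.
    shifted = dn.astype(np.int32) - DN_OFFSET
    out = np.where(valid, np.maximum(shifted, 1), NODATA)
    return out.astype(np.uint16)

# First pixel is true nodata; the next two are dark but measured pixels.
dn = np.array([0, 900, 1000, 1500], dtype=np.uint16)
out = apply_offset_option3(dn)  # nodata stays 0, dark pixels become 1
```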

4 - Apply offset, set nodata=65535

An alternative to the above is to change the nodata value to 65535, the max value for uint16. This has no chance of conflicting with any valid data values, which should not be higher than 11000 (reflectance ranges from 0-1.0). Then, when the offset is applied, the result is clamped to 0 rather than 1.
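A sketch of this variant, again written from the description rather than from any existing implementation (the function name and the 1000 DN offset are assumptions):

```python
import numpy as np

DN_OFFSET = 1000     # assumed: -0.1 reflectance at a scale of 0.0001
NODATA_IN = 0        # nodata value in the source data
NODATA_OUT = 65535   # proposed nodata value, the uint16 maximum

def apply_offset_option4(dn):
    """Subtract the offset, clamp valid values at 0, move nodata to 65535."""
    valid = dn != NODATA_IN
    # Work in int32 so the subtraction cannot wrap around in uint16.
    shifted = dn.astype(np.int32) - DN_OFFSET
    out = np.where(valid, np.maximum(shifted, 0), NODATA_OUT)
    return out.astype(np.uint16)

# First pixel is true nodata; the next two are dark but measured pixels.
dn = np.array([0, 900, 1000, 1500], dtype=np.uint16)
out = apply_offset_option4(dn)  # dark pixels become 0 and stay distinct from nodata
```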

piyushrpt commented 9 months ago

I would like to propose an Option 5 that may be consistent with some of the other missions.

Option 5:

Change the datatype from UInt16 to Int16, apply the offset, set nodata to 32767 or -32768, and keep the scale at 0.0001. This way negative values will also be retained and users would not have to worry about applying the offset themselves.
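A minimal sketch of Option 5, assuming -32768 is the nodata value chosen and that the offset is 1000 DN. The function name is hypothetical, and the upper clip is a safety measure added here, not part of the proposal.

```python
import numpy as np

DN_OFFSET = 1000     # assumed: -0.1 reflectance at a scale of 0.0001
NODATA_IN = 0        # nodata value in the source uint16 data
NODATA_OUT = -32768  # assumed choice between the two values proposed

def apply_offset_option5(dn):
    """Subtract the offset into int16, retaining negative values."""
    valid = dn != NODATA_IN
    shifted = dn.astype(np.int32) - DN_OFFSET
    # Assumption: clip at the int16 maximum so extreme DN cannot overflow.
    shifted = np.minimum(shifted, np.iinfo(np.int16).max)
    return np.where(valid, shifted, NODATA_OUT).astype(np.int16)

# First pixel is true nodata; the second is a dark pixel below the offset.
dn = np.array([0, 900, 1000, 1500], dtype=np.uint16)
out = apply_offset_option5(dn)  # the dark pixel keeps its negative value
```

Since the output already has the offset applied, reflectance is then simply 0.0001 times the stored value.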

If this is not possible, my preference would be option 1:

a. By modifying the source data range, we may be precluding uses / applications for which the changes to scaling were introduced in the first place.

b. Scale/Offset is a concept that has a long history and has been widely used in Landsat missions as well. Just enforcing Scale/Offset to be always present enables bringing Sentinel-2 into the common framework that spans multiple missions like Landsat, MODIS, commercial providers etc. Even if we eliminate the problem for Sentinel-2, users have to account for these in other missions.

c. I would prefer an open dataset to be a true copy (in terms of imagery - with no loss of range) of the original dataset.

ircwaves commented 9 months ago

I forgot to hit enter, and in the meantime, @piyushrpt put something very close to my thought: to change as little data as possible.

Option 4 is the closest to this, and supports the "notebooks in the wild don't need revision" criterion. But I'm biased towards Option 5 (with either signed or unsigned datatype).

gadomski commented 9 months ago

c. I would prefer an open dataset to be a true copy (in terms of imagery - with no loss of range ) of the original dataset.

I agree with this. My opinion is that data "re-freers" ("re-sellers" but we don't sell it) shouldn't try to "fix" data. If it comes with warts, we document and help people with the warts, but we don't remove them.

Corollary is that correcting actual errors/blunders is fine. This isn't an error, it's just a design choice that is a little awkward.

matthewhanson commented 9 months ago

Thanks @piyushrpt, I like that option a lot.

I disagree a bit with the idea that we shouldn't try to fix the data. In my opinion the entire goal of this exercise is to make access and use of Sentinel-2 data easier, because ESA has made decisions for the format and distribution of the data that have put up barriers to that. I think adding the offset falls under "making the data easier to use", especially with @piyushrpt's Option 5 which maintains all of the original values.

jkeifer commented 9 months ago

Looking into an unrelated matter, I observed that the tile I was processing had negative values (the red pixels) occurring only along the edge of the data:

[Image: areas of negative values]

Granted, this is simply a single data point and I can't make any determination as to whether the findings here would be representative of all pixels in this state. That said, I do think it is worth probing the initial assumption that the pixels with negative reflectance values after atmospheric correction are valid data--in this case it looks like they are in fact erroneous values and probably best classified as NODATA.

piyushrpt commented 9 months ago

There may be a number of things happening here. I would also look into the values in the SCL band for these pixels. In our experience, using values from the SCL classification is almost always needed. If you provide more information regarding the tile you are looking at, I can run some tests and confirm.

jkeifer commented 9 months ago

@piyushrpt The tile ID is S2A_OPER_MSI_L2A_TL_2APS_20230101T025803_A039309_T55GFL_N05.09 and the source tileInfo.json is at s3://sentinel-s2-l2a/tiles/55/G/FL/2022/12/31/0/tileInfo.json. Hopefully that gives you what you need to look into it, but if you need any other info let me know.

piyushrpt commented 9 months ago

At least in this case, all the SCL pixels are correctly labeled as open water, and low reflectance observations over open water are not strange. But there could be other occasions, particularly on land, where this could be an issue - like you said, this is just one data point.

matthewhanson commented 9 months ago

After thinking on this more and talking with more folks, I'm now leaning more toward Option 1 - leave the data as is. The reason for this is 2-fold:

1 - The original reason for applying the offset was so that time series could still be used across the Jan 25, 2022 offset change without having to adjust the data depending on processing baseline version. We are planning on reprocessing all the S2 data with the latest version, therefore there will no longer be a change in values.

2 - The use of scale/offset is common practice, and while GDAL/rasterio do not automatically apply scale/offset, higher level tooling often does (e.g., QGIS, TiTiler). Users should always check and apply scale/offset in the data and we should not be encouraging bad practices. Setting the offset to 0 does not really make things easier, contrary to what I suggested above, because users should be applying scale/offset properly.

Note that going forward, regardless of what we end up doing, the scale/offset will be properly set both in the STAC metadata and in the COG files themselves.
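As an illustration of what "check and apply scale/offset" could look like on the user side, here is a sketch that reads the values from band metadata instead of hard-coding them. It assumes the STAC metadata exposes them through the raster extension's `raster:bands` field; the `asset` dictionary below is hand-written for the example and is not copied from an actual Earth Search item.

```python
import numpy as np

# Hypothetical asset metadata, hand-written for illustration only.
asset = {"raster:bands": [{"nodata": 0, "scale": 0.0001, "offset": -0.1}]}

def to_reflectance(dn, band_meta):
    """Apply scale/offset from band metadata and mask nodata as NaN."""
    scale = band_meta.get("scale", 1.0)    # default: no scaling
    offset = band_meta.get("offset", 0.0)  # default: no offset
    nodata = band_meta.get("nodata")
    out = scale * dn.astype(np.float64) + offset
    if nodata is not None:
        out[dn == nodata] = np.nan
    return out

# Made-up DN values; the first pixel is nodata.
dn = np.array([0, 1000, 6000], dtype=np.uint16)
refl = to_reflectance(dn, asset["raster:bands"][0])
```

Reading the values from metadata means the same code keeps working across processing baselines, whether or not an offset is present.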

A final decision has not been made yet; we are still looking for more user feedback.

sdtaylor commented 9 months ago

Agreed with option 1 and leaving the data as is. If you are concerned about breaking a bunch of examples, that is not unfounded. This Sentinel-2 collection specifically tends to be the go-to open source data used in many examples and tutorials. But this is not a huge change to make. As others have said, scales/offsets are common and should be in tutorials anyway to introduce the concept.

tonykgill commented 9 months ago

We, at cibolabs.com.au, are comfortable applying the gain and offset ourselves. So, the proposed option 1 is fine.

If I understand what you are proposing, then users will have to:

For what it's worth, it will be a big improvement on the current, confusing, situation where we:

Other thoughts:

Finally, thank you. The Sentinel-2 COG collection is awesome. We appreciate your efforts in developing and maintaining it and considering our feedback.

matthewhanson commented 9 months ago

Thanks for the input @sdtaylor and @tonykgill .

@tonykgill correct, the gain and offset would be in the STAC Item and will also be set in the header metadata in the COGs themselves.

philvarner commented 8 months ago

For the preview dataset, we have decided not to apply scale or offset.