SeaBee-no / documentation

Repo for all SeaBee documentation
https://seabee-no.github.io/documentation/

Standardise `NoData` handling and value scaling for rasters published to GeoNode #30

Open JamesSample opened 1 year ago

JamesSample commented 1 year ago

Our handling of NoData values in SeaBee raster imagery is inconsistent and should be standardised, if possible.

So far, I have seen the following examples of SeaBee orthomosaics:

Bit depth

For the ML, we have so far converted everything to 8-bit. This loses some information from the MS data, but I don't think it's a big issue for now and the benefits are (i) substantial reductions in file size, and (ii) better compatibility with the RGB data.

For now, I propose that we continue to convert everything to 8-bit before publishing on GeoNode. At some point, it might be worth evaluating whether we gain any improvements in ML performance using 32-bit MS data, but I don't think this is high priority in the short term.
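As a rough illustration of the 8-bit conversion, here is a minimal sketch of a linear rescale with NumPy. The function name `to_uint8` and its parameters (`lo`, `hi`, `out_max`) are hypothetical and chosen only for this example. It assumes the input band contains only finite values, and it is not necessarily how the existing SeaBee conversion is done.

```python
import numpy as np


def to_uint8(band, lo=None, hi=None, out_max=255):
    """Linearly rescale a band to the range 0..out_max and return it as uint8.

    lo/hi default to the band's own min/max (an assumption; fixed limits
    per sensor may be preferable). Values outside lo..hi are clipped.
    """
    band = np.asarray(band, dtype=np.float64)
    lo = float(band.min()) if lo is None else float(lo)
    hi = float(band.max()) if hi is None else float(hi)
    if hi <= lo:
        # Constant band: nothing to stretch, so return zeros
        return np.zeros(band.shape, dtype=np.uint8)
    scaled = np.clip((band - lo) / (hi - lo), 0.0, 1.0)
    return np.round(scaled * out_max).astype(np.uint8)
```

Calling `to_uint8(band, out_max=254)` would leave the value 255 unused, which could be convenient if 255 is later needed as a NoData flag.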

Handling NoData

I believe we have three options:

  1. Preserve the alpha channel as it is
  2. Discard the alpha channel and explicitly set a NoData value
  3. Convert the alpha channel to a "mask"

I don't like option 1 because datasets with alpha channels have some limitations compared to those without. For example, GeoServer cannot handle JPEG-compressed RGBA GeoTIFFs. Perhaps we don't need this feature right now, but it's nice to have the possibility later, if needed.

Option 2 is what I have done so far. This is simple and in general works well, but it needs careful handling for "edge cases". In particular, the 8-bit RGBA mosaics often use the full range of values (0 to 255) for data. This leads to issues if we arbitrarily assign e.g. (255, 255, 255) as NoData without first scaling the values. For example, seabirds with very white plumage may end up classified as NoData, which is obviously not good (see below)!

[Image: example of white seabird plumage classified as NoData]

A simple solution is to change all genuine data values of 255 to 254 first, and then use 255 for NoData. This is an artificial adjustment, but in practice it will probably make no discernible difference to the ML.

Note: Handling the RGBA images from Spectrofly is similar, except in the past we have used 0 as the NoData value instead of 255. The same issues apply though: genuinely black areas will be wrongly converted to NoData, unless they are artificially adjusted to e.g. (1, 1, 1) first.
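The adjustment for option 2 could be sketched roughly as follows with NumPy. The function name `rgba_to_rgb_with_nodata` is hypothetical. The sketch assumes an 8-bit array of shape (4, rows, cols) with alpha as the last band, and it treats only fully transparent pixels (alpha equal to 0) as NoData. It shifts every band value equal to the NoData value, not just pure white or pure black pixels.

```python
import numpy as np


def rgba_to_rgb_with_nodata(rgba, nodata=255):
    """Drop the alpha band and encode transparent pixels as NoData.

    rgba   : uint8 array, shape (4, rows, cols), alpha last (assumption).
    nodata : 255 or 0. Genuine data values equal to it are shifted by
             one unit (255 -> 254, or 0 -> 1) before NoData is written.
    """
    rgba = np.asarray(rgba)
    if rgba.dtype != np.uint8 or rgba.ndim != 3 or rgba.shape[0] != 4:
        raise ValueError("expected a uint8 array of shape (4, rows, cols)")
    if nodata not in (0, 255):
        raise ValueError("nodata must be 0 or 255")

    rgb = rgba[:3].copy()
    transparent = rgba[3] == 0  # assumption: alpha 0 means NoData

    # Free up the NoData value by nudging genuine data by one unit
    rgb[rgb == nodata] = 254 if nodata == 255 else 1

    # Flag transparent pixels in all three bands
    rgb[:, transparent] = nodata
    return rgb
```

The result would then be written as a 3-band GeoTIFF with the NoData value declared in the file metadata.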

Option 3 is worth investigating. Masks behave very similarly to alpha channels, except I think they have more comprehensive support for geoprocessing tasks. It is also easy to convert an existing alpha channel to a mask using e.g.

gdal_translate in.tif out.tif -b 1 -b 2 -b 3 -mask 4 --config GDAL_TIFF_INTERNAL_MASK YES

This results in a genuine 3-band GeoTIFF where the NoData mask is stored internally. I need to read the docs and experiment a little to find out how this works in detail, but it could be a good option.

@jarlehr - Do you/NR have any strong preference for how we define NoData in the raster orthomosaics we provide for the ML? So far, I have always used option 2 for the data I send to NR, but happy to consider alternatives.

@awigeon @knl88 @jarlehr - If we decide to stick with option 2, are you happy with the idea of manually adjusting genuine pure white or black areas by one unit, in order to free up a NoData value (i.e. [0, 0, 0] => [1, 1, 1] or [255, 255, 255] => [254, 254, 254])? I can't see this making any difference to the ML. Do you agree?

@jlgarrett - What is your approach with the HSI? Are the bands 8-bit, or 32-bit or something else? And how do you currently represent NoData?

Thanks!

jarlehr commented 1 year ago

Hi,

We use the rasterio Python library to read the GeoTIFF files. It provides a dictionary of metadata in which we find the nodata value, if option 2 is used. I think it is possible to read mask bands according to option 3; however, we have not implemented this. Hence, for us it is better to (continue to) use option 2 with a nodata value set in the metadata.

As for the adjustment of values from 0->1 or 255->254, we don't think it will have any notable impact on the performance of the ML algorithms.