kvantricht commented 9 months ago

Using the demo dataset from #40 and writing documentation along the way in #42.

kvantricht commented 9 months ago

@GriffinBabe, my initial review of test extractions:

S2 file “S2_at2021lpis809720_32633_2020-08-30_2022-03-03”

[ ] • In general, we should maybe ask for feedback from Daniele and both Jeroens on the files too
[x] • “mask_scl_dilation” mask not present in file; we cannot generate it afterwards
[x] • Some more metadata in the file could be useful (such as extraction date, job launch user, title, description, a mention of GFMAP, resolution, start_date, end_date)
[x] • S2 dtype is INT16; why not UINT16? (That would require a different nodata value, which is currently -32767.)
[x] • What rule was used (if any) to drop observations that are certainly completely clouded?
[x] • The first file I opened in QGIS has corrupt “-1” values along two borders. This value is not the no-data value, so it will be treated as valid, which is problematic.
[x] • Possibly related to this, we now end up with a 65X65 file, which could be a bit inconvenient. We should probably have 64X64.
[x] • SCL band is also INT16: could we in a post-job action change this to UINT8 to save storage or would that be too complex and not really worth it?
[x] • File name pattern should probably contain the resolution (and maybe patch size? Not sure).
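
To illustrate why the “-1” border values are a problem, here is a minimal sketch in plain Python. The pixel values in `row` are made up for illustration; only the nodata value of -32767 comes from the review above. It also checks that -32767 cannot be represented in UINT16, which is why switching dtype would require a different nodata value.

```python
NODATA = -32767  # nodata value reported for the S2 file in the review above


def valid_mask(values, nodata=NODATA):
    """Return True for every value a nodata-aware reader would treat as valid."""
    return [v != nodata for v in values]


# Hypothetical row of pixel values: one real nodata pixel, one stray -1
# border pixel, and two plausible reflectance values.
row = [NODATA, -1, 412, 530]
mask = valid_mask(row)
print(mask)  # [False, True, True, True]: the -1 pixel passes as valid

# uint16 spans 0..65535, so the current nodata value would not fit.
UINT16_MIN, UINT16_MAX = 0, 2**16 - 1
fits_uint16 = UINT16_MIN <= NODATA <= UINT16_MAX
print(fits_uint16)  # False
```

Any downstream statistic computed over `mask` would include the -1 pixel unless it is masked explicitly.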

AUX file “AUX_at2021lpis809720_32633_2020-08-30_2022-03-03”

[x] • Should we replace “AUX” by “WORLDCEREAL”?
[x] • File is wrongly georeferenced. Needs new check once this is fixed.
[x] • The variable is now called “CROPTYPE”, but it doesn’t always contain crop type information; it can also hold just landcover. Maybe “LABEL”?
[x] • Same comment about the 65X65 dimensions
[x] • Same comment about some metadata that could be useful
[x] • Where does the attribute “landcover_label” come from?
[x] • Is it normal that “confidence” attribute is None?
[x] • Are we sure UINT16 covers the entire possible range of values?
[x] • There is no no-data value specified in the file. Which one do you use in the rasterization process? It should be encoded as such in the NetCDF so it’s handled properly downstream.
[x] • File name pattern should probably contain the resolution (and maybe patch size? Not sure).
[x] • Why is attribute “institution” still there and pointing to openEO platform Geotrellis backend? I think no OpenEO backend is involved in this file creation?

GriffinBabe commented 9 months ago

Here are answers to a couple of your points already; I'm running a new extraction with some changes before answering the others.

• What rule was used (if any) to drop observations that are certainly completely clouded?

No strategy is currently in place. I think an easy one would be to drop observations with a cloud percentage higher than, let's say, 95%? This can be done directly with the OpenEO load_collection feature.

• SCL band is also INT16: could we in a post-job action change this to UINT8 to save storage or would that be too complex and not really worth it?

Probably not worth it, because you would save only half the space of one of the 13 bands. It's also convenient for later processing to have everything aligned on the same resolution, since SCL is going to be used for pixelwise operations on the optical data anyway.
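
As a rough check of that estimate, the arithmetic can be sketched as follows. It assumes 13 bands in total with SCL as one of them, as stated in the reply, and it ignores compression, which would change the real on-disk numbers.

```python
N_BANDS = 13     # band count quoted in the reply above, SCL included
BYTES_INT16 = 2
BYTES_UINT8 = 1


def bytes_per_pixel(n_bands, scl_bytes):
    """Uncompressed bytes per pixel and time step, with SCL stored in scl_bytes."""
    return (n_bands - 1) * BYTES_INT16 + scl_bytes


before = bytes_per_pixel(N_BANDS, BYTES_INT16)  # 26 bytes
after = bytes_per_pixel(N_BANDS, BYTES_UINT8)   # 25 bytes
saving = (before - after) / before
print(f"{saving:.1%}")  # 3.8%
```

Under these assumptions the gain is one byte out of 26 per pixel, just under 4%.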

• File name pattern should probably contain the resolution (and maybe patch size? Not sure).

I don't think it's necessary as all the tiles should be consistent within an extraction process

• Where does the attribute “landcover_label” come from?

It's something from the input dataset, but I can remove this because it's almost the same value for each sample

• Is it normal that “confidence” attribute is None?

It's also extracted from the input dataset, and it's None already from there for all samples

• Are we sure UINT16 covers the entire possible range of values?

Could you point me to the list of harmonized labels? The dtype is decided by the user, but in the context of WorldCereal it should accommodate the maximum value of the harmonized labels. Alternatively, we can switch to int64, which accepts negative values and covers a very large range; the file stays small since it is only a 2D array (1 band, no time).

• There is no no-data value specified in the file. Which one do you use in rasterization process? Should be encoded as such in the netcdf so it’s handled properly downstream.

Yes indeed, I will add this as _FillValue, as that seems to be the convention for NetCDF files (according to this source and in my past experience).

GriffinBabe commented 9 months ago

Another question: how do you want the rasterization? With the all_touched parameter set to True or to False? https://rasterio.readthedocs.io/en/stable/api/rasterio.features.html#rasterio.features.rasterize

kvantricht commented 9 months ago

No strategy is currently in place. I think an easy one would be to drop observations with a cloud percentage higher than, let's say, 95%? This can be done directly with the OpenEO load_collection feature.

As a minimum, indeed good to have this added. We could make it configurable in GFMAP?
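
A minimal sketch of what a configurable threshold could look like, in plain Python. This is not the OpenEO or GFMAP API: `drop_cloudy`, `max_cloud_cover` and the scene metadata are hypothetical names and values, used only to show the rule "drop observations with cloud cover above the threshold".

```python
def drop_cloudy(observations, max_cloud_cover=95.0):
    """Keep observations whose scene-level cloud cover does not exceed the threshold."""
    return [obs for obs in observations if obs["cloud_cover"] <= max_cloud_cover]


# Hypothetical scene metadata; dates and percentages are made up.
scenes = [
    {"date": "2020-09-04", "cloud_cover": 12.0},
    {"date": "2020-09-09", "cloud_cover": 99.3},
    {"date": "2020-09-14", "cloud_cover": 95.0},
]

kept = drop_cloudy(scenes)                        # default threshold of 95%
strict = drop_cloudy(scenes, max_cloud_cover=10)  # user-configured threshold
print([obs["date"] for obs in kept])  # ['2020-09-04', '2020-09-14']
```

Exposing the threshold as a parameter with a permissive default keeps the current behaviour almost unchanged while letting users tighten it.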

I don't think it's necessary as all the tiles should be consistent within an extraction process

Not sure about this, I think we would prefer to extract S1 at 20m and meteo even at much lower resolution. This saves a lot of storage. Downstream OpenEO-based processing will finally merge the cubes.
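
To put a number on the storage argument, here is a small sketch of how the pixel count of a patch changes with resolution. The 640 m patch extent is an assumption (64 pixels at 10 m); the 10 m and 20 m resolutions for S2 and S1 follow the comment above.

```python
def patch_shape(extent_m, resolution_m):
    """Number of pixels along each side of a square patch."""
    if extent_m % resolution_m:
        raise ValueError("extent must be a multiple of the resolution")
    n = extent_m // resolution_m
    return n, n


EXTENT_M = 640  # assumption: a 64X64 patch at 10 m resolution

for name, res in [("S2", 10), ("S1", 20)]:
    rows, cols = patch_shape(EXTENT_M, res)
    print(name, f"{res} m", f"{rows}X{cols}", rows * cols)
```

Halving the resolution divides the pixel count by four, which is why the resolution is no longer implied by the extraction alone once collections are stored at different resolutions.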

It's something from the input dataset, but I can remove this because it's almost the same value for each sample

Should not always be the same. 11 means cropland, but there will be other datasets (not the one I sent) which have other landcovers as well. But the crucial thing here is that this attribute only belongs to the center point/field we used and it doesn't necessarily fit the rest of the pixels in the rasterized patch. So I would omit this attribute as it will be confusing.

It's also extracted from the input dataset, and it's None already from there for all samples

Interesting. I guess once we use the API, this shouldn't be None. So let's worry about it later.

Could you point me to the list of harmonized labels? The dtype is decided by the user, but in the context of WorldCereal it should accommodate the maximum value of the harmonized labels. Alternatively, we can switch to int64, which accepts negative values and covers a very large range; the file stays small since it is only a 2D array (1 band, no time).

The possible values are in the first column of this file. Note that this is a string for readability and we should strip all - signs to get to the integer. I think int64 is indeed what is normally used for the new legend.
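
The conversion described here can be sketched as below. The code string `"11-01-01-000-0"` is a hypothetical example in a plausible format, not a value taken from the legend file; the point is only that stripping the `-` signs can give an integer far beyond the UINT16 range while still fitting comfortably in int64.

```python
INT64_MAX = 2**63 - 1
UINT16_MAX = 2**16 - 1


def label_to_int(code):
    """Strip the '-' separators from a legend code and parse it as an integer."""
    return int(code.replace("-", ""))


example = "11-01-01-000-0"  # hypothetical code, not taken from the legend file
value = label_to_int(example)

print(value)                # 1101010000
print(value <= UINT16_MAX)  # False: too large for UINT16
print(value <= INT64_MAX)   # True: fits in int64
```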

Another question: how do you want the rasterization? With the all_touched parameter set to True or to False?

As discussed, all_touched should be False
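
For readers unfamiliar with the option, the difference can be illustrated with a simplified pure-Python sketch. This is not rasterio's algorithm: it handles only an axis-aligned rectangle on a grid of unit pixels, and `rasterize_rect` and the `field` coordinates are hypothetical. With `all_touched=False` a pixel is burned only when its center falls inside the geometry; with `all_touched=True` every pixel the geometry overlaps is burned.

```python
def rasterize_rect(rect, shape, all_touched=False):
    """Burn an axis-aligned rectangle (xmin, ymin, xmax, ymax) into a 0/1 grid.

    Pixel (row, col) covers the square [col, col + 1] x [row, row + 1].
    """
    xmin, ymin, xmax, ymax = rect
    rows, cols = shape
    grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if all_touched:
                # Burn when the pixel square overlaps the rectangle at all.
                hit = c < xmax and c + 1 > xmin and r < ymax and r + 1 > ymin
            else:
                # Burn only when the pixel center lies inside the rectangle.
                cx, cy = c + 0.5, r + 0.5
                hit = xmin <= cx <= xmax and ymin <= cy <= ymax
            grid[r][c] = 1 if hit else 0
    return grid


field = (0.6, 0.6, 2.4, 2.4)  # hypothetical field outline in pixel coordinates
center_only = rasterize_rect(field, (3, 3), all_touched=False)
touched = rasterize_rect(field, (3, 3), all_touched=True)

print(sum(map(sum, center_only)))  # 1
print(sum(map(sum, touched)))      # 9
```

With `all_touched=False`, labels are assigned only to pixels that lie mostly inside a field, which avoids labelling border pixels that mix several land covers.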

kvantricht commented 8 months ago

@GriffinBabe

[x] Todo: file a bug report for OpenEO on the -1 border issue with the extractions

GriffinBabe commented 8 months ago

Issue reported: https://github.com/Open-EO/openeo-geopyspark-driver/issues/712

GriffinBabe commented 8 months ago

Except for the SCL problem, everything is solved here. I created a ticket in WorldCereal technical and in the openeo-geopyspark-driver repository: https://github.com/Open-EO/openeo-geopyspark-driver/issues/715

Open-EO / openeo-gfmap

Run first demo WorldCereal-based extraction use case #43

45 closes it