Closed kvantricht closed 8 months ago
@GriffinBabe, my initial review of test extractions:
[ ] • In general, we should maybe ask for feedback from Daniele and both Jeroens on the files too
[x] • “mask_scl_dilation” mask not present in file; we cannot generate it afterwards
[x] • Some more metadata in the file could be useful (such as extraction date, job launch user, title, description, a mention of GFMAP, resolution, start_date, end_date)
[x] • S2 dtype is Int16, why not UINT16? (That would require a different nodata value, which is now at -32767.)
[x] • What rule was used (if any) to drop observations that are certainly completely clouded?
[x] • First file I opened in QGIS has corrupt “-1” values around two borders. This value is not the no-data value so will be treated as valid which is problematic.
[x] • Maybe related to this, we end up now with a 65X65 file which could be a bit inconvenient. Probably we should have 64X64
[x] • SCL band is also INT16: could we in a post-job action change this to UINT8 to save storage or would that be too complex and not really worth it?
[x] • File name pattern should probably contain the resolution (and maybe patch size? Not sure).
Already an answer to a couple of your points; I'm running a new extraction with some changes before answering the others.
• What rule was used (if any) to drop observations that are certainly completely clouded?
No strategy is currently in place. I think an easy one would be to drop observations with a cloud percentage higher than, let's say, 95%? This can be done directly with the OpenEO load_collection feature.
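A minimal sketch of what that filter could look like with the openeo Python client, assuming the `max_cloud_cover` argument of `load_collection` is used. The helper name `load_s2_with_cloud_filter` and the collection id `SENTINEL2_L2A` are assumptions for illustration, not what GFMAP currently does. Note that this filters on product-level cloud cover metadata, so it says nothing about the cloud cover of an individual patch.

```python
def load_s2_with_cloud_filter(connection, spatial_extent, temporal_extent,
                              bands, max_cloud_cover=95):
    """Load Sentinel-2 L2A, skipping products above the cloud cover threshold.

    `connection` is expected to be an openeo.Connection (or anything exposing
    the same `load_collection` method). The helper name is hypothetical.
    """
    return connection.load_collection(
        "SENTINEL2_L2A",  # assumed collection id; it can differ per backend
        spatial_extent=spatial_extent,
        temporal_extent=temporal_extent,
        bands=bands,
        # Product-level threshold in percent; products above it are dropped.
        max_cloud_cover=max_cloud_cover,
    )
```

Keeping the threshold as a function argument with a default of 95 would also make it easy to expose as a configurable option later.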
• SCL band is also INT16: could we in a post-job action change this to UINT8 to save storage or would that be too complex and not really worth it?
Probably not worth it, because you would save only half the space of one of the 13 bands. Also, it's nice for later processing to have everything aligned on the same resolution, since SCL is going to be used for pixelwise operations on the optical data anyway.
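To put a number on that, here is a rough back-of-the-envelope calculation. It assumes all 13 bands are stored as Int16 and ignores compression, so the real saving on disk will differ.

```python
# Assumption: 13 bands, all stored as Int16 (2 bytes per pixel), no compression.
N_BANDS = 13
BYTES_INT16 = 2
BYTES_UINT8 = 1

bytes_per_pixel_now = N_BANDS * BYTES_INT16
# Same layout, but with the single SCL band stored as UInt8 instead.
bytes_per_pixel_after = (N_BANDS - 1) * BYTES_INT16 + BYTES_UINT8

relative_saving = 1 - bytes_per_pixel_after / bytes_per_pixel_now
print(f"{bytes_per_pixel_now} -> {bytes_per_pixel_after} bytes per pixel per date, "
      f"saving {relative_saving:.1%}")
```

Under those assumptions the saving is 1 byte out of 26, i.e. a bit under 4%.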
• File name pattern should probably contain the resolution (and maybe patch size? Not sure).
I don't think it's necessary as all the tiles should be consistent within an extraction process
• Where does the attribute “landcover_label” come from?
It's something from the input dataset, but I can remove this because it's almost the same value for each sample
• Is it normal that “confidence” attribute is None?
It's also extracted from the input dataset, and it's None already from there for all samples
• Are we sure UINT16 covers the entire possible range of values?
Could you point me to the list of the harmonized labels? This is decided by the user, but in the context of WorldCereal it should match the maximum value of the harmonized labels. Alternatively, we can also set it to int64, which accepts negative values and covers a very large range; the file stays small since it is only a 2D array (1 band, no time).
• There is no no-data value specified in the file. Which one do you use in rasterization process? Should be encoded as such in the netcdf so it’s handled properly downstream.
Yes indeed, I will add this as _FillValue, as that seems to be the convention with NetCDF files (this source and my past experience).
Another question: how do you want the rasterization? With the all_touched parameter set to True or to False?
https://rasterio.readthedocs.io/en/stable/api/rasterio.features.html#rasterio.features.rasterize
No strategy is currently in place. I think an easy one would be to drop observations with a cloud percentage higher than, let's say, 95%? This can be done directly with the OpenEO load_collection feature.
As a minimum, indeed good to have this added. We could make it configurable in GFMAP?
I don't think it's necessary as all the tiles should be consistent within an extraction process
Not sure about this. I think we would prefer to extract S1 at 20m and meteo at an even much lower resolution. This saves a lot of storage. Downstream OpenEO-based processing will merge the cubes in the end.
It's something from the input dataset, but I can remove this because it's almost the same value for each sample
Should not always be the same. 11 means cropland, but there will be other datasets (not the one I sent) which have other landcovers as well. But the crucial thing here is that this attribute only belongs to the center point/field we used and it doesn't necessarily fit the rest of the pixels in the rasterized patch. So I would omit this attribute as it will be confusing.
It's also extracted from the input dataset, and it's None already from there for all samples
Interesting. I guess once we use the API, this shouldn't be None. So let's worry about it later.
Could you point me to the list of the harmonized labels? This is decided by the user, but in the context of WorldCereal it should match the maximum value of the harmonized labels. Alternatively, we can also set it to int64, which accepts negative values and covers a very large range; the file stays small since it is only a 2D array (1 band, no time).
The possible values are in the first column of this file. Note that this is a string for readability and we should strip all - signs to get to the integer. I think int64 is indeed what is normally used for the new legend.
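The conversion described here could be sketched as follows. The function name `legend_code_to_int` and the example code string are made up for illustration (the real values are in the first column of the legend file), and the range check only confirms that the result fits in a signed 64-bit integer.

```python
UINT16_MAX = 2**16 - 1
INT64_MAX = 2**63 - 1


def legend_code_to_int(code: str) -> int:
    """Strip the '-' separators from a legend code and return it as an integer."""
    value = int(code.replace("-", ""))
    if value > INT64_MAX:
        raise ValueError(f"{code!r} does not fit in int64")
    return value


# Hypothetical code string, only meant to show the dash-separated format.
example = "11-01-00-000-0"
value = legend_code_to_int(example)
print(value, "fits in uint16:", value <= UINT16_MAX)
```

A ten-digit code like the example above is far beyond the UINT16 range, which is another reason to store the rasterized labels as int64.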
Another question: how do you want the rasterization? With the all_touched parameter set to True or to False?
As discussed, all_touched should be False @GriffinBabe
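For reference, a small plain-Python illustration of what the two settings mean for a patch. This is a simplified model and does not call rasterio: it only handles an axis-aligned rectangle on a grid of unit pixels, and it treats all_touched=False as "burn a pixel only if its center falls inside the shape", which is an approximation of the rule described in the rasterio documentation. The function name `burn_rectangle` is made up for this sketch.

```python
def burn_rectangle(xmin, ymin, xmax, ymax, size, all_touched):
    """Burn an axis-aligned rectangle into a size x size grid of unit pixels.

    Pixel (row, col) covers x in [col, col + 1] and y in [row, row + 1].
    """
    grid = []
    for row in range(size):
        line = []
        for col in range(size):
            if all_touched:
                # Any overlap between the pixel and the rectangle counts.
                hit = (col < xmax and col + 1 > xmin
                       and row < ymax and row + 1 > ymin)
            else:
                # Only pixels whose center lies inside the rectangle count.
                hit = (xmin <= col + 0.5 <= xmax
                       and ymin <= row + 0.5 <= ymax)
            line.append(1 if hit else 0)
        grid.append(line)
    return grid


strict = burn_rectangle(0.6, 0.6, 2.4, 2.4, size=4, all_touched=False)
touched = burn_rectangle(0.6, 0.6, 2.4, 2.4, size=4, all_touched=True)
print(sum(map(sum, strict)), sum(map(sum, touched)))  # prints "1 9"
```

In this toy case the stricter setting labels a single pixel while the permissive one labels nine, so all_touched=False keeps field labels from bleeding into neighbouring pixels that are only partly covered.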
Issue reported: https://github.com/Open-EO/openeo-geopyspark-driver/issues/712
Except for the SCL problem, all is solved here. I created a ticket in WorldCereal technical and in the openeo-geopyspark-driver: https://github.com/Open-EO/openeo-geopyspark-driver/issues/715
Using demo dataset from #40 and writing documentation along the way in #42 .