SAEON / somisana

SOMISANA-related tooling
MIT License
6 stars 2 forks source link

[Algoa Bay Forecast] NetCDF structure, conventions, etc. #13

Open zachsa opened 2 years ago

zachsa commented 2 years ago

Hi @mattcarr03.

I'm working with the processed model output using GDAL / raster2pgsql (actually I think this also might be GDAL, wrapped by PostGIS). This work relates to #12.

Running the raster2pgsql CLI I get a warning:

Warning 1: No UNIDATA NC_GLOBAL:Conventions attribute

Do you know what it would take to write the processed NetCDF file in a way that is compliant with CF conventions (climate and forecast). And is this worth doing?

I also ran this past @Marc-Pienaar to see if he has any insight

This is the command

raster2pgsql \
  -d \
  -q \
  -I \
  -t auto \
  -R \
  NETCDF:"/path/to/hourly-avg-20220705-normalized.nc":temperature \
  public.test_raster_input \
  | psql \
    postgres://admin:password@localhost:5432/somisana_local
zachsa commented 2 years ago

This is the printout of the header using xarray

<xarray.Dataset>
Dimensions:      (time: 240, depth: 20, lat: 106, lon: 152)
Coordinates:
    lon_rho      (lat, lon) float32 ...
    lat_rho      (lat, lon) float32 ...
  * depth        (depth) float32 -0.975 -0.925 -0.875 ... -0.125 -0.075 -0.025
  * time         (time) datetime64[ns] 2022-06-30T01:00:00 ... 2022-07-10
Dimensions without coordinates: lat, lon
Data variables:
    temperature  (time, depth, lat, lon) float32 ...
    salt         (time, depth, lat, lon) float32 ...
    u            (time, depth, lat, lon) float64 ...
    v            (time, depth, lat, lon) float64 ...
    m_rho        (time, depth, lat, lon) float64 ...
Attributes:
    description:  CROCO output from algoa Bay model transformed lon/lat/depth...

This is the printout using gdalinfo

zach@ZACH-SP4:~/code/saeon$ gdalinfo /home/zach/somisana-data/hourly-avg-20220705-normalized.nc
Driver: netCDF/Network Common Data Format
Files: /home/zach/somisana-data/hourly-avg-20220705-normalized.nc
Size is 512, 512
Metadata:
  NC_GLOBAL#description=CROCO output from algoa Bay model transformed lon/lat/depth/time
Subdatasets:
  SUBDATASET_1_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":temperature
  SUBDATASET_1_DESC=[240x20x106x152] temperature (32-bit floating-point)
  SUBDATASET_2_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":salt
  SUBDATASET_2_DESC=[240x20x106x152] salt (32-bit floating-point)
  SUBDATASET_3_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":u
  SUBDATASET_3_DESC=[240x20x106x152] u (64-bit floating-point)
  SUBDATASET_4_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":v
  SUBDATASET_4_DESC=[240x20x106x152] v (64-bit floating-point)
  SUBDATASET_5_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":m_rho
  SUBDATASET_5_DESC=[240x20x106x152] m_rho (64-bit floating-point)
  SUBDATASET_6_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":lon_rho
  SUBDATASET_6_DESC=[106x152] lon_rho (32-bit floating-point)
  SUBDATASET_7_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":lat_rho
  SUBDATASET_7_DESC=[106x152] lat_rho (32-bit floating-point)
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  512.0)
Upper Right (  512.0,    0.0)
Lower Right (  512.0,  512.0)
Center      (  256.0,  256.0)
zachsa commented 2 years ago

At a guess, it looks like the srid may not be correct (at least I can't see what it is).

Thanks @Marc-Pienaar - https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html

Marc-Pienaar commented 2 years ago

Hi Zach,

I suspect the geospatial info needed by gdal is missing, hence the warning (projection, lat, lon coordinates etc), Or possibly in the wrong format, etc. Might need an intermediate step in your workflow to explicitly assign these if you are going to use gdal to do operations and/or want to make the data spatially explicit. For instance is lat_rho / lon_rho index values from the array 1…N or geographic coordinates?

Regards, Marc On 06 Jul 2022, 13:48 +0200, Zach Smith @.***>, wrote:

This is the printout of the header using xarray

Dimensions: (time: 240, depth: 20, lat: 106, lon: 152) Coordinates: lon_rho (lat, lon) float32 ... lat_rho (lat, lon) float32 ... * depth (depth) float32 -0.975 -0.925 -0.875 ... -0.125 -0.075 -0.025 * time (time) datetime64[ns] 2022-06-30T01:00:00 ... 2022-07-10 Dimensions without coordinates: lat, lon Data variables: temperature (time, depth, lat, lon) float32 ... salt (time, depth, lat, lon) float32 ... u (time, depth, lat, lon) float64 ... v (time, depth, lat, lon) float64 ... m_rho (time, depth, lat, lon) float64 ... Attributes: description: CROCO output from algoa Bay model transformed lon/lat/depth... This is the printout using gdalinfo ***@***.***:~/code/saeon$ gdalinfo /home/zach/somisana-data/hourly-avg-20220705-normalized.nc Driver: netCDF/Network Common Data Format Files: /home/zach/somisana-data/hourly-avg-20220705-normalized.nc Size is 512, 512 Metadata: NC_GLOBAL#description=CROCO output from algoa Bay model transformed lon/lat/depth/time Subdatasets: SUBDATASET_1_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":temperature SUBDATASET_1_DESC=[240x20x106x152] temperature (32-bit floating-point) SUBDATASET_2_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":salt SUBDATASET_2_DESC=[240x20x106x152] salt (32-bit floating-point) SUBDATASET_3_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":u SUBDATASET_3_DESC=[240x20x106x152] u (64-bit floating-point) SUBDATASET_4_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":v SUBDATASET_4_DESC=[240x20x106x152] v (64-bit floating-point) SUBDATASET_5_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":m_rho SUBDATASET_5_DESC=[240x20x106x152] m_rho (64-bit floating-point) SUBDATASET_6_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":lon_rho SUBDATASET_6_DESC=[106x152] lon_rho (32-bit floating-point) SUBDATASET_7_NAME=NETCDF:"/home/zach/somisana-data/hourly-avg-20220705-normalized.nc":lat_rho SUBDATASET_7_DESC=[106x152] lat_rho (32-bit floating-point) Corner Coordinates: Upper Left ( 0.0, 0.0) Lower Left ( 0.0, 512.0) Upper Right ( 512.0, 0.0) Lower Right ( 512.0, 512.0) Center ( 256.0, 256.0) — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were assigned.Message ID: ***@***.***>
mattcarr03 commented 2 years ago

Hi Zach,

I have just been having I look now, and I think the problem is that the model output in on a curvilinear grid, therefore the lon and lat coordinates are 2D rather than 1D. Therefore, the dimensions are 1D grid points 1 to 152 and the 2D coordinates are saved as variables, this sounds like it wouldn't work looking at @Marc-Pienaar comment. Sorry, I wasn't sure if gdal could handle this.

We can re-grid the data to a regular grid using python, in the post-processing step. That way we can have 1D lon and lat saved as dimensions. Which might help with the error, I imagine working with a regular grid makes things a lot easier.

I will push the changes shortly

Matt

zachsa commented 2 years ago

Hi @mattcarr03 - thanks.

What do you mean by a 2D vs 1D grid? I'm not sure I follow - I see longitude / latitude roughly as 'x' an 'y' coordinates on a Cartesian plane (i.e. a 4326 project). My understanding is that any map image that I see that is in the 4326 SRID has to be projected to a flat surface, but that this is done behind the scenes and not something I have to think about as far as I know.

If I'm following correctly... I understand that the CF conventions require separate dimensions for lng/lat, whereas the current lng/lat is a single 2D dimension?

I am still fumbling my way through the GDAL tool (actually, I'm using the raster2pgsql tool that uses GDAL under the hood). This tool registers rasters in a postgres table and then allows me to run GDAL commands on the raster file via PostgreSQL's SQL interface. I'm getting extremely odd information back from the GDAL commands - such as GDAL telling me there are 4 800 bands in the processed data (and 800 bands in the model output). Obviously this doesn't make much sense... (and I guess could be related to a single dimension with 2D data if I understand the above correctly). I haven't tested your changes yet - will do so soon, but I'd still like to understand.

I was imagining a single 'band' as one possible N D raster - i.e. I thought that a single variable in the NetCDF file would correspond to a single band in a raster file. Is this correct/not correct? (@Marc-Pienaar - do you have any insight here? from what I have read the word 'band' doesn't seem to have a fixed meaning across different types of rasters).

zachsa commented 2 years ago

... Yes @mattcarr03. It's quite challenging to provision an environment where xesmf will be available at deployment time. So far I can see that:

xesmf => depends on esmpy => depends on esmf with special requirements for a netcdf install
zachsa commented 2 years ago

https://gdal.org/drivers/raster/netcdf.html

zachsa commented 2 years ago

Actually, the large number of bands does make sense (i.e. when I load the raster into PostGIS via raster2pgsql I see 4800 bands)

For the temperature variable, there are:

That means there 4 800 'slices'. PostGIS (i.e. GDAL) is seeing 4 800 x/y grids. Each grid is 152 x 106 pixels, and each pixel is a temperature value. So I am actually seeing the data in PostGIS.

@mattcarr03, I did look at the tool for regridding the data in the python step. Since it's difficult to install I'm still trying other things.

... it never actually occurred to me to ask why the image that you showed me didn't have straight edges. But @Marc-Pienaar showed me an illustration of GPS coordinates of curvilinear grid... and it does look a lot like the output of temperature at one depth that you previously showed me.

If we regrid, wouldn't that image then be changed?

figure

zachsa commented 2 years ago

@mattcarr03. also, could you send me an example of regridded file?

mattcarr03 commented 2 years ago

Hi @zachsa, sorry for the delayed replies, I'm sending one over email shortly and will respond to the other comments.

mattcarr03 commented 2 years ago

Sorry if the previous comments weren't clear.

I was trying to illustrated that the data is not set on a regular/square grid. Instead the data is rotated and curved. Therefore, the data points each have a unique lon and lat points, as opposed to sharing longitude points across a vertical slice (or latitude points across a horizontal slice).

This means that the lon and lat co-ordinates need to be represented on a 2D matrix, rather than a 1D array. However, the netcdf convection requires the dimensions to be a single 1D array. Therefore, the output netcdf uses the number of grid points as the x and y instead of real world coordinates. The coordinates are saved as "coordinates" which I understand to be an extra class similar to variables.

By regridding to a regular grid (constant longitude for a vertical slice/constant latitude for a horizontal slice) the dimensions of the netcdf can be a 1D array of real world coordinates. This imagine is much easier to convert into a raster.

I checked the regridding and the image in not changed, there is lots of empty data/whitespace around the image to maintain the curve and rotation.

Additionally, I have managed to regrid using the scipy package (https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp2d.html), however this requires looping through time and depth so it is not nearly as eloquent as the xesmf solution.

zachsa commented 2 years ago

image