feature: Using an xarray coordinate dimension as string for `station_id`, timestamps for `time`

jmp75 commented 2 weeks ago

Is your feature request related to a problem? Please describe.

Loading an STF netcdf file from disk via xarray, as is and without further processing, leads to the following in memory dataset:

These days, such a dataset would arguably be designed using strings wherever possible for coordinates. In particular slicing through data to retrieve data for a given station can be done with the statement rain.sel( station_id = "123456A") .

Trying to do a similar selection on the data set as loaded, rain.sel(station_id=123456) would lead to KeyError: "'station_id' is not a valid dimension or coordinate

A similar observation can be done for the time dimension needing to use date/time representation, perhaps even more so than for station_id

Note also that having coordinates that are starting at one (i.e. Fortran indexing) rather than zero (C/Python) is not inconsequential in practice; this is notoriously fertile ground for "off by one" bugs when slicing data for use in modelling.

Describe the solution you'd like

The In-memory representation of STF via xarray has the following coordinates

"time" (pandas.Timestamp or np.datetime64)
station_id (str)
ens_member (int) (note: probably can only be int in python anyway, likely 64 bits. int32 retained on-disk)
lead_time (int)

Additional context

Possible downsides:

What is the performance downside, if any noticeable, when converting the in-memory data for writing to disk with "to_netcdf"
This would likely make using xarray "lazy-loading" capabilities not possible anymore. Does it really matter "these days"?

Other considerations:

"RAM windowed" time series for very large data not fitting in RAM. xarray backed by netcdf anyway does not offer native capabilities for this so far as I know. One would need to use zarr as an xarray backend, or perhaps other options (time series DBs?)

jbcsiro commented 2 weeks ago

Yeah we could move strings in station_name for sure. The reason we didn't at the time was lack of compatibility with the Fortran netCDF library. For this reason (and for compatibility with older scripts, many of which are still in use), it might be better to move to a new variable name: perhaps station_str or similar. We also should change 'ens_member' to 'realization', for better CI compatibility. Probably there are a bunch of other things that should also be revisited.

derob commented 2 weeks ago

In memory, time should be represented in a time format, for our forecasting purposes I think the default np.datetime64 for xarray is OK, but it isn’t always OK if you need to deal with records that are >~400 years.
Station_id probably should be a string. There are so many AWRC codes with letters in them that ints don’t make sense.
Agree with James on converting ens_member to realization - could couple this to redefining it to be zero based integer
In memory it would be great to have lead_time as a timedelta reading the associated time units (hours, days etc). Then it would be possible to add it to the time variable to get the valid time of the relevant data.

It would be ideal to preserve lazyloading if possible for the main data variables - time, lead_time, realization, station don't necessarily use large amounds of memory on their own but the main data variables can.

jbcsiro commented 2 weeks ago

@derob I don't know about making station_id a string. Isn't swift partly predicated on using station_id as integers to construct node-link networks? If anything, station_name would be a good candidate to convert to a string (as was originally intended, but couldn't be done due to Fortran issues), but as I said you may run into backwards compatibility issues.

derob commented 2 weeks ago

We tend to use integers for station_id but they are referenced as strings in my understanding... The onboarding python example uses a single subarea model with the of subarea.Subarea

jmp75 commented 2 weeks ago

Thank you for your inputs, much appreciated. To be clear, I was not suggesting any change to the STF netCDF convention, only the "python, in memory xarray". Writing a STF-2.0 compatible netCDF file to disk is a non-negotiable feature.

@derob indeed swift2 nodes, links and subareas are all identified with strings, which can also cater for a legacy integer representation of course.

fre171csiro commented 2 weeks ago

I have been defining station_id as strings, as @derob said some (SILO and BoM) curated stations have chars as part of their unique id. I have also been using pint_xarray to ensure units are applied to dataarrays.

jbcsiro commented 2 weeks ago

So loading station_id as strings as @jmp75 has suggested is no problem; saving them as strings in the .nc files, which I think is what Andrew is doing, implies a change in spec and might cause issues with backwards compatibility (at least if we keep those variable names). Not necessarily saying I'm against it; this might be the preferable way to go with v3.0 (or whatever) of the spec.

jmp75 commented 2 weeks ago

@jbcsiro we are very much on the same wavelength regarding not breaking backward compat STF v2.0 and including across languages. As an aside a v3.0 specs may be also within the scope of the WMO group @derob mentioned and is a member of?

jbcsiro commented 2 weeks ago

Very happy to improve the spec for both usebility and CF compatibility. Unfortunately a lot of those conventions are dominated by the meteorological community, who tend to favour spatial coverage over time when saving data (so you tend to get large spatial coverage in files with only a single forecast). Keeping many forecasts in one file is very beneficial for us though: both for storage and ease of verification. Suspect it ultimately may be a losing fight.

derob commented 2 weeks ago

@jbcsiro one of the really nice things about xarray, which we should probably think about when developing the package is the ability to automagically read multiple files on disk into a single in-memory object... There are issues, of course, with managing large numbers of small files, but there is probably a balance and maybe single forecast files are ok in the future from an analysis perspective.

fre171csiro commented 2 weeks ago

That is an interesting topic, when to separate into individual files and went to not separate and persist all associated data variables into one file. Moving one large file about has challenges, but keeping order with separate files is also a difficult. At the moment I feel safer and happy to take the hit with one large file stored in a central network drive.

derob commented 2 weeks ago

That is an interesting topic, when to separate into individual files and went to not separate and persist all associated data variables into one file. Moving one large file about has challenges, but keeping order with separate files is also a difficult. At the moment I feel safer and happy to take the hit with one large file stored in a central network drive.

There are pros and cons of large and small files. I generally prefer large files, but sometimes it is more efficient computationally to generate many small files. There can be limits on file numbers on some storages which means that many small files aren't a good idea. zarr 'files' are actually just a folder of many small files that chunk data in a particular way, so there needs to be some careful thought about the most appropriate chunking on disk for zarr.

fre171csiro commented 2 weeks ago

That is an interesting topic, when to separate into individual files and went to not separate and persist all associated data variables into one file. Moving one large file about has challenges, but keeping order with separate files is also a difficult. At the moment I feel safer and happy to take the hit with one large file stored in a central network drive.

There are pros and cons of large and small files. I generally prefer large files, but sometimes it is more efficient computationally to generate many small files. There can be limits on file numbers on some storages which means that many small files aren't a good idea. zarr 'files' are actually just a folder of many small files that chunk data in a particular way, so there needs to be some careful thought about the most appropriate chunking on disk for zarr.

From ---> https://docs.dask.org/en/stable/array-chunks.html

Chunks should align with your storage, if applicable.
Array data formats are often chunked as well. When loading or saving data, if is useful to have Dask array chunks that are aligned with the chunking of your storage, often an even multiple times larger in each direction

So I don't think you need to have many files in the same dimension as in memory chunks, however it might be better/advantageous. So if the file number becomes limiting it could be reduced by increasing the file size but keeping in memory the same.

and https://docs.dask.org/en/stable/array-chunks.html#loading-chunked-data

jbcsiro commented 2 weeks ago

Hydrological forecasts are often produced at the catchment scale, so saving individual forecasts produces too many files relative to file size IMO. When it comes time to move or archive those forecasts this is very annoying, especially when you can just save e.g. a month's worth of forecasts in a single file (assuming you have the spec to do it) and make that stuff a lot easier. This is generally not the case with meteorological forecasts, esp for global models.

csiro-hydroinformatics / efts-io