Add read/write to concat-compatible xarray format

TUW-GEO / ascat

Read and visualize data from the Advanced Scatterometer (ASCAT) on-board the series of Metop satellites

MIT License

This adds methods to CRANcFile and IRANcFile that read netCDF time series files in either Indexed Ragged Array or Contiguous Ragged Array format into a compatible Indexed Ragged Array format, which is easy to concatenate with other time series from the same grid cell. It also adds a writer that writes these compatible arrays to netCDF time series files in Contiguous Ragged Array format.

This is achieved by filling out the coordinates and variables in the locations dimension with data for all location_ids in the relevant cell, sorting that dimension by location_id, and then updating the locationIndex for all observations to align with the new ordering. In this way, a given locationIndex refers to the same location_id for any time series produced in a given cell, and they can be concatenated safely.
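The re-indexing step described above can be sketched with plain NumPy arrays. This is only an illustration, not the code in this PR: align_to_cell and its arguments are hypothetical names, and the sketch covers only location_id and locationIndex. The other per-location coordinates and variables (which would be filled for locations missing from a file) are left out.

```python
import numpy as np

def align_to_cell(location_id, location_index, cell_location_ids):
    """Re-express locationIndex against the sorted list of all cell location_ids.

    Hypothetical helper for illustration only.
    """
    location_id = np.asarray(location_id)
    location_index = np.asarray(location_index)
    # Shared ordering: every location_id in the cell, sorted.
    full_ids = np.sort(np.asarray(cell_location_ids))
    if not np.isin(location_id, full_ids).all():
        raise ValueError("file contains a location_id outside the cell")
    # Position of each of this file's locations within the shared ordering.
    new_position = np.searchsorted(full_ids, location_id)
    # Observations pointed at file-local positions; point them at shared ones.
    new_index = new_position[location_index]
    return full_ids, new_index

cell_ids = [10, 20, 30, 40]
# File A only knows locations 30 and 10, file B knows 40, 20 and 10.
ids_a, index_a = align_to_cell([30, 10], [0, 0, 1], cell_ids)
ids_b, index_b = align_to_cell([40, 20, 10], [2, 1, 0, 0], cell_ids)
# Both files now share one locations dimension, so the indices concatenate safely.
merged_index = np.concatenate([index_a, index_b])
```

After alignment, ids_a and ids_b are identical, so merged_index can be used directly against either of them.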

When writing to Contiguous Ragged Array format, observations are ordered by locationIndex and by time, a row_size variable is calculated from the locationIndex, locationIndex is dropped, and appropriate attributes and encoding are set.
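A rough sketch of the ordering and row_size calculation, again with plain NumPy arrays rather than the xarray datasets this PR works on. The function names are hypothetical, and attribute and encoding handling is not shown.

```python
import numpy as np

def indexed_to_contiguous(location_index, time, n_locations):
    """Return the observation order and row_size for a contiguous ragged array.

    Hypothetical helper for illustration only.
    """
    location_index = np.asarray(location_index)
    time = np.asarray(time)
    # Sort by time first, then stably by locationIndex, so observations end up
    # grouped by location and ordered by time within each location.
    order = np.argsort(time, kind="stable")
    order = order[np.argsort(location_index[order], kind="stable")]
    # Number of observations per location, including locations with none.
    row_size = np.bincount(location_index, minlength=n_locations)
    return order, row_size

def contiguous_to_indexed(row_size):
    """Rebuild locationIndex from row_size (the inverse direction)."""
    row_size = np.asarray(row_size)
    return np.repeat(np.arange(len(row_size)), row_size)

location_index = np.array([2, 0, 2, 1, 0])
time = np.array([5, 3, 1, 4, 2])  # any sortable values, e.g. datetime64
order, row_size = indexed_to_contiguous(location_index, time, n_locations=4)
rebuilt_index = contiguous_to_indexed(row_size)
```

Every observation variable would be reordered with order before locationIndex is dropped. rebuilt_index shows that the dropped locationIndex can be recovered from row_size alone.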

Some questions that still remain:

Should the readers live in CRANcFile and IRANcFile objects, or is there a better place, perhaps in GridCellFiles?
Is it ok for compatible_writer to live by itself as a function? I didn't want to make it an object method, since it operates on the concatenation result of several objects' data. I also don't like that I have some helper functions and dictionaries just hanging out in there; I might fold those into compatible_writer.
Still need to sort out some attributes/encoding/NaN questions

Restructured ragged_array_ts.py to use functions for merging indexed/contiguous datasets together from source files, rather than using class methods on CRANcFile and IRANcFile to create a mergeable "compatible" format file.

merge_netCDFs() takes a list of filenames/paths as an argument, merges the files along the observation dimension, and returns the merged dataset, deduplicated and sorted by time. It returns a contiguous ragged array by default; set out_format to "indexed" to get an indexed ragged array instead. The time window for detecting duplicate values, dupe_window, defaults to None, in which case the function uses np.timedelta64(10, "m").
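To illustrate what a dupe_window could mean in practice, here is one possible reading, sketched on plain NumPy arrays. It is an assumption, not the logic in merge_netCDFs(): an observation is treated as a duplicate if it follows another observation at the same location by less than dupe_window. The function name is hypothetical, and the sat_id ordering used by the real function is left out.

```python
import numpy as np

def drop_near_duplicates(time, location_index, dupe_window=None):
    """Return indices of observations to keep, sorted by time.

    Hypothetical sketch: the duplicate rule here is an assumption.
    """
    if dupe_window is None:
        # Same default-on-None pattern as described for merge_netCDFs().
        dupe_window = np.timedelta64(10, "m")
    time = np.asarray(time)
    location_index = np.asarray(location_index)
    # Group observations by location, ordered by time within each location.
    order = np.argsort(time, kind="stable")
    order = order[np.argsort(location_index[order], kind="stable")]
    t, loc = time[order], location_index[order]
    keep = np.ones(len(order), dtype=bool)
    same_location = loc[1:] == loc[:-1]
    too_close = (t[1:] - t[:-1]) < dupe_window
    # Drop an observation that closely follows another at the same location.
    keep[1:] = ~(same_location & too_close)
    kept = order[keep]
    # Hand back the surviving observations sorted by time.
    return kept[np.argsort(time[kept], kind="stable")]

time = np.array(
    ["2020-01-01T00:00", "2020-01-01T00:05", "2020-01-01T02:00", "2020-01-01T00:03"],
    dtype="datetime64[m]",
)
location_index = np.array([0, 0, 0, 1])
kept = drop_near_duplicates(time, location_index)
```

Here the 00:05 observation at location 0 is dropped because it falls within 10 minutes of the 00:00 one, while the 00:03 observation survives because it belongs to a different location.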

There are helper functions for converting contiguous to indexed ragged arrays and vice versa, plus set_attributes() to set reasonable output attributes on the dataset that merge_netCDFs() returns, and create_encoding() to create a reasonable encoding dictionary to pass to .to_netcdf() when writing the dataset to file. Both set_attributes() and create_encoding() take a user-created attribute/encoding dictionary as an argument, and values passed by the user override the defaults for any given key.
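The override behaviour can be pictured as a simple dictionary merge. The sketch below is an assumption about the pattern, not the actual set_attributes() or create_encoding() code; the helper name and the default values are made up for illustration.

```python
def with_overrides(defaults, custom=None):
    """Return defaults with any user-supplied keys replacing them.

    Hypothetical helper for illustration only.
    """
    merged = dict(defaults)      # copy so the defaults are never mutated
    merged.update(custom or {})  # user values win for any key they provide
    return merged

# Made-up defaults, keyed by variable name as .to_netcdf(encoding=...) expects.
default_encoding = {
    "row_size": {"dtype": "int64"},
    "lon": {"dtype": "float32"},
}
user_encoding = {"row_size": {"dtype": "int32"}}
encoding = with_overrides(default_encoding, user_encoding)
```

Note that this merge is shallow: a user entry for a variable replaces that variable's whole default entry rather than being combined with it key by key.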

Things to do yet:

Go over the default attributes/encoding to make sure they're what we want.
Could be made more general - right now it assumes the names of many variables on the dataset, as well as the presence of a "sat_id" variable when sorting before deduplicating. (I could probably remove the deduplicator and leave it up to the user to do that after merging.) The package cf-xarray might be useful for this.
Generally clean things up a bit
More things that I may edit in here when I think of them, but I just want to get this comment posted

Add read/write to concat-compatible xarray format #54