Closed claytharrison closed 7 months ago
Restructured ragged_array_ts.py to use functions for merging indexed/contiguous datasets together from source files, rather than using class methods on CRANcFile and IRANcFile to create a mergeable "compatible" format file.
merge_netCDFs() takes a list of filenames/paths as an argument, does its magic to merge them along the observation dimension, and returns the merged dataset, which has been deduplicated and sorted by time. It returns a contiguous array by default, but you can optionally set out_format to "indexed" to override this. The time window for detecting duplicate values, dupe_window, is None by default and set to np.timedelta64(10, "m") within the function if None.
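As a rough illustration of the deduplication step, here is a minimal NumPy-only sketch. It is not the code in this PR: the function name dedupe_sorted, its arguments, and the rule used for what counts as a duplicate (same location, within dupe_window of the last kept observation) are all assumptions made for the example. Only the None default and the 10-minute fallback come from the description above.

```python
import numpy as np


def dedupe_sorted(time, location_id, dupe_window=None):
    """Return indices of observations to keep, ordered by time.

    Hypothetical helper: an observation is dropped when it shares a
    location with an already-kept observation and lies less than
    dupe_window after it. This duplicate rule is an assumption.
    """
    # Same fallback as described above: None means a 10-minute window.
    if dupe_window is None:
        dupe_window = np.timedelta64(10, "m")

    time = np.asarray(time)
    location_id = np.asarray(location_id)

    # Visit observations in time order (stable, so ties keep input order).
    order = np.argsort(time, kind="stable")

    last_kept = {}  # location_id -> time of the last kept observation
    keep = []
    for i in order:
        loc = int(location_id[i])
        t = time[i]
        if loc in last_kept and t - last_kept[loc] < dupe_window:
            continue  # too close to the previous observation: duplicate
        last_kept[loc] = t
        keep.append(int(i))
    return np.array(keep, dtype=np.int64)


time = np.array(
    ["2020-01-01T00:00", "2020-01-01T00:05", "2020-01-01T00:30", "2019-12-31T23:00"],
    dtype="datetime64[m]",
)
location_id = np.array([1, 1, 1, 2])
kept = dedupe_sorted(time, location_id)
```

The returned indices can be used to select from every variable on the observation dimension, which yields data that is both deduplicated and sorted by time.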
There are helper functions for converting contiguous to indexed ragged arrays and vice-versa, set_attributes() to help set reasonable output attributes on the dataset which merge_netCDFs returns, and create_encoding() to create a reasonable encoding dictionary to pass to .to_netcdf() when writing the dataset to file. Both of those will take a user-created attribute/encoding dictionary as an argument, and override the default values for any given key with those passed by the user.
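For the conversion helpers, the core index arithmetic can be sketched as below. This is a simplified NumPy-only version under assumptions: the function names indexed_to_contiguous and contiguous_to_indexed are hypothetical, and the actual helpers in ragged_array_ts.py operate on datasets rather than bare arrays.

```python
import numpy as np


def indexed_to_contiguous(location_index, time, n_locations):
    """Return (order, row_size) for an indexed ragged array.

    order sorts observations by location index, then by time.
    row_size[i] is the number of observations at location i.
    """
    location_index = np.asarray(location_index)
    time = np.asarray(time)
    # Two stable sorts: time first, then location index, so that time
    # order is preserved within each location.
    order = np.argsort(time, kind="stable")
    order = order[np.argsort(location_index[order], kind="stable")]
    row_size = np.bincount(location_index, minlength=n_locations)
    return order, row_size


def contiguous_to_indexed(row_size):
    """Expand row_size back into a per-observation location index."""
    row_size = np.asarray(row_size)
    return np.repeat(np.arange(row_size.size), row_size)


location_index = np.array([2, 0, 2, 1, 0])
time = np.array([5, 4, 3, 2, 1])
order, row_size = indexed_to_contiguous(location_index, time, n_locations=3)
round_trip = contiguous_to_indexed(row_size)
```

Applying order to each observation variable gives the contiguous layout, and expanding row_size recovers the location index of the reordered observations.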
Things to do yet:
This adds methods to CRANcFile and IRANcFile to read in netcdf time series files from Indexed Ragged Array or Contiguous Ragged Array format into a compatible Indexed Ragged Array format that is easy to concatenate to other time series from the same grid cell, and a writer that writes these compatible arrays to netcdf time series in Contiguous Ragged Array format.
This is achieved by filling out the coordinates and variables in the locations dimension with data for all location_ids in the relevant cell, sorting that dimension by location_id, and then updating the locationIndex for all observations to align with the new ordering. In this way, a given locationIndex refers to the same location_id for any time series produced in a given cell, and they can be concatenated safely.
When writing to Contiguous Ragged Array format, observations are ordered by locationIndex and by time, a row_size variable is calculated from the locationIndex, locationIndex is dropped, and appropriate attributes and encoding are set.
Some questions that still remain:
compatible_writer to live by itself as a function? I didn't want to put it as an object method since it operates on the concatenation result of several objects' data. I also don't like that I have some helper functions and dictionaries just hanging out in there, I might fold those into the compatible_writer