TUDelftGeodesy / stmtools

Xarray extension for Space-Time Matrix
https://tudelftgeodesy.github.io/stmtools/
Apache License 2.0

Controlling output chunk size and type #44

Closed. fnattino closed this issue 10 months ago.

fnattino commented 10 months ago

I am loading the test CSV file full-pixel_psi_amsterdam_tsx_asc_t116_v4_ampl_std_H_c16643.csv.part (246 MB) with the from_csv function and writing it out as a Zarr store.

When I load the STM data from the CSV file, I get the following dataset:

import stmtools

stm = stmtools.from_csv(CSV_STM_PATH, output_chunksize={'space': 25_000, 'time': -1})
stm = stm.persist()
stm
<xarray.Dataset>
Dimensions:                (space: 50000, time: 198)
Coordinates:
  * space                  (space) int64 0 1 2 3 4 ... 49996 49997 49998 49999
  * time                   (time) int64 0 1 2 3 4 5 ... 192 193 194 195 196 197
    lat                    (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    lon                    (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
Data variables: (12/25)
    pnt_id                 (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_flags              (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_line               (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_pixel              (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_height             (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_demheight          (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    ...                     ...
    pnt_std_linear         (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_quadratic      (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_seasonal       (space) object dask.array<chunksize=(25000,), meta=np.ndarray>
    deformation            (space, time) object dask.array<chunksize=(25000, 198), meta=np.ndarray>
    amplitude              (space, time) object dask.array<chunksize=(25000, 198), meta=np.ndarray>
    h2ph                   (space, time) object dask.array<chunksize=(25000, 198), meta=np.ndarray>

So all variables and coordinates (dimension coordinates excluded) have object dtype, and the chunk sizes match what was specified in the input.

The data types are resolved only after the dataset is written to Zarr. However, the chunks of the variables that depend on both space and time have been modified (see e.g. amplitude below):

import xarray as xr

stm.to_zarr(ZARR_STM_PATH, mode='w')
stm_ = xr.open_zarr(ZARR_STM_PATH)
print(stm_)
<xarray.Dataset>
Dimensions:                (space: 50000, time: 198)
Coordinates:
    lat                    (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    lon                    (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
  * space                  (space) int64 0 1 2 3 4 ... 49996 49997 49998 49999
  * time                   (time) int64 0 1 2 3 4 5 ... 192 193 194 195 196 197
Data variables: (12/25)
    amplitude              (space, time) float64 dask.array<chunksize=(6250, 25), meta=np.ndarray>
    deformation            (space, time) float64 dask.array<chunksize=(6250, 25), meta=np.ndarray>
    h2ph                   (space, time) float64 dask.array<chunksize=(6250, 25), meta=np.ndarray>
    pnt_ampconsist         (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_demheight          (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_demheight_highres  (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    ...                     ...
    pnt_seasonal_sin       (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_defo           (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_height         (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_linear         (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_quadratic      (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
    pnt_std_seasonal       (space) float64 dask.array<chunksize=(25000,), meta=np.ndarray>
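
One way to keep the on-disk chunks under explicit control, sketched here as an assumption and not as what stmtools or the fix for this issue does, is to build a per-variable encoding dictionary from the requested chunk sizes and pass it to to_zarr. The helper name chunk_encoding and its arguments are hypothetical; the dimension sizes are the ones from the dataset printed above.

```python
def chunk_encoding(var_dims, dim_sizes, chunksize):
    """Build a to_zarr-style encoding dict with explicit chunk shapes.

    var_dims:  {variable name: tuple of dimension names}
    dim_sizes: {dimension name: length}
    chunksize: {dimension name: chunk length, -1 meaning the whole dimension}

    Hypothetical helper, not part of stmtools or xarray.
    """
    encoding = {}
    for name, dims in var_dims.items():
        chunks = []
        for dim in dims:
            size = chunksize.get(dim, -1)
            # -1 (or a value larger than the dimension) means a single chunk
            if size == -1 or size > dim_sizes[dim]:
                size = dim_sizes[dim]
            chunks.append(size)
        encoding[name] = {"chunks": tuple(chunks)}
    return encoding


# Shapes taken from the dataset printed above
dim_sizes = {"space": 50000, "time": 198}
var_dims = {"amplitude": ("space", "time"), "pnt_height": ("space",)}
encoding = chunk_encoding(var_dims, dim_sizes, {"space": 25_000, "time": -1})
print(encoding)
```

With an xarray dataset, var_dims could be built as {k: v.dims for k, v in stm.data_vars.items()} and dim_sizes as dict(stm.sizes), and the result passed as stm.to_zarr(ZARR_STM_PATH, mode='w', encoding=encoding). Whether this is enough to avoid the rechunking shown above has not been verified here.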

My questions:

  1. Why is the chunk size modified when writing to Zarr?
  2. Can one figure out the correct data type already when loading the CSV file?
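
For question 2, a minimal sketch of one possible approach, assuming the type of a column can be guessed from a few of its text fields in the CSV file. This is not how stmtools resolves types; the function name infer_dtype is hypothetical.

```python
def infer_dtype(samples):
    """Guess a dtype name ('int64', 'float64' or 'object') for CSV text fields.

    Hypothetical helper: tries the strictest parser first and falls back to
    'object' when some field is neither an integer nor a float.
    """
    def all_parse(cast):
        try:
            for field in samples:
                cast(field)
            return True
        except ValueError:
            return False

    if not samples:
        return "object"  # nothing to inspect, so make no claim
    if all_parse(int):
        return "int64"
    if all_parse(float):
        return "float64"
    return "object"


print(infer_dtype(["1", "2", "3"]))      # integer-like fields
print(infer_dtype(["52.37", "4.89"]))    # float-like fields
print(infer_dtype(["L123_P456", "7"]))   # mixed fields stay object
```

The inferred name could then be applied per variable with astype, for example stm[name].astype(dtype), before the dataset is persisted or written.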

rogerkuou commented 10 months ago

Fixed by #46.