ec-jrc / pyPoseidon

Framework for Hydrodynamic simulations
https://pyposeidon.readthedocs.io/
European Union Public License 1.2
20 stars 8 forks source link

Split meteo into daily chunks #159

Closed brey closed 1 year ago

brey commented 1 year ago

Currently we have a way to split the 3 day meteo dataset in daily chunks with

                if split_by:
                    times, datasets = zip(*self.meteo.Dataset.groupby("time.{}".format(split_by)))
                    mpaths = ["sflux_air_{}.{:04d}.nc".format(m_index, t + 1) for t in np.arange(len(times))]
                    for das, mpath in list(zip(datasets, mpaths)):
                        self.to_force(
                            das, vars=["msl", "u10", "v10"], rpath=path, filename=mpath, date=self.rdate, **kwargs
                        )
                else:
                    self.to_force(self.meteo.Dataset, vars=["msl", "u10", "v10"], rpath=path, **kwargs)

which outputs 4 daily chunks.

However, there is a problem with the groupby functionality.

One can see it below:

import pandas as pd

import xarray as xr

tt = pd.date_range('2023-5-30','2023-6-2')

tt.values

    array(['2023-05-30T00:00:00.000000000', '2023-05-31T00:00:00.000000000',
           '2023-06-01T00:00:00.000000000', '2023-06-02T00:00:00.000000000'],
          dtype='datetime64[ns]')

d = xr.Dataset(
    {
        "time": (("time"), tt.values),
    }
)

d.groupby("time.day")

    DatasetGroupBy, grouped over 'day'
    4 groups with labels 1, 2, 30, 31.

d.groupby("time.dayofyear")

    DatasetGroupBy, grouped over 'dayofyear'
    4 groups with labels 150, 151, 152, 153.

tt2 = pd.date_range('2023-12-30','2024-1-2')

d2 = xr.Dataset(
    {
        "time": (("time"), tt2.values),
    }
)

d2.groupby("time.dayofyear")

    DatasetGroupBy, grouped over 'dayofyear'
    4 groups with labels 1, 2, 364, 365.

Using dayofyear solves it for the monthly occurrence but we get the same for cross years.

@pmav99 @tomsail Any ideas?

pmav99 commented 1 year ago
times = pd.date_range("2023-12-30", "2024-01-02", freq="h")
ds = xr.Dataset(
    {
        "time": (("time"), times),
    }
)
ds.resample(time="1D")

source: https://stackoverflow.com/a/54433261/592289

brey commented 1 year ago

@pmav99 It works. Since we have no gaps there is no problem with resample. Thanks!

pmav99 commented 1 year ago

By the way, since from our tests we saw that split by day gives some performance boost, should we perhaps enable it by default?

brey commented 1 year ago

I don't think so. If there is a gap in time, resample will interpolate, thus increasing the wall time. Also, if the file is small (regional window) there might not be a benefit. We should document this well and let the user decide.