Hi @jaycunningham-8451,
If you are interested in large data settings, we would need to modify the dataloader class: in particular, one would need to modify the `__getitem__(self, idx)` method to load from directories rather than parsing chunks of a pandas panel dataframe.
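For illustration, a minimal sketch of what such a directory-backed `__getitem__` could look like (the per-series parquet layout, column names, and class name here are assumptions for the sketch, not the existing API):

```python
import numpy as np
import pandas as pd
from torch.utils.data import Dataset


class DirectoryTimeSeriesDataset(Dataset):
    """Hypothetical dataset that keeps one parquet file per series on disk."""

    def __init__(self, series_paths):
        # series_paths: list of parquet files, one per unique_id (assumption)
        self.series_paths = series_paths

    def __len__(self):
        return len(self.series_paths)

    def __getitem__(self, idx):
        # Read a single series from disk instead of slicing an in-memory panel
        df = pd.read_parquet(self.series_paths[idx])
        # Assumed long format: drop id/time columns, keep target + exogenous values
        temporal = df.drop(columns=["unique_id", "ds"]).to_numpy(dtype=np.float32)
        return temporal
```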
We would be really interested in collaborating for such modifications. Is your dataset public?
@kdgutier, apologies for ghosting you here. Some other work got in the way.
Unfortunately, my dataset is not public. It's very large, though, made up of thousands of individual time series with additional covariates, so it's infeasible to read this into memory at once.
I've been reading through some of the source implementing `TimeSeriesDataset`, `TimeSeriesLoader`, and `TimeSeriesDataModule`. After some experimentation, it appears that the data being fed into windowed models is the entire time series for each unique id (up to the batch size). Is this time series then converted into individual windows?
I think it would be easy enough, for my purposes, to fork NeuralForecast and replace the data loader with one that reads from parquet via Petastorm, though some function signatures would have to change as well. Properly adding support for custom loaders would be more difficult.
Hey @jaycunningham-8451,
As you mentioned, our dataset reads the entire series and feeds them into `WindowsBase`- or `RecurrentBase`-style models.
The idea that comes to mind is to mimic what people do in the computer vision literature, where single photos are read one at a time; it might be possible to transform the numpy parsing operations in `TimeSeriesDataset.__getitem__` accordingly.
Here is some documentation: PyTorch large data generator.
This option would require a new method to create specialized files indexed by the unique ids, which could easily be parsed back by `__getitem__`.
Is this similar to what you think?
That approach would work in general. In our case, the data are stored in the cloud in Parquet format, but I could certainly rig up some ETL to write out one time series per file.
That said, in the past I've used Petastorm for this purpose, which allows me to stream directly from a dataset in cloud storage. All I have to do is set it up so that it's one row per time series (with vectors for time-varying data), and Petastorm will stream those to me in batches as tensors. That's why I was thinking in terms of providing custom dataloaders rather than implementing some codified way to stream data as needed. I don't know how easy that is given the design of NeuralForecast, however, nor have I thought deeply about what changes to the API this would necessitate.
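As a rough illustration of that streaming setup (the dataset URL and column names below are assumptions), a Petastorm reader can be wrapped in its PyTorch `DataLoader` roughly like this:

```python
# Sketch of streaming a Parquet dataset with Petastorm into PyTorch batches.
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

dataset_url = "s3://my-bucket/series.parquet"  # hypothetical location

# make_batch_reader streams Parquet row groups without loading the whole
# dataset into memory; DataLoader converts each batch to torch tensors.
with make_batch_reader(dataset_url) as reader:
    loader = DataLoader(reader, batch_size=32)
    for batch in loader:
        # Each row is assumed to hold one series, with array columns for the
        # target and time-varying covariates.
        y = batch["y"]
        # ... feed the batch into the model's training step ...
```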
That said, I'm not a deep learning expert. I'm happy to go with what you think is best for this task, and I can work with you in any case.
I have not yet worked on datasets large enough to require going beyond local disk. Would it be possible to first try a version of the loader that works on single series, assuming we read from disk? I picture that scaling that code towards reading from the cloud could be very easy.
I think the most important part would be to have a parsing function that creates such a dataset.
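A minimal sketch of such a parsing function, assuming a long-format frame with `unique_id`/`ds` columns and an output directory of our choosing:

```python
import os

import pandas as pd


def write_series_files(df: pd.DataFrame, out_dir: str, id_col: str = "unique_id") -> list:
    """Hypothetical helper: split a long-format panel into one parquet file per series."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for uid, group in df.groupby(id_col, sort=True):
        path = os.path.join(out_dir, f"{uid}.parquet")
        group.sort_values("ds").to_parquet(path, index=False)
        paths.append(path)
    return paths
```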
I would be interested in implementing this. Is there documentation on how `TimeSeriesLoader` can be used with the `NeuralForecast` package?
Hey @all, is there any update in that regard? I am facing the same issue and need a custom large-scale dataloader.
Best.
Hi, we're also facing the same issue: our dataset contains thousands of time series, each of which fits into memory on its own, but which, when combined into a single pandas dataframe, is too large.
Having had a look at the code, it looks like a way forward would indeed be to go the route suggested by @kdgutier.
We're thinking of creating a new class `IterativeTimeSeriesDataset` that extends either `Dataset` or `TimeSeriesDataset`. We can use the existing `_FilesDataset` to help with this. The `__getitem__` method will then just read the parquet file corresponding to that index, and we will replace `from_df()` with a `from_data_directory()` method that has all the same functionality.
This would be something like
```python
import numpy as np
import pandas as pd
import utilsforecast.processing as ufp
from torch.utils.data import Dataset

# assuming the existing file-based helper used in the distributed code path
from neuralforecast.tsdataset import _FilesDataset


class IterativeTimeSeriesDataset(Dataset):
    def __init__(
        self,
        files_ds: _FilesDataset,
        y_idx: int,
        static=None,
        static_cols=None,
        sorted=False,
    ):
        super().__init__()
        ...
        self.n_groups = len(files_ds.files)
        ...

    def __getitem__(self, idx):
        if isinstance(idx, int):
            ...
            # read only the series at this index from its own parquet file
            df = pd.read_parquet(self.files_ds.files[idx])
            _, _, data, _, _ = ufp.process_df(
                df, self.files_ds.id_col, self.files_ds.time_col, self.files_ds.target_col
            )
            temporal = data.astype(np.float32, copy=False)
            ...
```
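A `from_data_directory()` constructor along these lines might look something like the sketch below; the exact `_FilesDataset` fields, the default column names, and the `y_idx` handling are assumptions that would need to be checked against `tsdataset.py`:

```python
import glob
import os

import pandas as pd
from torch.utils.data import Dataset

from neuralforecast.tsdataset import _FilesDataset  # assumed location


class IterativeTimeSeriesDataset(Dataset):
    # ... __init__ and __getitem__ as sketched above ...

    @classmethod
    def from_data_directory(
        cls,
        directory: str,
        id_col: str = "unique_id",
        time_col: str = "ds",
        target_col: str = "y",
    ):
        """Hypothetical counterpart to from_df(): one parquet file per series."""
        files = sorted(glob.glob(os.path.join(directory, "*.parquet")))
        # Infer the temporal columns from the first file (illustrative only)
        first = pd.read_parquet(files[0])
        temporal_cols = [c for c in first.columns if c not in (id_col, time_col)]
        # NOTE: the _FilesDataset fields below are assumed for illustration;
        # check its actual definition before relying on this.
        files_ds = _FilesDataset(
            files=files,
            temporal_cols=temporal_cols,
            id_col=id_col,
            time_col=time_col,
            target_col=target_col,
            min_size=0,
        )
        return cls(files_ds=files_ds, y_idx=temporal_cols.index(target_col))
```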
One thing to note is that the `scaler_transform` and `predict_insample` methods in `core.py` seem to rely heavily on being able to access the `TimeSeriesDataset.temporal` attribute, which won't be available in this new class. For the moment we can raise a `NotImplementedError` here, similar to what's done for `_fit_distributed` currently.
The only other issue is the `uids`, `last_dates`, and `ds` attributes created in `from_df`, which depend on the full set of underlying data. It looks like these aren't returned in `prepare_fit_distributed` - could we do the same for our use case?
Open to suggestions on how we can best fit this in with the existing API. Ideally we could subclass `TimeSeriesDataset`, but it seems pretty spec'd to an in-memory solution at present.
Hi, just flagging that I have now created a PR to handle the case where the dataset is split across multiple parquet files, each containing a pandas or polars dataframe: https://github.com/Nixtla/neuralforecast/pull/1049.
The documentation for neuralforecast appears geared toward the use of pandas data frames.
I'm working on a time series forecasting problem for which the data cannot fit into memory on a single machine, and I've used Petastorm in the past for this purpose. Is it possible to prepare a dataset manually as a dataloader and feed it to a neuralforecast model for training?
I'm happy to do any modifications necessary; I just want to make sure that it's reasonable. From a cursory reading of the source, it appears that the data frame assumption is fairly well embedded (but I could be incorrect).