NeuroBench / neurobench

Benchmark harness and baseline results for the NeuroBench algorithm track.
https://neurobench.readthedocs.io
Apache License 2.0

Change data format of MackeyGlass Dataloader #78

Closed YounesBouhadjar closed 1 year ago

YounesBouhadjar commented 1 year ago

The MG dataset provides data of shape (1, # points, 1) and the MG Dataloader provides data of shape (# points, 1, 1). We need to fix this such that the Dataloader also provides data of shape (1, # points, 1), i.e., (batch_size=1, num_steps=#points, input_dim=1).
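For concreteness, a minimal sketch of how the two layouts relate (illustrative tensors only, not the actual dataset API):

import torch

data = torch.zeros(1, 1000, 1)                       # target layout: (batch_size=1, num_steps=1000, input_dim=1)
current = data.permute(1, 0, 2)                      # what the Dataloader yields today: (1000, 1, 1)
assert current.permute(1, 0, 2).shape == data.shape  # permuting back restores (1, 1000, 1)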

It is then the model's job to make sure it loops correctly through the time steps. For example, we could change the for loop in forward() in echo_state_network.py:

from

for sample in batch:

to something like:

for index in range(batch.size(1)):
    sample = batch[:, index, :]  # shape: (batch_size, input_dim)

or using torch chunk (https://pytorch.org/docs/stable/generated/torch.chunk.html):

for sample in batch.chunk(batch.size(1), dim=1):  # each sample keeps shape (batch_size, 1, input_dim)
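Put together, a minimal sketch of the revised forward (the reservoir update below is a hypothetical stand-in, not the actual code in echo_state_network.py):

import torch
import torch.nn as nn

class EchoStateSketch(nn.Module):
    # Hypothetical stand-in for the ESN, only to illustrate
    # iterating over the time dimension of a (1, num_steps, 1) batch.
    def __init__(self, input_dim=1, hidden_dim=32):
        super().__init__()
        self.w_in = nn.Linear(input_dim, hidden_dim)
        self.w_out = nn.Linear(hidden_dim, 1)

    def forward(self, batch):
        # batch: (batch_size=1, num_steps, input_dim=1)
        state = torch.zeros(batch.size(0), self.w_in.out_features)
        outputs = []
        for index in range(batch.size(1)):
            sample = batch[:, index, :]               # (batch_size, input_dim)
            state = torch.tanh(self.w_in(sample) + state)
            outputs.append(self.w_out(state))
        return torch.stack(outputs, dim=1)            # (batch_size, num_steps, 1)

With an input of shape (1, 1000, 1) this returns predictions of shape (1, 1000, 1), so the loop runs over time steps rather than over samples.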
jasonlyik commented 1 year ago

Could we also solve this issue by using the DataLoader for training, rather than indexing?

YounesBouhadjar commented 1 year ago

The most important fix is to change:

test_set_loader = DataLoader(test_set, batch_size=mg.testtime_pts, shuffle=False)

to

test_set_loader = DataLoader(test_set[:], batch_size=1, shuffle=False)

and then make the corresponding changes in the models' definitions, as I wrote above, because the model then needs to iterate over the second dimension.
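As a sanity check on the resulting shapes, a minimal sketch using a stand-in TensorDataset in place of the actual MG test set:

import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.randn(1, 1000, 1)        # stand-in for one MG time series: (1, num_steps, 1)
targets = torch.randn(1, 1000)        # one label per time step
test_set = TensorDataset(data, targets)

test_set_loader = DataLoader(test_set, batch_size=1, shuffle=False)
for batch, labels in test_set_loader:
    print(batch.shape, labels.shape)  # torch.Size([1, 1000, 1]) torch.Size([1, 1000])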

But, yes, for consistency I do prefer using a DataLoader for training as well.

Btw, the first dimension of the MG dataset refers to the number of time series. Most likely, in the near future we will extend the dataset so that it provides several time series instead of 1. At that point the batch size could be larger than 1, so perhaps we should make the batch size a variable in:

test_set_loader = DataLoader(test_set[:], batch_size=batch_size, shuffle=False)
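If the dataset later provides several time series, the same stand-in sketch extends naturally:

import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.randn(8, 1000, 1)                  # stand-in for 8 MG time series
loader = DataLoader(TensorDataset(data), batch_size=4, shuffle=False)
for (batch,) in loader:
    print(batch.shape)                          # torch.Size([4, 1000, 1])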
jasonlyik commented 1 year ago

If the batch dimension refers to the number of time series, doesn't this then mean that the same model would be expected to operate on multiple time series at the same time, or that the same model would need to predict multiple time series without any fitting adjustment?

YounesBouhadjar commented 1 year ago

> doesn't this then mean that the same model would be expected to operate on multiple time series at the same time,

Yes, indeed. It could still be that the different time series come from a set of MG time series with the same parameters but different initial conditions, or are small chunks of the same time series. @denkle, what do you think? Does it make sense for a future version of the MG dataset to support generating multiple time series?

But irrespective of this, I think I'd still refer to the number of time series as the batch size when defining the DataLoader; so for now, our batch size is equal to 1.

jasonlyik commented 1 year ago

One possible issue with the (1, # points, 1) data format is that, by our convention, labels should have the same batch dimension as the input data.

Therefore, if we consider the input data to have batch_size = 1, corresponding to one time series, then the label shape would be (1, # points). The model must then output exactly that many points in a single forward pass, so that its output has the same shape as the label.

If we were to have models that predict multiple time series at the same time, it seems to me that we should represent this via the feature dimensions -> (# of points to predict, 1, time-series-1, time-series-2, ...)
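As a hedged illustration of that shape contract (the model below just echoes its input; a real model would predict the next points):

import torch

model = lambda x: x.squeeze(-1)            # stand-in predictor, not a real model

batch = torch.randn(1, 1000, 1)            # (batch_size=1, num_steps, input_dim=1)
labels = torch.randn(1, 1000)              # labels share the batch dimension with the input
assert model(batch).shape == labels.shape  # one forward pass must match the label shape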

jasonlyik commented 1 year ago

@YounesBouhadjar Based on discussion today I will close this issue. The MG dataloader output should be (B, M, D):

B = batch size, the number of predicted points
M = bin_window, the number of prior points provided for each prediction
D = features
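A hedged sketch of how such (B, M, D) batches could be assembled from a raw series (series and bin_window here are illustrative, not the actual MG dataloader API):

import torch

series = torch.randn(1000, 1)                # raw series: (num_points, D=1)
bin_window = 50                              # M prior points per predicted point
windows = series.unfold(0, bin_window, 1)    # (B, D, M) with B = 1000 - bin_window + 1
batch = windows.permute(0, 2, 1)             # (B, M, D), matching the convention above
print(batch.shape)                           # torch.Size([951, 50, 1])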