Closed: ayushkarnawat closed this issue 4 years ago.
Should we revert to the old style of serializing and loading the dataset as full tensors
rather than as individual examples? This should improve performance: loading a batch would simply mean loading each array and slicing it at the specified indices. NOTE: this is conjecture and needs to be verified more fully. See TensorDataset for details on how tensor datasets are loaded.
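A minimal sketch of the idea (the class name and fields here are hypothetical, not the actual dataset API): when each field is stored as one whole array, a batch lookup is just a slice of each array rather than assembling examples one by one.

```python
class FullTensorDataset:
    """Hypothetical dataset that stores each field as one whole array."""

    def __init__(self, *arrays):
        # All fields must have the same number of examples along axis 0.
        assert all(len(a) == len(arrays[0]) for a in arrays)
        self.arrays = arrays

    def __len__(self):
        return len(self.arrays[0])

    def __getitem__(self, idx):
        # `idx` may be an int or a slice; slicing the stored arrays is a
        # cheap view/copy of the batch, not per-example deserialization.
        return tuple(a[idx] for a in self.arrays)


features = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]
ds = FullTensorDataset(features, labels)
batch_x, batch_y = ds[0:5]  # a batch is just a slice of each array
```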
However, if we save the dataset in shards/batches (see #35), it becomes hard to save each example individually. Instead, we would have to concatenate each new array into its respective array, which may be awkward when arrays with the same names have already been saved. Note that this also requires checking that each new example array has the same size/shape as the examples in the previous batch.
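The shape check described above could look roughly like this (a hedged sketch, assuming shards are NumPy arrays stacked along axis 0; `append_example` is a hypothetical helper, not part of the codebase):

```python
import numpy as np


def append_example(shard: np.ndarray, example: np.ndarray) -> np.ndarray:
    """Append one new example to an existing shard array.

    The new example's shape must match the shard's per-example shape,
    otherwise concatenation would silently produce a ragged/invalid shard.
    """
    if example.shape != shard.shape[1:]:
        raise ValueError(
            f"example shape {example.shape} does not match shard "
            f"per-example shape {shard.shape[1:]}"
        )
    # Add a leading batch axis to the example, then stack onto the shard.
    return np.concatenate([shard, example[None, ...]], axis=0)


shard = np.zeros((4, 3))  # 4 previously saved examples, each of shape (3,)
shard = append_example(shard, np.ones(3))
```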
When attempting to slice a dataset the Pythonic way (i.e. `dataset[0:5]`) to obtain certain batches of data, we get the following error:
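The error text is not reproduced above, but one plausible cause (an assumption on my part) is a `__getitem__` that only handles integer indices. A hedged sketch of one way to support slices while keeping the per-example loading path:

```python
class SliceableDataset:
    """Hypothetical per-example dataset that also accepts slice indices."""

    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # Expand the slice into concrete integer indices so every
            # example still goes through the single-example code path.
            return [self.examples[i] for i in range(*idx.indices(len(self)))]
        return self.examples[idx]


ds = SliceableDataset(list(range(10)))
batch = ds[0:5]  # → [0, 1, 2, 3, 4]
```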