GabrielBG0 closed 8 months ago
Note that the Dataset must perform its loading operations through a Reader. Thus, readers should be parameters of Generic Datasets.
I made a Generic Dataset class that uses readers and transforms as we have discussed. Check #23 and see if it makes sense. We can discuss here or in the PR.
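To make the idea concrete, here is a rough sketch of what a generic dataset parameterized by readers and transforms could look like. This is not the code from #23: `Reader`, `GenericDataset`, and their arguments are hypothetical names, and any sized, indexable object (even a plain list) is assumed to count as a reader. The `try`/`except` around the import is only there so the sketch also runs without PyTorch installed.

```python
from typing import Any, Callable, Optional, Protocol, Sequence, Tuple

try:
    from torch.utils.data import Dataset
except ImportError:  # fallback so the sketch runs without PyTorch
    Dataset = object


class Reader(Protocol):
    """Hypothetical reader interface: anything sized and indexable."""

    def __len__(self) -> int: ...

    def __getitem__(self, index: int) -> Any: ...


class GenericDataset(Dataset):
    """Hypothetical dataset that delegates loading to its readers."""

    def __init__(
        self,
        readers: Sequence[Reader],
        transforms: Optional[Sequence[Optional[Callable[[Any], Any]]]] = None,
    ) -> None:
        if not readers:
            raise ValueError("at least one reader is required")
        if len({len(reader) for reader in readers}) != 1:
            raise ValueError("all readers must have the same length")
        if transforms is None:
            transforms = [None] * len(readers)
        if len(transforms) != len(readers):
            raise ValueError("expected one transform (or None) per reader")
        self.readers = list(readers)
        self.transforms = list(transforms)

    def __len__(self) -> int:
        # All readers have the same length, so the first one is enough.
        return len(self.readers[0])

    def __getitem__(self, index: int) -> Tuple[Any, ...]:
        sample = []
        for reader, transform in zip(self.readers, self.transforms):
            item = reader[index]  # loading is the reader's job
            if transform is not None:
                item = transform(item)  # preprocessing is the transform's job
            sample.append(item)
        return tuple(sample)


# Usage: plain lists stand in for real readers.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = [0, 1, 0]
dataset = GenericDataset(
    readers=[features, labels],
    transforms=[lambda x: [2 * v for v in x], None],
)
```

With this shape, the dataset itself only pairs items up by index; what a sample looks like is decided entirely by which readers and transforms are passed in.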
Dataset

The `Dataset` class is responsible for loading data, preprocessing it, and returning it in a suitable format for training. To implement a new dataset, PyTorch suggests an abstract class called `torch.utils.data.Dataset` that should be subclassed. This class defines a dataset mapping indices to samples and expects two methods to be implemented: `__len__` and `__getitem__`. The `__len__` method should return the size of the dataset, i.e., the number of samples, and the `__getitem__` method should return the sample at index `i` from the dataset.

Below are two examples of datasets that inherit from the `torch.utils.data.Dataset` class:

Note that we have two examples of datasets. The first example is a simple dataset, where the samples are represented by a 1D numpy array. The second example is an image dataset, where the samples are represented by a pair of image and label. Thus, it is worth noting that the `Dataset` class is flexible enough to handle different types of data.

Responsibilities of the `Dataset` class

Although the `Dataset` class is flexible enough to handle different types of data, its API does not strictly define how samples should be returned, nor how data should be loaded, organized, and read, as this depends on the problem being solved.

In summary, an object of the `Dataset` class is responsible for handling four factors:

[...] `PIL` or `opencv` library; when dealing with volumetric data, data can be loaded using the `numpy` or `zarr` library. In general, loading is done through functions that read data from the storage device and return it in a suitable format for preprocessing. Typically, the loading method is closely related to the file extensions used.

I suggest that our datasets follow these responsibilities and implement the `Dataset` class API appropriately for the problem being solved.

However, I still can't see how this can be done in a generic way, since the structure of the data on the storage device, data loading, and data preprocessing are specific to each problem. Perhaps we can first discuss these responsibilities and create abstract or intermediate interfaces to handle them.
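One possible starting point for that discussion is a minimal sketch that keeps storage structure, loading, and preprocessing as separate, swappable pieces. Everything here is an assumption made for illustration (`FolderDataset`, the `LOADERS` registry, the flat-directory layout); text and JSON loaders stand in for `PIL`, `numpy`, or `zarr` so the example runs with the standard library alone.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Optional

try:
    from torch.utils.data import Dataset
except ImportError:  # fallback so the sketch runs without PyTorch
    Dataset = object

# Loading: one function per file extension (hypothetical registry).
LOADERS: Dict[str, Callable[[Path], Any]] = {
    ".txt": lambda path: path.read_text(),
    ".json": lambda path: json.loads(path.read_text()),
}


class FolderDataset(Dataset):
    """Hypothetical dataset: one sample per file in a flat directory."""

    def __init__(
        self,
        root: str,
        transform: Optional[Callable[[Any], Any]] = None,
        loaders: Optional[Dict[str, Callable[[Path], Any]]] = None,
    ) -> None:
        self.loaders = LOADERS if loaders is None else loaders
        self.transform = transform
        # Structure: keep only files whose extension has a registered loader.
        self.files = sorted(
            p for p in Path(root).iterdir()
            if p.is_file() and p.suffix in self.loaders
        )

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, index: int) -> Any:
        path = self.files[index]
        data = self.loaders[path.suffix](path)  # loading, chosen by extension
        if self.transform is not None:
            data = self.transform(data)  # preprocessing
        return data


# Usage: build a throwaway directory and read every sample from it.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("hello")
    (Path(tmp) / "b.json").write_text('{"x": 1}')
    (Path(tmp) / "c.bin").write_bytes(b"\x00")  # skipped: no loader registered
    dataset = FolderDataset(tmp)
    samples = [dataset[i] for i in range(len(dataset))]
```

The problem-specific parts (directory layout, loader functions, transform) are the only things a new dataset would need to swap out, which may be a useful way to frame which interfaces are worth making abstract.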