Dataset

GabrielBG0 commented 8 months ago

The Dataset class is responsible for loading data, preprocessing it, and returning it in a format suitable for training. To implement a new dataset, PyTorch provides the abstract class torch.utils.data.Dataset, which new datasets should inherit from. This class represents a map-style dataset, i.e., a mapping from indices to samples, and subclasses are expected to implement two methods: __len__ and __getitem__. The __len__ method should return the size of the dataset, i.e., the number of samples, and the __getitem__ method should return the sample at index i.

Below are two examples of datasets that inherit from the torch.utils.data.Dataset class:


from typing import List, Tuple

import numpy as np
import PIL.Image  # importing the submodule makes PIL.Image.open available
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, data: np.ndarray):
        """
        Args:
            data (np.ndarray): 2D array of data, where the first dimension is 
            the samples and the second dimension is the features.
        """
        self.data = data

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> np.ndarray:
        """
        Args:
            idx (int): index of the sample to return.
        Returns:
            np.ndarray: the sample at the given index (1D numpy array).
        """
        return self.data[idx]

class LabeledImageDataset(Dataset):
    def __init__(self, image_files: List[str], labels: List[int]):
        """
        Args:
            image_files (List[str]): list of paths to image files.
            labels (List[int]): list of labels.
        """
        self.image_files = image_files
        self.labels = labels

    def __len__(self) -> int:
        return len(self.image_files)

    def __getitem__(self, idx: int) -> Tuple[PIL.Image.Image, int]:
        """
        Args:
            idx (int): index of the sample to return.
        Returns:
            Tuple[PIL.Image.Image, int]: the image and its label.
        """
        image = PIL.Image.open(self.image_files[idx])
        label = self.labels[idx]
        return (image, label)

dataset_1 = SimpleDataset(np.random.rand(100, 10))
len(dataset_1)  # 100
sample = dataset_1[0]  # 1D numpy array

dataset_2 = LabeledImageDataset(["image1.jpg", "image2.jpg"], [0, 1])
len(dataset_2)  # 2
sample = dataset_2[0]  # (PIL.Image.Image, int)

These two examples differ in what a sample is. In the first, each sample is a 1D numpy array; in the second, each sample is an (image, label) pair. The Dataset class is therefore flexible enough to handle different types of data.

Responsibilities of the `Dataset` class

The flip side of this flexibility is that the Dataset API does not strictly define how samples should be returned, nor how data should be loaded, organized, and read, as this depends on the problem being solved.

In summary, an object of the Dataset class is responsible for handling four factors:

1. Storage structure: how data is organized on the storage device. For example, images are usually organized in directories, where each directory represents a class and contains the images of that class; volumetric data can be a single 3D image file or a collection of 2D image files.
2. Data loading: how data is read from the storage device. For example, images can be loaded with PIL or OpenCV, and volumetric data with NumPy or Zarr. In general, loading is done by functions that read data from the storage device and return it in a format suitable for preprocessing. The loading method is typically tied to the file extensions used.
3. Data preprocessing: the operations performed on the data before it is returned. This includes file selection and filtering, as well as per-sample transformations such as normalization, resizing, cropping, and rotation.
4. Data return: the form in which samples are returned, which may differ between datasets. For example, an image classification dataset may return a pair of tensors (image, label); an image segmentation dataset may return a pair of tensors (image, mask); an unsupervised learning dataset may return a single tensor, the image.
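
As a rough illustration of how these four responsibilities can show up in a single class, here is a minimal sketch. The class name `FolderTextDataset`, the `read_text` loader, the one-directory-per-class layout, and the `loader`/`transform` parameters are all assumptions made for this example; they are not part of Minerva or PyTorch. Plain text files stand in for images so the sketch runs without an image library, and the `try`/`except` only exists so it also runs where PyTorch is not installed.

```python
from pathlib import Path
from typing import Callable, List, Tuple

try:
    from torch.utils.data import Dataset
except ImportError:  # fallback so the sketch also runs without PyTorch
    Dataset = object


def read_text(path: Path) -> str:
    """Hypothetical loader: reads one sample from the storage device."""
    return path.read_text()


class FolderTextDataset(Dataset):
    """Hypothetical dataset: one sub-directory per class, one file per sample."""

    def __init__(
        self,
        root: str,
        loader: Callable[[Path], str] = read_text,
        transform: Callable[[str], str] = str.strip,
    ):
        # 1. Storage structure: each sub-directory of `root` is a class.
        class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
        self.classes: List[str] = [d.name for d in class_dirs]
        # File selection: keep regular files only, in a stable order.
        self.samples: List[Tuple[Path, int]] = [
            (f, label)
            for label, d in enumerate(class_dirs)
            for f in sorted(d.iterdir())
            if f.is_file()
        ]
        self.loader = loader        # 2. Data loading
        self.transform = transform  # 3. Data preprocessing

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> Tuple[str, int]:
        path, label = self.samples[idx]
        data = self.transform(self.loader(path))
        # 4. Data return: a (sample, label) pair.
        return data, label
```

Swapping `loader` for an image reader and `transform` for a resize or normalization step would give an image classification variant without touching the storage-structure logic.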

I suggest that our datasets follow these responsibilities and implement the Dataset class API appropriately for the problem being solved.

However, I still can't see how this can be done in a generic way, since the structure of the storage device data, data loading, and data preprocessing are specific to each problem. Perhaps we can first discuss these responsibilities and create abstract or intermediate interfaces to handle them.

otavioon commented 8 months ago

Note that a Dataset should perform its loading operations through a Reader. Thus, readers should be parameters of generic datasets.
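
One possible reading of this suggestion, as a minimal sketch: the dataset receives reader objects and optional transforms, and only combines them. The `Reader` protocol, `ListReader`, and `GenericDataset` below are hypothetical names chosen for illustration and should not be taken as the project's actual implementation. The `try`/`except` is only there so the sketch runs where PyTorch is not installed.

```python
from typing import Any, Callable, Optional, Protocol, Sequence, Tuple

try:
    from torch.utils.data import Dataset
except ImportError:  # fallback so the sketch also runs without PyTorch
    Dataset = object


class Reader(Protocol):
    """Hypothetical interface: indexed access to the units of one data source."""

    def __len__(self) -> int: ...

    def __getitem__(self, idx: int) -> Any: ...


class ListReader:
    """Toy reader backed by an in-memory list (stands in for a file reader)."""

    def __init__(self, items: Sequence[Any]):
        self.items = list(items)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> Any:
        return self.items[idx]


class GenericDataset(Dataset):
    """Hypothetical dataset that pairs each reader with an optional transform."""

    def __init__(
        self,
        readers: Sequence[Reader],
        transforms: Optional[Sequence[Optional[Callable]]] = None,
    ):
        if not readers or len({len(r) for r in readers}) != 1:
            raise ValueError("need at least one reader, all of the same length")
        self.readers = list(readers)
        if transforms is None:
            transforms = [None] * len(self.readers)
        if len(transforms) != len(self.readers):
            raise ValueError("expected one transform (or None) per reader")
        self.transforms = list(transforms)

    def __len__(self) -> int:
        return len(self.readers[0])

    def __getitem__(self, idx: int) -> Tuple[Any, ...]:
        out = []
        for reader, transform in zip(self.readers, self.transforms):
            item = reader[idx]  # loading is delegated to the reader
            out.append(transform(item) if transform is not None else item)
        return tuple(out)


images = ListReader([[1, 2], [3, 4]])
labels = ListReader([0, 1])
dataset = GenericDataset([images, labels], transforms=[lambda x: [v * 2 for v in x], None])
sample = dataset[1]  # ([6, 8], 1)
```

With this split, the dataset itself knows nothing about file formats or directory layouts; those concerns stay in the readers, which is what makes the dataset generic.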

otavioon commented 8 months ago

I made a Generic Dataset class that uses readers and transforms as we have discussed. Check #23 and see if it makes sense. We can discuss here or in the PR.

discovery-unicamp / Minerva

Dataset #13
