discovery-unicamp / Minerva

Minerva is a framework for training machine learning models for researchers.
https://discovery-unicamp.github.io/Minerva/
MIT License

Dataset #13

Closed GabrielBG0 closed 8 months ago

GabrielBG0 commented 8 months ago

Dataset

The Dataset class is responsible for loading data, preprocessing it, and returning it in a format suitable for training. To implement a new dataset, PyTorch provides an abstract class, torch.utils.data.Dataset, that new datasets should inherit from. This class represents a mapping from indices to samples, and a subclass is expected to implement two methods: __len__ and __getitem__. The __len__ method should return the size of the dataset, i.e., the number of samples, and the __getitem__ method should return the sample at index i.

Below are two examples of datasets that inherit from the torch.utils.data.Dataset class:


from typing import List, Tuple

import numpy as np
import PIL.Image  # "import PIL" alone does not load the Image submodule
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, data: np.ndarray):
        """
        Args:
            data (np.ndarray): 2D array of data, where the first dimension is 
            the samples and the second dimension is the features.
        """
        self.data = data

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> np.ndarray:
        """
        Args:
            idx (int): index of the sample to return.
        Returns:
            np.ndarray: the sample at the given index (1D numpy array).
        """
        return self.data[idx]

class LabeledImageDataset(Dataset):
    def __init__(self, image_files: List[str], labels: List[int]):
        """
        Args:
            image_files (List[str]): list of paths to image files.
            labels (List[int]): list of labels.
        """
        self.image_files = image_files
        self.labels = labels

    def __len__(self) -> int:
        return len(self.image_files)

    def __getitem__(self, idx: int) -> Tuple[PIL.Image.Image, int]:
        """
        Args:
            idx (int): index of the sample to return.
        Returns:
            Tuple[PIL.Image.Image, int]: the image and its label.
        """
        image = PIL.Image.open(self.image_files[idx])
        label = self.labels[idx]
        return (image, label)

dataset_1 = SimpleDataset(np.random.rand(100, 10))
len(dataset_1)  # 100
sample = dataset_1[0]  # 1D numpy array

dataset_2 = LabeledImageDataset(["image1.jpg", "image2.jpg"], [0, 1])
len(dataset_2)  # 2
sample = dataset_2[0]  # (PIL.Image.Image, int)

Note that the two examples differ in what a sample is. In the first, a simple dataset, each sample is a 1D numpy array. In the second, an image dataset, each sample is an (image, label) pair. The Dataset class is thus flexible enough to handle different types of data.

Responsibilities of the Dataset class

This flexibility has a counterpart: the Dataset API does not strictly define how samples should be returned, nor how data should be loaded, organized, and read, since all of this depends on the problem being solved.

In summary, an object of the Dataset class is responsible for handling four factors:

I suggest that our datasets follow these responsibilities and implement the Dataset class API appropriately for the problem being solved.

However, I still can't see how this can be done in a generic way, since the structure of the data on the storage device, the data loading, and the data preprocessing are all specific to each problem. Perhaps we can first discuss these responsibilities and then create abstract or intermediate interfaces to handle them.
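
As a starting point for that discussion, here is a rough sketch of what an intermediate interface could look like. Everything in it is hypothetical: the names BaseDataset, load, and preprocess are made up for illustration and are not part of PyTorch or Minerva. It only shows one way to separate loading from preprocessing while keeping the __len__/__getitem__ API. A real version would presumably also inherit from torch.utils.data.Dataset; that base class is left out so the sketch runs without PyTorch installed.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class BaseDataset(ABC):
    """Hypothetical intermediate interface (illustration only)."""

    @abstractmethod
    def __len__(self) -> int:
        """Return the number of samples available in storage."""

    @abstractmethod
    def load(self, idx: int) -> Any:
        """Read the raw sample at position idx from storage."""

    def preprocess(self, raw: Any) -> Any:
        """Problem-specific preprocessing; identity by default."""
        return raw

    def __getitem__(self, idx: int) -> Any:
        # Fixed pipeline: load the raw sample, then preprocess it.
        return self.preprocess(self.load(idx))


class InMemoryDataset(BaseDataset):
    """Toy subclass: the 'storage' is a Python list, preprocessing doubles."""

    def __init__(self, values: List[float]):
        self.values = values

    def __len__(self) -> int:
        return len(self.values)

    def load(self, idx: int) -> float:
        return self.values[idx]

    def preprocess(self, raw: float) -> float:
        return raw * 2.0


toy = InMemoryDataset([1.0, 2.0, 3.0])
print(len(toy), toy[1])
```

With this shape, each concrete dataset only decides how to find and read its data (load) and how to prepare it (preprocess), while __getitem__ stays the same everywhere. Whether inheritance is the right mechanism here, as opposed to passing the loading logic in as a parameter, is exactly the kind of thing to discuss.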

otavioon commented 8 months ago

Note that a Dataset must perform its loading operations through a Reader. Thus, readers should be parameters of Generic Datasets.

otavioon commented 8 months ago

I made a Generic Dataset class that uses readers and transforms as we have discussed. Check #23 and see if it makes sense. We can discuss here or in the PR.
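
For readers of this thread who have not opened the PR, the idea of a dataset built from readers and transforms can be sketched roughly as follows. This is not the code from #23. The Reader protocol, ListReader, and GenericDataset below are hypothetical names chosen for illustration, and the sketch assumes one optional transform per reader. It also does not subclass torch.utils.data.Dataset, so that it runs without PyTorch installed.

```python
from typing import Any, Callable, Optional, Protocol, Sequence, Tuple


class Reader(Protocol):
    """Hypothetical reader interface: an indexable source of raw items."""

    def __len__(self) -> int: ...

    def __getitem__(self, idx: int) -> Any: ...


class ListReader:
    """Toy reader backed by an in-memory list (stands in for files on disk)."""

    def __init__(self, items: Sequence[Any]):
        self.items = list(items)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> Any:
        return self.items[idx]


class GenericDataset:
    """Hypothetical dataset that delegates loading to readers."""

    def __init__(
        self,
        readers: Sequence[Reader],
        transforms: Optional[Sequence[Optional[Callable[[Any], Any]]]] = None,
    ):
        if not readers:
            raise ValueError("at least one reader is required")
        if len({len(r) for r in readers}) != 1:
            raise ValueError("all readers must have the same length")
        if transforms is None:
            transforms = [None] * len(readers)
        if len(transforms) != len(readers):
            raise ValueError("expected one transform (or None) per reader")
        self.readers = list(readers)
        self.transforms = list(transforms)

    def __len__(self) -> int:
        return len(self.readers[0])

    def __getitem__(self, idx: int) -> Tuple[Any, ...]:
        sample = []
        for reader, transform in zip(self.readers, self.transforms):
            item = reader[idx]  # loading is the reader's job
            if transform is not None:
                item = transform(item)  # preprocessing is the transform's job
            sample.append(item)
        return tuple(sample)


features = ListReader([1, 2, 3])
labels = ListReader(["a", "b", "c"])
dataset = GenericDataset([features, labels], transforms=[lambda x: x * 10, None])
print(len(dataset), dataset[1])
```

Here the dataset itself knows nothing about storage or preprocessing: swapping ListReader for a reader that opens image files, or changing a transform, would not require touching GenericDataset.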