R1chrdson / vesuvius_challenge

The Vesuvius Challenge - Ink Detection Kaggle competition from the "UCU dropouts" team

Implement training DataSet logic #8

Open R1chrdson opened 1 year ago

R1chrdson commented 1 year ago
  1. Prepare a script that converts the TIFF data volume format provided on Kaggle into a format optimized for data loading (see the preprocessing sketch at the bottom of this comment).

    • The script should generate a folder of files derived from the initial dataset, taking into account the tile size used to split the whole fragment
    • There should also be logic to skip tiles that fall outside the fragment mask
    • It should support k-fold cross-validation splits. Ideally the training data would be organized into k fold folders plus a hold-out set. It's OK to perform the split during the preprocessing step.
  2. Implement the dataset logic using PyTorch's Dataset class: from torch.utils.data import Dataset

    • Basically, you should implement the __getitem__ and __len__ methods with respect to the data format defined in step 1 (see the sketch right after this list).
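
A minimal sketch of step 2, assuming the preprocessing script from step 1 writes each fragment to a folder containing volume.npy (the stacked TIFF slices), labels.npy (the binary ink labels), and tiles.json (valid tile coordinates with fold assignments). All of these file names, and the VesuviusTileDataset class itself, are hypothetical:

```python
import json
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import Dataset


class VesuviusTileDataset(Dataset):
    """Reads tiles of one preprocessed fragment (hypothetical on-disk layout described above)."""

    def __init__(self, fragment_dir, tile_size=256, folds=None):
        fragment_dir = Path(fragment_dir)
        self.tile_size = tile_size
        # mmap_mode="r" keeps the arrays on disk and reads only the slices we index
        self.volume = np.load(fragment_dir / "volume.npy", mmap_mode="r")  # (depth, H, W)
        self.labels = np.load(fragment_dir / "labels.npy", mmap_mode="r")  # (H, W)
        tiles = json.loads((fragment_dir / "tiles.json").read_text())
        # Keep only tiles assigned to the requested folds (None = all tiles)
        self.tiles = [t for t in tiles if folds is None or t["fold"] in folds]

    def __len__(self):
        return len(self.tiles)

    def __getitem__(self, idx):
        t = self.tiles[idx]
        y, x, ts = t["y"], t["x"], self.tile_size
        # Copy the slice out of the memmap and scale 16-bit intensities to [0, 1]
        image = np.ascontiguousarray(self.volume[:, y:y + ts, x:x + ts], dtype=np.float32) / 65535.0
        label = np.ascontiguousarray(self.labels[y:y + ts, x:x + ts], dtype=np.float32)
        return torch.from_numpy(image), torch.from_numpy(label)
```

The point of this layout is that indexing into a memory-mapped volume avoids one-file-per-tile storage entirely (see the note about file count at the end of this comment).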

The dataset generated by the script should be reusable, so we can load it in many places and use it for cross-validation while tuning the models.
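
For example, cross-validation could then look roughly like this, reusing the hypothetical VesuviusTileDataset from the sketch above and assuming 5 folds and three preprocessed fragments (the paths are made up):

```python
from torch.utils.data import ConcatDataset, DataLoader

fragments = ["data/processed/1", "data/processed/2", "data/processed/3"]  # hypothetical paths
val_fold, n_folds = 0, 5

# Train on every fold except the held-out one, validate on the held-out fold
train_ds = ConcatDataset(
    [VesuviusTileDataset(f, folds=[k for k in range(n_folds) if k != val_fold]) for f in fragments]
)
val_ds = ConcatDataset([VesuviusTileDataset(f, folds=[val_fold]) for f in fragments])

train_loader = DataLoader(train_ds, batch_size=16, shuffle=True, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=16, shuffle=False)
```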

You can do this task gradually: first, it's fine to implement a version without the mask processing logic. The main challenge of this task is choosing a file format for storing all the samples that balances the number of files against data loading time. For instance, if you store every single tile as a separate file, then in the corner case of 1x1 tiles you end up with about 8000x6000 files, which is a significant overhead.
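
One possible way to avoid the one-file-per-tile explosion is to store each fragment as a single .npy volume plus a small JSON index of valid tile coordinates, so the file count stays constant no matter the tile size. A rough sketch under those assumptions — the input layout with surface_volume/*.tif, mask.png, and inklabels.png is what the Kaggle competition provides; the function name, output file names, and the naive round-robin fold assignment are made up for illustration:

```python
import json
from pathlib import Path

import numpy as np
from PIL import Image


def preprocess_fragment(fragment_dir, out_dir, tile_size=256, n_folds=5):
    fragment_dir, out_dir = Path(fragment_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Stack the per-slice TIFFs into one (depth, H, W) volume and save it once
    # (loads the whole fragment into RAM, which is fine for a sketch)
    slice_paths = sorted((fragment_dir / "surface_volume").glob("*.tif"))
    volume = np.stack([np.array(Image.open(p)) for p in slice_paths])
    np.save(out_dir / "volume.npy", volume)

    mask = np.array(Image.open(fragment_dir / "mask.png")) > 0
    labels = np.array(Image.open(fragment_dir / "inklabels.png")) > 0
    np.save(out_dir / "labels.npy", labels.astype(np.uint8))

    # Index only tiles that overlap the fragment mask; assign folds round-robin
    tiles = []
    for y in range(0, mask.shape[0] - tile_size + 1, tile_size):
        for x in range(0, mask.shape[1] - tile_size + 1, tile_size):
            if mask[y:y + tile_size, x:x + tile_size].any():
                tiles.append({"y": y, "x": x, "fold": len(tiles) % n_folds})
    (out_dir / "tiles.json").write_text(json.dumps(tiles))
```

With this layout there are only three files per fragment regardless of tile size; shrinking the tiles only grows the index, not the number of files on disk.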