LMZimmer / Auto-PyTorch_refactor


FIX_45 #52

Closed. franchuterivera closed this 3 years ago

franchuterivera commented 3 years ago

@bastiscode I have made some changes to the dataset to make it work with inference.

A Subset is not a transformable dataset, i.e. one that can nicely incorporate train/test transformations, so I made some changes (as minimal as I could). Could you take a look?

Also, could you let me know how one can create an image dataset in:

examples/example_image_classification.py

Currently, the data there is a (60000, 28, 28) set of images with their 60000 separate targets. But if I read this correctly, https://github.com/LMZimmer/Auto-PyTorch_refactor/blob/ac7a9ce35e87a428caca2ac108b362a54d3b8f3a/autoPyTorch/datasets/image_dataset.py#L53 expects something like

zip(images, targets)

which contradicts what the base dataset's __getitem__ expects. Do you have a suggestion on how to create the dataset there?
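To illustrate the mismatch concretely, here is a minimal sketch (with placeholder arrays, not code from the repo) of the two layouts in question:

```python
import numpy as np

# Placeholder data with the shapes mentioned above
images = np.zeros((60000, 28, 28), dtype=np.float32)   # one array of images
targets = np.zeros(60000, dtype=np.int64)               # a separate array of targets

# Layout 1: what the example currently provides, images and targets side by side
data_as_tuple = (images, targets)

# Layout 2: what image_dataset.py seems to expect, an iterable of (image, target) pairs
data_as_pairs = list(zip(images, targets))
```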

bastiscode commented 3 years ago

@franchuterivera See my PR #53 for more details on my current implementation of the datasets in the data_dev branch. I will base my answer on what should work once this PR is merged.

So for the image classification example, specifying a tuple of numpy arrays (images, labels), with the shape of images being [60000, 1, 28, 28], should work. Before the mean and std are calculated, the new _BaseImageDataset (https://github.com/LMZimmer/Auto-PyTorch_refactor/pull/53/files#diff-ef12f1ad4806da1ac439fe4cc40a33170c3182d6eb1c0d2cb317b468e6ed5435) (and also the old image dataset) puts the given inputs into an internal torch.utils.data.Dataset, unless you already pass in a torch.utils.data.Dataset. So the code torch.var_mean(train[i][0]) already acts on this internal dataset and should work as expected (I also tested this locally on the data_dev branch).
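A rough sketch of that behaviour (illustrative only, not the actual _BaseImageDataset code): the (images, labels) tuple is wrapped in an internal torch dataset, and torch.var_mean then works on the indexed samples.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

images = np.zeros((60000, 1, 28, 28), dtype=np.float32)
labels = np.zeros(60000, dtype=np.int64)

# The tuple of numpy arrays is turned into an indexable internal torch dataset
train = TensorDataset(torch.from_numpy(images), torch.from_numpy(labels))

# train[i] yields an (image, label) pair, so this acts on the internal dataset
var, mean = torch.var_mean(train[0][0])
```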

And regarding the transforms, I don't think we should put torchvision transforms as arguments to the base dataset, because this dataset is also the base for other datasets. If anywhere, I think we should put them into the image datasets where they are used. I don't know if we should have transforms as part of the datasets in general; is this common practice, or would it be better to keep them separated from the dataset?

franchuterivera commented 3 years ago

> @franchuterivera See my PR #53 for more details on my current implementation of the datasets in the data_dev branch. I will base my answer on what should work once this PR is merged.

Ok cool, then once this is merged we can enable the image dataset.

> And regarding the transforms, I don't think we should put torchvision transforms as arguments to the base dataset, because this dataset is also the base for other datasets. If anywhere, I think we should put them into the image datasets where they are used. I don't know if we should have transforms as part of the datasets in general; is this common practice, or would it be better to keep them separated from the dataset?

torchvision is just a vessel for the need to transform batched data (iterating over the transformations and applying them to the indexed data). A transformation is just a callable that modifies the data (not only for augmentation but also for preprocessing). If the dataset fits in memory, I completely agree that preprocessing should happen early in the game. Nevertheless, if the dataset is huge we will have to load it batch by batch, and that is where the need for a batch transformation arises.
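As a minimal illustration of that point (the names here are made up, not from the PR), a transform is just a callable, and several of them can be chained and applied when a batch is actually loaded:

```python
import torch

def normalize(x, mean=0.1307, std=0.3081):
    return (x - mean) / std

def add_noise(x, scale=0.01):
    return x + scale * torch.randn_like(x)

transforms = [normalize, add_noise]  # any callables: torchvision transforms would fit here too

def apply_transforms(x):
    for t in transforms:
        x = t(x)
    return x

batch = torch.rand(32, 1, 28, 28)
batch = apply_transforms(batch)  # applied per batch, instead of preprocessing everything up front
```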

So where should this transformation live? In the general case the dataset needs transformations (augmentation for image/time series data), and there is some work on tabular augmentation that helps reduce overfitting (even AutoGluon uses this!). So it made sense to have it there, as it is just another callable.

So the need for transformations is there; where to apply them is something we can decide. How other tools do this:

Do you have another suggestion, or do you see a hard blocker with this approach of doing the batched transformations in the dataset?

bastiscode commented 3 years ago

I totally agree on the need for transformations. I usually do this kind of thing with a custom collate_fn in the PyTorch DataLoader. Would it also be a possibility to do it there?
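For reference, a minimal sketch of what that could look like (illustrative only, not code from this repo): a small factory builds a collate_fn that applies the transform batch-wise inside the DataLoader.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.rand(60000, 1, 28, 28),
                        torch.zeros(60000, dtype=torch.long))

def make_collate_fn(transform):
    # factory: returns a collate_fn with the transform baked in
    def collate_fn(samples):
        images = torch.stack([x for x, _ in samples])
        labels = torch.stack([y for _, y in samples])
        return transform(images), labels
    return collate_fn

loader = DataLoader(dataset, batch_size=32,
                    collate_fn=make_collate_fn(lambda x: (x - 0.5) / 0.5))
images, labels = next(iter(loader))  # the batch arrives already transformed
```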

franchuterivera commented 3 years ago

Anything is possible :)

I would just like to stick to a single place for doing transformations (torchvision included): either a transform in the dataset (through __getitem__, applied to each unit of data) or a transform in the dataloader (through a factory of collate functions, since we have to yield a function that by itself instantiates the transformation).

I do not like the factory approach because factories are hard to debug with multiprocessing, which is why I was inclined towards the first approach. But maybe @LMZimmer has experience / knows where it is more intuitive to perform the transformations?

As I said, both ways are possible and it is just a matter of design choice.

LMZimmer commented 3 years ago

I think the most common way is to have the custom dataset inherit from torch.utils.data.Dataset with a .transforms attribute that is applied when __getitem__() is called. I'd also prefer that one.
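A minimal sketch of that pattern (illustrative, not the actual dataset class from the PR):

```python
import torch
from torch.utils.data import Dataset

class TransformableDataset(Dataset):
    def __init__(self, images, targets, transforms=None):
        self.images = images          # e.g. a tensor of shape [N, 1, 28, 28]
        self.targets = targets        # e.g. a tensor of shape [N]
        self.transforms = transforms  # a callable (or composition of callables), or None

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image, target = self.images[index], self.targets[index]
        if self.transforms is not None:
            image = self.transforms(image)
        return image, target
```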

If we have different use cases for our dataset (as input for dataloader with batch-wise transforms, getting data splits), we could do one of the following:

franchuterivera commented 3 years ago

Ok, super. My PR implements this already.

I think we are all making changes to the datasets, and it would be good if they could be centralized by @bastiscode?

I do not know if the direction should be to first merge #53 and then #52?

LMZimmer commented 3 years ago

Given that this PR is ready and the changes with the transforms seem pretty important, I tend towards merging this now. Also the changes in #53 seem easy to adapt to this PR. Could we just fix the example error in this PR first?

franchuterivera commented 3 years ago

I will make my changes work on top of Ravin's SMAC PR (because he also made the same changes). That will fix this problem as well.