KevinMusgrave / pytorch-adapt

Domain adaptation made easy. Fully featured, modular, and customizable.
https://kevinmusgrave.github.io/pytorch-adapt/
MIT License

Question on DataloaderCreator - How to create test sets #55

Closed. just-eoghan closed this issue 2 years ago.

just-eoghan commented 2 years ago

Hello,

Well done on putting together this library. I think it will be extremely useful for many people undertaking domain adaptation projects.

I am wondering how to create a test dataset using the DataloaderCreator class.

Some background on my issue.

I am using the MNISTM example within a PyTorch Lightning DataModule.

Adapting the code from examples/DANNLightning.ipynb, I have the following code.

import os
from typing import Optional

from pytorch_lightning import LightningDataModule
from torch.utils.data import Dataset

from pytorch_adapt.datasets import DataloaderCreator, get_mnist_mnistm
from pytorch_adapt.frameworks.utils import filter_datasets
from pytorch_adapt.validators import IMValidator

class MnistAdaptDataModule(LightningDataModule):
    def __init__(
        self,
        data_dir: str = "data/mnistm/",
        batch_size: int = 4,
        num_workers: int = 0,
        pin_memory: bool = False,
    ):
        super().__init__()

        # this line allows access to init params via the 'self.hparams' attribute
        # it also ensures init params will be stored in the ckpt
        self.save_hyperparameters(logger=False)

        self.data_train: Optional[Dataset] = None
        self.data_val: Optional[Dataset] = None
        self.data_test: Optional[Dataset] = None
        self.dataloaders = None

    def prepare_data(self):
        if not os.path.exists(self.hparams.data_dir):
            print("downloading dataset")
            get_mnist_mnistm(["mnist"], ["mnistm"], folder=self.hparams.data_dir, download=True)
        return

    def setup(self, stage: Optional[str] = None):
        if not self.data_train and not self.data_val and not self.data_test:
            datasets = get_mnist_mnistm(["mnist"], ["mnistm"], folder=self.hparams.data_dir, download=False)
            dc = DataloaderCreator(batch_size=self.hparams.batch_size, num_workers=self.hparams.num_workers)
            validator = IMValidator()
            self.dataloaders = dc(**filter_datasets(datasets, validator))
            self.data_train = self.dataloaders.pop("train")
            self.data_val = list(self.dataloaders.values())

    def train_dataloader(self):
        return self.data_train

    def val_dataloader(self):
        return self.data_val

    def test_dataloader(self):
        # how to make a test dataset?
        return

self.dataloaders contains the following object:

{'src_train': SourceDataset(
  domain=0
  (dataset): ConcatDataset(
    len=60000
    (datasets): [Dataset MNIST
        Number of datapoints: 60000
        Root location: /home/eoghan/Code/mnist-domain-adaptation/data/mnist_adapt/
        Split: Train
        StandardTransform
    Transform: Compose(
                   Resize(size=32, interpolation=bilinear, max_size=None, antialias=None)
                   ToTensor()
                   <pytorch_adapt.utils.transforms.GrayscaleToRGB object at 0x7fd1badcbdc0>
                   Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
               )]
  )
), 'src_val': SourceDataset(
  domain=0
  (dataset): ConcatDataset(
    len=10000
    (datasets): [Dataset MNIST
        Number of datapoints: 10000
        Root location: /home/eoghan/Code/mnist-domain-adaptation/data/mnist_adapt/
        Split: Test
        StandardTransform
    Transform: Compose(
                   Resize(size=32, interpolation=bilinear, max_size=None, antialias=None)
                   ToTensor()
                   <pytorch_adapt.utils.transforms.GrayscaleToRGB object at 0x7fd1badcb6a0>
                   Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
               )]
  )
), 'target_train': TargetDataset(
  domain=1
  (dataset): ConcatDataset(
    len=59001
    (datasets): [MNISTM(
      domain=MNISTM
      len=59001
      (transform): Compose(
          ToTensor()
          Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      )
    )]
  )
), 'target_val': TargetDataset(
  domain=1
  (dataset): ConcatDataset(
    len=9001
    (datasets): [MNISTM(
      domain=MNISTM
      len=9001
      (transform): Compose(
          ToTensor()
          Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      )
    )]
  )
), 'train': CombinedSourceAndTargetDataset(
  (source_dataset): SourceDataset(
    domain=0
    (dataset): ConcatDataset(
      len=60000
      (datasets): [Dataset MNIST
          Number of datapoints: 60000
          Root location: /home/eoghan/Code/mnist-domain-adaptation/data/mnist_adapt/
          Split: Train
          StandardTransform
      Transform: Compose(
                     Resize(size=32, interpolation=bilinear, max_size=None, antialias=None)
                     ToTensor()
                     <pytorch_adapt.utils.transforms.GrayscaleToRGB object at 0x7fd125f69d60>
                     Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
                 )]
    )
  )
  (target_dataset): TargetDataset(
    domain=1
    (dataset): ConcatDataset(
      len=59001
      (datasets): [MNISTM(
        domain=MNISTM
        len=59001
        (transform): Compose(
            ToTensor()
            Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        )
      )]
    )
  )
)}

This handles train and val for source and target, as well as creating a combined train dataset.

Going by the example notebook, the combined dataset under the "train" key (source plus target) is used as the training dataset for the model.

The validation set is a list of the remaining values in the dataloaders dict and has the following form.

[
<torch.utils.data.dataloader.DataLoader object at 0x7fd1063e6b80> {
    dataset: TargetDataset(
  domain=1
  (dataset): ConcatDataset(
    len=59001
    (datasets): [MNISTM(
      domain=MNISTM
      len=59001
      (transform): Compose(
          ToTensor()
          Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      )
    )]
  )
)
}
]

I am not sure why this is the validation dataset. Do we validate on only the target domain? How would we handle this validation set if the target domain is unlabelled? If you could explain why this is the case, I would appreciate the insight.

In summary, I am looking for guidance on how to use something like torch.utils.data.random_split to take some of the source and target data and have the DataloaderCreator pass back test sets along with train and val. Is this possible within the framework?

Many thanks, Eoghan

KevinMusgrave commented 2 years ago

I am not sure why this is the validation dataset. Do we validate on only the target domain? How would we handle this validation set if the target domain is unlabelled? If you could explain why this is the case, I would appreciate the insight.

This is an unsolved problem in unsupervised domain adaptation. We want high accuracy on the unlabeled target domain, but since it is unlabeled, it is difficult to determine the model's performance.

Whether or not we validate only on the target domain depends on the type of validator. The IMValidator uses only the target domain to compute a validation score, which is why the validation datasets returned by filter_datasets consist of only the target domain:

validator = IMValidator() # uses only the target domain to compute a validation score
self.dataloaders = dc(**filter_datasets(datasets, validator))
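
I believe filter_datasets simply keeps the splits that the validator declares it needs (plus the "train" split used for the actual training loop). You can check what a validator asks for; a quick sketch, assuming I'm remembering the required_data attribute correctly:

from pytorch_adapt.validators import IMValidator

validator = IMValidator()
# IMValidator scores only unlabeled target data, so this should print
# something like ['target_train']
print(validator.required_data)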

You could use a validator that adds source val accuracy plus the IM score:

from pytorch_adapt.validators import MultipleValidators, AccuracyValidator, IMValidator
validator = MultipleValidators([AccuracyValidator(), IMValidator()])
self.dataloaders = dc(**filter_datasets(datasets, validator))

Now the validation dataloaders should consist of the src_val set and the target_train set. Note that the target_train set is used for validation because it is assumed that the target_val set is reserved for testing. (This is a bit confusing, but it's the most realistic setting in my opinion. I can expand on this if you want.)

You can make the IMValidator use target_val instead of target_train like this:

validator = IMValidator(key_map={"target_val": "target_train"})

In summary, I am looking for guidance on how to use something like torch.utils.data.random_split to take some of the source and target data and have the DataloaderCreator pass back test sets along with train and val. Is this possible within the framework?

You can split the datasets however you want, as long as the DataloaderCreator recognizes the names of the splits you pass in:

dc = DataloaderCreator(val_names=["src_val", "target_val", "src_test", "target_test"])
dataloaders = dc(src_val=dataset1, target_val=dataset2, src_test=dataset3, target_test=dataset4)
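
And to connect this back to your random_split question, here is a rough sketch of carving test sets out of the existing val splits and feeding everything back through DataloaderCreator. The 80/20 ratio and the seed are arbitrary choices on my part, and datasets is the dict returned by get_mnist_mnistm as in your setup():

import torch
from torch.utils.data import random_split

from pytorch_adapt.datasets import DataloaderCreator

g = torch.Generator().manual_seed(42)

# random_split returns plain torch Subset objects wrapping the
# SourceDataset/TargetDataset entries.
n_src = len(datasets["src_val"])
n_src_val = int(0.8 * n_src)
src_val, src_test = random_split(
    datasets["src_val"], [n_src_val, n_src - n_src_val], generator=g
)

n_tgt = len(datasets["target_val"])
n_tgt_val = int(0.8 * n_tgt)
target_val, target_test = random_split(
    datasets["target_val"], [n_tgt_val, n_tgt - n_tgt_val], generator=g
)

dc = DataloaderCreator(val_names=["src_val", "target_val", "src_test", "target_test"])
dataloaders = dc(
    train=datasets["train"],
    src_val=src_val,
    target_val=target_val,
    src_test=src_test,
    target_test=target_test,
)

Your test_dataloader() could then return [dataloaders["src_test"], dataloaders["target_test"]].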

Let me know if you have more questions!

just-eoghan commented 2 years ago

Hi Kevin,

Thanks for your answer; it clears everything up nicely.

I think some of those examples would be a great addition to the docs!

It would be good to get a little more info on this

it is assumed that the target_val set is reserved for testing

As it's still somewhat confusing to me.

KevinMusgrave commented 2 years ago

I think some of those examples would be a great addition to the docs!

Yeah, the docs badly need updating. It's something I should prioritize.

It would be good to get a little more info on this

it is assumed that the target_val set is reserved for testing

As it's still somewhat confusing to me.

In a real world application, you have unlabeled target data that is not yet split into train/val/test etc. To maximize performance, it's best to train on as much target data as possible. So I would use all the target data for training, and validate on it as well. If I split the data into train/val, then I'm reducing the amount of target data to train on, which will likely reduce performance.

Would this cause overfitting on the target training set? My opinion is no, because:

  1. Unsupervised transfer learning is difficult
  2. The existing validation methods are noisy, meaning that they are not well correlated with target accuracy

After the model is trained, you deploy it in your application. Now your model is fed unseen data, i.e. the test set. So in an academic setting, I train and validate on "target_train", and use "target_val" as the unseen data. It could be called "target_test" instead.
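
In code, that setup looks roughly like this. It's only a sketch: the *_logits and *_labels variables are placeholders for model outputs you collect yourself, and the key_map trick for AccuracyValidator is how I remember the API, so double-check it against the docs:

from pytorch_adapt.validators import AccuracyValidator, IMValidator

# During training: validate on target_train, which needs no labels.
im = IMValidator()
val_score = im(target_train={"logits": target_train_logits})

# After training: score the held-out target_val once, using the labels
# that an academic benchmark like MNISTM provides. AccuracyValidator
# reads src_val by default, so remap target_val onto it.
acc = AccuracyValidator(key_map={"target_val": "src_val"})
test_score = acc(
    target_val={"logits": target_val_logits, "labels": target_val_labels}
)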

just-eoghan commented 2 years ago

That's great; I understand that approach now.

I think what was catching me was that I personally always use "test" as the name for the hold-out set and "validation" for the unseen data used during training to get a feel for performance. But validation and test are interchangeable terms for the hold-out set, so either works.

On the docs: if you are interested, I may be able to help with them. I've just found the library this week, but I've been reading through the code-base to better understand it, so I may be useful enough to assist with the docs eventually!

Thank you for your help.

KevinMusgrave commented 2 years ago

On the docs: if you are interested, I may be able to help with them. I've just found the library this week, but I've been reading through the code-base to better understand it, so I may be useful enough to assist with the docs eventually!

That would be greatly appreciated! I'll try to update the docs this weekend. Right now they're way out of date and also a hassle to edit. I'll let you know when I make the change.

KevinMusgrave commented 2 years ago

@just-eoghan I updated the docs. Re: contributing to the docs, that would be very helpful, but I think example Jupyter notebooks would be even better, since I assume that's where most people look first.