CvNoob / SFDA-DE

A PyTorch implementation of the CVPR 2022 paper Source-Free Domain Adaptation via Distribution Estimation.
Apache License 2.0

Did you use train data to test in dataset Office-31 and other datasets? #1

Open IshiKura-a opened 1 year ago

IshiKura-a commented 1 year ago

In prepare_data, we have the train dataloader:

    target = cfg.DATASET.TARGET_NAME
    dataroot_T = os.path.join(cfg.DATASET.DATAROOT, target)

    with open(os.path.join(cfg.DATASET.DATAROOT, 'category.txt'), 'r') as f:
        classes = f.readlines()
        classes = [c.strip() for c in classes]
    assert(len(classes) == cfg.DATASET.NUM_CLASSES)

    # for clustering
    batch_size = cfg.CLUSTERING.TARGET_BATCH_SIZE
    dataset_type = cfg.CLUSTERING.TARGET_DATASET_TYPE
    print('Building clustering_%s dataloader...' % target)
    dataloaders['clustering_' + target] = CustomDatasetDataLoader(
                dataset_root=dataroot_T, dataset_type=dataset_type,
                batch_size=batch_size, transform=train_transform,
                train=False, num_workers=cfg.NUM_WORKERS,
                classnames=classes)

and the test loader:

    test_domain = cfg.TEST.DOMAIN if cfg.TEST.DOMAIN != "" else target
    dataroot_test = os.path.join(cfg.DATASET.DATAROOT, test_domain)
    dataloaders['test'] = CustomDatasetDataLoader(
                    dataset_root=dataroot_test, dataset_type=dataset_type,
                    batch_size=batch_size, transform=test_transform,
                    train=False, num_workers=cfg.NUM_WORKERS,
                    classnames=classes)

In clustering, this work uses dataloaders['clustering_' + target] directly. Therefore, if cfg.TEST.DOMAIN is unset, these two dataloaders are built from the same dataset, which means the test is run on the adaptation (training) data. Is this a bug, or have I made a mistake?
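
For reference, a quick way to confirm the overlap (just a sketch; it assumes the CustomDatasetDataLoader wrappers expose their underlying dataset via .dataset, as used later in this thread, and that the dataset keeps its file list in data_paths, as in single_dataset.py):

    clustering_paths = set(dataloaders['clustering_' + target].dataset.data_paths)
    test_paths = set(dataloaders['test'].dataset.data_paths)
    shared = len(clustering_paths & test_paths)
    # if cfg.TEST.DOMAIN is unset, both loaders read the same directory,
    # so this should report that every test image is also used for adaptation
    print('%d / %d test images also appear in the adaptation set' % (shared, len(test_paths)))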

IshiKura-a commented 1 year ago

I tried to change the test dataset with some naive changes to the code. In the function prepare_data_Anchor() in prepare_data.py, I simply append this before return dataloaders:

    if test_domain == target:
        print('!!!!!! Change dataset of target domain and test')
        # needs: import torch; from torch.utils.data import DataLoader, Subset
        dataset = dataloaders['clustering_' + target].dataset
        n = len(dataset)
        # random 90/10 split of the target domain into adaptation and test indices
        idx_perm = torch.randperm(n)
        test_idx = idx_perm[:int(0.1 * n)]
        train_idx = idx_perm[int(0.1 * n):]
        train_data = Subset(dataset, train_idx)
        setattr(train_data.dataset, 'l', len(train_idx))
        test_data = Subset(dataloaders['test'].dataset, test_idx)
        setattr(test_data.dataset, 'l', len(test_idx))

        dataloaders['clustering_' + target].dataset = train_data
        loader = dataloaders['clustering_' + target].dataloader
        # reuse the old batch_sampler and num_workers, but wrap the Subset instead
        dataloaders['clustering_' + target].dataloader = DataLoader(train_data, batch_sampler=loader.batch_sampler,
                                                                    num_workers=loader.num_workers)
        dataloaders['test'].dataset = test_data
        loader = dataloaders['test'].dataloader
        dataloaders['test'].dataloader = DataLoader(test_data, batch_sampler=loader.batch_sampler,
                                                    num_workers=loader.num_workers)

And in single_dataset.py I rewrote __len__() of BaseDataset and BaseDatasetWithoutLabel (otherwise there would be an IndexError, since the reused batch_sampler still samples indices according to the base dataset's original length):

    def __len__(self):
        # use the length set by the split patch above if present,
        # otherwise fall back to the full file list
        try:
            return getattr(self, 'l')
        except AttributeError:
            return len(self.data_paths)
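
For what it's worth, the __len__ patch can probably be avoided altogether by building fresh DataLoaders directly from the Subset objects, whose own __len__ already reflects the split, instead of reusing the old batch_sampler. This is again only a sketch, reusing train_idx, test_idx and batch_size from above; the shuffle settings are a guess and should mirror whatever the original loaders used:

    from torch.utils.data import DataLoader, Subset

    train_data = Subset(dataloaders['clustering_' + target].dataset, train_idx.tolist())
    test_data = Subset(dataloaders['test'].dataset, test_idx.tolist())

    dataloaders['clustering_' + target].dataset = train_data
    dataloaders['clustering_' + target].dataloader = DataLoader(
        train_data, batch_size=batch_size, shuffle=True,   # shuffle is an assumption
        num_workers=cfg.NUM_WORKERS)

    dataloaders['test'].dataset = test_data
    dataloaders['test'].dataloader = DataLoader(
        test_data, batch_size=batch_size, shuffle=False,
        num_workers=cfg.NUM_WORKERS)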

On the Office-31 dataset, I set the loop count to 50 rather than 1000, and got these results:

[screenshot: my results with the 90/10 target split]

And your results would be:

[screenshot: the originally reported results]

There's a 3% performance degradation.