facebookresearch / deepcluster

Deep Clustering for Unsupervised Learning of Visual Features

Shuffling training data #6

Closed ds2268 closed 6 years ago

ds2268 commented 6 years ago

Hi @mathildecaron31

First of all, thank you for making this research code available to the wider community. I have a couple of questions/issues that I want to raise:

1. Computing features

https://github.com/facebookresearch/deepcluster/blob/f5995e954842054d88aa9fcc9ff7ba2db7eafc9e/main.py#L299-L300

For the first batch you only initialize the numpy array and don't save the computed features. After the initialization you should probably add a line that also inserts the computed features for the first batch.
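For context, the accumulation pattern in question can be sketched as below. This is a simplified stand-in for `compute_features` in `main.py`, not the actual code; the function and argument names are illustrative. Note how the `i < len(batches) - 1` branch covers the first batch as well, since the array is allocated and then written to in the same iteration:

```python
import numpy as np

def compute_features_sketch(batches, model_fn, n_samples, feat_dim, batch_size):
    """Hypothetical sketch of a feature-extraction loop in the spirit of
    main.py's compute_features; names and signature are illustrative."""
    features = None
    for i, batch in enumerate(batches):
        out = model_fn(batch)  # (b, feat_dim) numpy array
        if features is None:
            # allocate once, when the feature dimension is first known
            features = np.zeros((n_samples, feat_dim), dtype=np.float32)
        if i < len(batches) - 1:
            # full batch: write a batch_size-sized slice
            features[i * batch_size:(i + 1) * batch_size] = out
        else:
            # last batch may be smaller than batch_size
            features[i * batch_size:] = out
    return features
```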

2. Shuffling training data

I see that you are not shuffling the training data; shuffling generally makes models generalize better, i.e. it reduces overfitting. For ImageNet, given the amount of data, this is probably not very important. I noticed that simply adding shuffle=True to the DataLoader will not suffice, since the `deepcluster.images_lists` indices are simply the ordered indices of the computed features (https://github.com/facebookresearch/deepcluster/blob/f5995e954842054d88aa9fcc9ff7ba2db7eafc9e/clustering.py#L207-L208). When the dataset with the new pseudo-labels is created, these indices are used, and they would then be incorrect because the order of the computed features would no longer correspond to the actual indices in the dataset.

I just wanted to note this down, since one might naively set shuffle=True in the DataLoader.
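To make the alignment assumption concrete, here is a minimal sketch (not the repo's code; the variable names are illustrative) of how positional cluster assignments become dataset indices:

```python
# Hypothetical sketch: pseudo-labels are assigned by position, so feature
# order must match dataset order. cluster_assignments[i] is the cluster of
# the i-th *computed feature*.
cluster_assignments = [0, 1, 0, 1]

images_lists = {}
for feat_idx, c in enumerate(cluster_assignments):
    # feat_idx is later reused as a *dataset* index; this is only valid
    # if feature i was computed from image i (i.e. no shuffling upstream)
    images_lists.setdefault(c, []).append(feat_idx)

# images_lists == {0: [0, 2], 1: [1, 3]}
```

If the feature-extraction DataLoader had shuffled, feature i would not come from image i, and these stored indices would point at the wrong images when the pseudo-labeled dataset is built.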

3. Smaller datasets

Have you tried your method on any smaller datasets, e.g. initializing it with a supervised model trained on ImageNet and then doing unsupervised fine-tuning on a new dataset? Any success with such smaller datasets?

mathildecaron31 commented 6 years ago

Hi,

1. Computing features

For the first batch, the features are saved in the numpy array features here, because 0 < len(dataloader) - 1.

2. Shuffling training data

The data are shuffled by the sampler UnifLabelSampler. Since a sampler is specified in my dataloader train_dataloader, the shuffle flag must be set to False (see the PyTorch documentation for torch.utils.data.DataLoader).
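For readers unfamiliar with this DataLoader constraint, the pattern looks roughly like the sketch below. ShufflingSampler is a minimal illustrative stand-in for UnifLabelSampler, not the repo's implementation; the point is that any custom sampler and shuffle=True are mutually exclusive:

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class ShufflingSampler(Sampler):
    """Minimal stand-in for a custom sampler (e.g. UnifLabelSampler):
    it yields a permuted index order itself, so the DataLoader must be
    created with shuffle=False."""
    def __init__(self, n, seed=0):
        self.n = n
        self.seed = seed

    def __iter__(self):
        g = torch.Generator().manual_seed(self.seed)
        return iter(torch.randperm(self.n, generator=g).tolist())

    def __len__(self):
        return self.n

dataset = TensorDataset(torch.arange(8))
# Passing both sampler=... and shuffle=True would raise a ValueError;
# shuffling responsibility belongs entirely to the sampler here.
loader = DataLoader(dataset, batch_size=4,
                    sampler=ShufflingSampler(len(dataset)),
                    shuffle=False)
```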

3. Smaller datasets

No, we haven't experimented with this setting.

Thank you for your interest :)