facebookresearch / deepcluster

Deep Clustering for Unsupervised Learning of Visual Features

Shuffling training data #6

Closed ds2268 closed 6 years ago

ds2268 commented 6 years ago

Hi @mathildecaron31

First of all, thank you for making this research code available to the wider community. I have a couple of questions/issues that I want to raise:

1. Computing features

https://github.com/facebookresearch/deepcluster/blob/f5995e954842054d88aa9fcc9ff7ba2db7eafc9e/main.py#L299-L300

For the first batch you only initialize the numpy array and don't save the computed features. After the initialization you should probably add a line that also inserts the computed features for the first batch.
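For context, the accumulation pattern in question can be sketched as below. This is a simplified stand-in for `compute_features` in `main.py`, not the actual code; the function and argument names are illustrative. Note how the `i < len(batches) - 1` branch covers the first batch as well, since the array is allocated and then written to in the same iteration:

```python
import numpy as np

def compute_features_sketch(batches, model_fn, n_samples, feat_dim, batch_size):
    """Hypothetical sketch of a feature-extraction loop in the spirit of
    main.py's compute_features; names and signature are illustrative."""
    features = None
    for i, batch in enumerate(batches):
        out = model_fn(batch)  # (b, feat_dim) numpy array
        if features is None:
            # allocate once, when the feature dimension is first known
            features = np.zeros((n_samples, feat_dim), dtype=np.float32)
        if i < len(batches) - 1:
            # full batch: write a batch_size-sized slice
            features[i * batch_size:(i + 1) * batch_size] = out
        else:
            # last batch may be smaller than batch_size
            features[i * batch_size:] = out
    return features
```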

2. Shuffling training data

I see that you are not shuffling the training data; shuffling generally makes models generalize better, i.e. it reduces overfitting. For ImageNet, given the amount of data, this is probably not very important. I noticed that simply adding shuffle=True to the DataLoader will not suffice, since the `deepcluster.images_lists` indices are simply the ordered indices of the computed features (https://github.com/facebookresearch/deepcluster/blob/f5995e954842054d88aa9fcc9ff7ba2db7eafc9e/clustering.py#L207-L208). When the dataset with the new pseudo-labels is created, these indices are used, and they would then be incorrect because the order of the computed features would no longer correspond to the actual indices in the dataset.

I just wanted to note this down, since one might naively set shuffle=True in the DataLoader.
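To make the alignment assumption concrete, here is a minimal sketch (not the repo's code; the variable names are illustrative) of how positional cluster assignments become dataset indices:

```python
# Hypothetical sketch: pseudo-labels are assigned by position, so feature
# order must match dataset order. cluster_assignments[i] is the cluster of
# the i-th *computed feature*.
cluster_assignments = [0, 1, 0, 1]

images_lists = {}
for feat_idx, c in enumerate(cluster_assignments):
    # feat_idx is later reused as a *dataset* index; this is only valid
    # if feature i was computed from image i (i.e. no shuffling upstream)
    images_lists.setdefault(c, []).append(feat_idx)

# images_lists == {0: [0, 2], 1: [1, 3]}
```

If the feature-extraction DataLoader had shuffled, feature i would not come from image i, and these stored indices would point at the wrong images when the pseudo-labeled dataset is built.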

3. Smaller datasets

Have you tried your method on any smaller datasets, e.g. initializing it with a supervised model trained on ImageNet and then doing unsupervised fine-tuning on a new dataset? Any success with such smaller datasets?

mathildecaron31 commented 6 years ago

Hi,

1. Computing features

For the first batch, the features are saved in the numpy array features here, because 0 < len(dataloader) - 1.

2. Shuffling training data

The data are shuffled by the sampler UnifLabelSampler. Since a sampler is specified in my dataloader train_dataloader, the shuffle flag must be set to False (see the PyTorch documentation for torch.utils.data.DataLoader).
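For readers unfamiliar with this DataLoader constraint, the pattern looks roughly like the sketch below. ShufflingSampler is a minimal illustrative stand-in for UnifLabelSampler, not the repo's implementation; the point is that any custom sampler and shuffle=True are mutually exclusive:

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class ShufflingSampler(Sampler):
    """Minimal stand-in for a custom sampler (e.g. UnifLabelSampler):
    it yields a permuted index order itself, so the DataLoader must be
    created with shuffle=False."""
    def __init__(self, n, seed=0):
        self.n = n
        self.seed = seed

    def __iter__(self):
        g = torch.Generator().manual_seed(self.seed)
        return iter(torch.randperm(self.n, generator=g).tolist())

    def __len__(self):
        return self.n

dataset = TensorDataset(torch.arange(8))
# Passing both sampler=... and shuffle=True would raise a ValueError;
# shuffling responsibility belongs entirely to the sampler here.
loader = DataLoader(dataset, batch_size=4,
                    sampler=ShufflingSampler(len(dataset)),
                    shuffle=False)
```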

3. Smaller datasets

No, we haven't experimented with this setting.

Thank you for your interest :)