Hi,
1. Computing features
For the first batch, the features are saved in the numpy array features here, because 0 < len(dataloader) - 1.
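To see why, here is a minimal sketch of that loop (paraphrased with simplified names such as batch_size, not the exact repository code): the allocation and the save are two separate if statements, so the first batch falls through to the save as well.

```python
import numpy as np

def compute_features(dataloader, model, N, batch_size):
    # Sketch of the feature-extraction loop (paraphrased, not the exact repo code).
    for i, (input_tensor, _) in enumerate(dataloader):
        # Forward pass; detach to drop the autograd graph before moving to numpy.
        aux = model(input_tensor).detach().cpu().numpy().astype('float32')

        if i == 0:
            # Allocate the output array once the feature dimension is known.
            features = np.zeros((N, aux.shape[1]), dtype='float32')

        # A separate `if`, not an `else`: the first batch (i == 0) also
        # reaches it, since 0 < len(dataloader) - 1 whenever the loader
        # yields more than one batch.
        if i < len(dataloader) - 1:
            features[i * batch_size: (i + 1) * batch_size] = aux
        else:
            # The final batch may be smaller than batch_size.
            features[i * batch_size:] = aux
    return features
```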
2. Shuffling training data
The data are shuffled by the sampler UnifLabelSampler. As a sampler is specified in my dataloader train_dataloader, the shuffle flag must be set to False (see the PyTorch documentation of torch.utils.data.DataLoader).
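As a minimal illustration of that constraint (toy stand-ins, not the repository's actual dataset or sampler):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# Toy stand-ins: in the repository the dataset holds pseudolabeled images
# and the sampler is UnifLabelSampler.
train_dataset = TensorDataset(torch.randn(8, 3), torch.zeros(8, dtype=torch.long))
sampler = SubsetRandomSampler(range(len(train_dataset)))

# When a sampler is given, shuffle must stay False (its default value);
# passing shuffle=True together with a sampler raises a ValueError.
train_dataloader = DataLoader(train_dataset, batch_size=4,
                              sampler=sampler, shuffle=False)
```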
3. Smaller datasets
No, we haven't experimented with this setting.
Thank you for your interest :)
Hi @mathildecaron31
First of all, thank you for making this research code available to the wider community. I have a couple of questions/issues that I want to address:
1. Computing features
https://github.com/facebookresearch/deepcluster/blob/f5995e954842054d88aa9fcc9ff7ba2db7eafc9e/main.py#L299-L300
For the first batch, you only initialize the numpy array and do not save the computed features. After the initialization, you should probably add a line that also inserts the computed features for the first batch.
2. Shuffling training data
I see that you are not shuffling the training data, which can make models generalize better, i.e. overfit less. For ImageNet, given the amount of data, this is probably not so important. I noticed that simply adding shuffle=True to the DataLoader will not suffice, as the deepcluster.images_lists indices are simply the ordered indices of the computed features (https://github.com/facebookresearch/deepcluster/blob/f5995e954842054d88aa9fcc9ff7ba2db7eafc9e/clustering.py#L207-L208). When the dataset with the new pseudolabels is created, these indices are reused, and they are no longer correct once the order of the computed features does not correspond to the actual indices in the dataset.
I just wanted to note this down, as one might naively put shuffle=True in the DataLoader; the toy sketch below shows the coupling.
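For concreteness, a toy reproduction of the indexing step at those lines (toy values; variable names assumed for illustration):

```python
# Toy reproduction of the index bookkeeping in clustering.py.
nmb_clusters = 3
assignments = [2, 0, 0, 1, 2]  # cluster id assigned to feature i (made up)

images_lists = [[] for _ in range(nmb_clusters)]
for i in range(len(assignments)):
    # `i` is the row of the feature matrix and is later reused as a
    # *dataset* index when building the pseudolabeled training set, so
    # feature i must come from dataset item i (i.e. no shuffling upstream).
    images_lists[assignments[i]].append(i)

print(images_lists)  # [[1, 2], [3], [0, 4]]
```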
3. Smaller datasets
Have you tried your method on any smaller datasets, e.g. initializing it with a supervised model trained on ImageNet and then doing unsupervised fine-tuning on a new dataset? Any success with such smaller datasets?