facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Are Labels Required in Self-Supervised Pretraining? #488

Open Cthollyz opened 2 years ago

Cthollyz commented 2 years ago

Edit: I found a closed issue (https://github.com/facebookresearch/vissl/issues/452) about a similar problem, and the ClusterFit run error did not occur again. However, I still have some doubts about the data folder, so I edited the title.

Questions

1. I originally thought that I didn't need labelled data for pretraining tasks. However, on this page of the VISSL documentation (https://vissl.readthedocs.io/en/v0.1.5/getting_started.html#running-simclr-pre-training-on-1-gpu), the ImageNet-1K dataset is assumed to look like this:

train/
    <n0......>/
        <im-1-name>.JPEG
        ...
        <im-N-name>.JPEG
        ...
    <n1......>/
        <im-1-name>.JPEG
        ...
        <im-M-name>.JPEG
        ...
    ...
val/
    <n0......>/
        <im-1-name>.JPEG
        ...
        <im-N-name>.JPEG
        ...
    <n1......>/
        <im-1-name>.JPEG
        ...
        <im-M-name>.JPEG
        ...
    ...

The tutorial works fine when I use the ImageNet-1K dataset. The problem is that the data I plan to use has a large portion of unlabelled images. I understand that I don't need to pass the labels to the pretraining commands, but I would still need the labels (which I won't have access to) to put every image in the correct folder (class). Is there a way to avoid this?

I tried to pretrain SimCLR with a data folder that looks like this:

{Wall_Data}
train/
    pretrain/
        <im-1-name>.JPEG
        ...
        <im-N-name>.JPEG
        ...
        <im-1-name>.JPEG
        ...
        <im-M-name>.JPEG
        ...
    ...
val/
    <n0......>/
        <im-1-name>.JPEG
        ...
        <im-N-name>.JPEG
        ...
    <n1......>/
        <im-1-name>.JPEG
        ...
        <im-M-name>.JPEG

I put all the unlabelled data in the "pretrain" folder. The training works; however, I am not sure how arranging the data in this format affects the result.

Also, the checkpoint.torch file is always broken when I pretrain with SimCLR, but I can still obtain the model_phaseXX.torch file for each epoch. Is this problematic?

  2. The test set comes from the val folder, and the validation set comes from the train folder. Is this correct?

  3. I extracted features using ClusterFit following the instructions in this closed issue (https://github.com/facebookresearch/vissl/issues/452); the resulting outputs are NumPy files. Am I done training ClusterFit, or are further steps required? I don't really understand how to make use of these files.

QuentinDuval commented 2 years ago

Hi @Cthollyz,

First of all, thank you for using the repository, and thank you even more for raising the issue!

Let me address each of your points one by one.

For point 1

Indeed, the disk_folder format follows PyTorch's convention and requires a sub-folder structure. It is clearly not optimal when doing pre-training on unlabelled data. The trick you mentioned of having a "pretrain" folder as the only "class" should work fine, and as far as I understand it should not affect the results in any way (since the labels are not used).

You can also use alternative formats such as "disk_filelist", which requires a ".npy" file holding a list of image paths and, if you have labels, a ".npy" file holding the corresponding labels (in integer or string format).

Here is one example:

    "inaturalist2018_filelist": {
        "train": ["/inaturalist/train_images.npy", "/inaturalist/train_labels.npy"],
        "val": ["/inaturalist/val_images.npy", "/inaturalist/val_labels.npy"]
    },

You can omit the "*_labels.npy" if you do not have such labels.

For point 2

The content of the "dataset_catalog.json" is what matters in defining what split ("train" or "val") is taken from which directory. If the file contains this:

    "my_dataset": {
        "train": ["/path/to/my/dataset/train", "<lbl_path>"],
        "val": ["/path/to/my/dataset/test", "<lbl_path>"]
    },

Then the "val" split will be in fact the "test" folder, and the "train" split the "train" folder. By changing the dataset catalog (or adding multiple entries for multiple splits), you can have different mappings for the same dataset files.

For point 3

The ClusterFit step will produce ".npy" files that are compatible with the "disk_filelist" format I mentioned for point 1. What ClusterFit does is essentially create a new dataset associating images with labels, where the labels are derived from clusters of the extracted features.

Then you can use these files to do supervised training on this new dataset. To do so, create a new entry in the "dataset_catalog.json" like so:

    "cluster_filelist": {
        "train": ["/out_of_cluster_fit/train_images.npy", "/out_of_cluster_fit/train_labels.npy"],
        "val": ["/out_of_cluster_fit/val_images.npy", "/out_of_cluster_fit/val_labels.npy"]
    },

You can then run a pre-training on the dataset named "cluster_filelist".

I hope this helps! If not, don't hesitate to ask :)

Cthollyz commented 2 years ago

I am sorry for my late response. Thank you for your informative reply; it helped me a lot to understand VISSL better.