Unsupervised pre-training on custom data

facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.

https://vissl.ai

MIT License


Unsupervised pre-training on custom data #559

Closed rram12 closed 2 years ago

rram12 commented 2 years ago

❓ Unsupervised pre-training on custom data

From the documentation, the Colabs, and the issues I have read in this repo, I don't understand why we need to provide the dataset in a structured layout with labels, like the following:

data/
├── train/
│   ├── class0/
│   │   └── *.png
│   └── class1/
│       └── *.png
└── val/
    ├── class0/
    │   └── *.png
    └── class1/
        └── *.png

In my case, I have an unlabeled dataset of 1 million images that I cannot realistically label. Can we use it directly, with all the images under a single folder data (or just data/train and data/test)?

QuentinDuval commented 2 years ago

Hi @rram12,

First of all, thank you for using VISSL!

As you noticed, this folder structure is a bit awkward for unlabeled data, but it will still work in your case. You can create the following structure by treating "no label" as equivalent to "one dummy label":

data/
   train/
      0/
         *.png
   val/
      0/
         *.png

You can omit the val folder if you don't have a validation set; there should be no problem with that.
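For anyone starting from a single flat folder of images, the layout above can be produced with a few lines of Python. This is only a minimal sketch, not part of VISSL: the function name `wrap_in_dummy_class`, its parameters, and the list of image extensions are assumptions made for illustration.

```python
from pathlib import Path
import shutil

# Assumed set of image extensions; adjust to match your data.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def wrap_in_dummy_class(flat_dir, data_root, split="train", dummy_label="0"):
    """Copy every image in flat_dir into data_root/<split>/<dummy_label>/.

    Hypothetical helper: it only arranges files on disk so that all
    images sit under a single dummy class folder.
    Returns the number of images copied.
    """
    flat_dir = Path(flat_dir)
    target = Path(data_root) / split / dummy_label
    target.mkdir(parents=True, exist_ok=True)

    copied = 0
    for path in sorted(flat_dir.iterdir()):
        # Skip sub-directories and non-image files.
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            shutil.copy2(path, target / path.name)
            copied += 1
    return copied
```

Calling `wrap_in_dummy_class("my_images", "data")` would fill `data/train/0/` (here `my_images` is a placeholder path). With around a million images, copying doubles the disk usage, so moving the files or creating symbolic links instead of `shutil.copy2` may be a better fit.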

Please tell me if this unblocks you,

Thank you, Quentin

rram12 commented 2 years ago

@QuentinDuval Thank you for your response. That is how I was already running the training with "no label". I was just wondering why people seem to structure their datasets in the "supervised learning" way (e.g. #323) even though they are doing "unsupervised learning". It is all clear now!