facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0
8.74k stars, 751 forks

Train model with data of no labels? #142

Open ambipomyan opened 1 year ago

ambipomyan commented 1 year ago

Hi, I have a question about how to handle the dataset and label.txt when training the model. I am a bit confused by the label settings, since the training is unsupervised. How should I structure the dataset and the contents of label.txt, given that I only have data with no labels?

I assume the labels play no role in unsupervised training, so for now I have put my unlabeled images into separate folders and used the folder names as class names in order to fit the dinov2 input format. As a result, the class names do not actually correspond to the images in the folders. Will this work for training?

Thank you in advance!

dataset

- ROOT
   |-- train
   |   |-- folder0
   |   |    |-- folder0_01.jpeg
   |   |    `-- ...
   |   |-- folder1
   |   |    |-- folder1_01.jpeg
   |   |    `-- ...
   |   `-- ...
   |-- val
   |-- test
   `-- label.txt

label.txt

folder0, folder0
folder1, folder1
...

yyyyyyfs commented 1 year ago

As far as I know, training or fine-tuning this model is not self-supervised. It also needs labels, so if you want to train on your own dataset, you should generate your labels.txt and modify the data_loader to fit your dataset.

ddstone commented 11 months ago

After checking the source code, specifically the training code in ssl_meta_arch.py, I found that the training process uses only the images and not the labels, which is the definition of self-supervised learning. I'm not sure what purpose the dataset labels serve, but I think there is a way to use this project to pretrain on a custom dataset with no labels.

TheoMoutakanni commented 11 months ago

Hello, the training of DINOv2 is fully unsupervised. We don't use labels during training, but we do use them during validation, where we fit a linear layer on top of our frozen features to check that the model is training well. You shouldn't need labels to launch a DINOv2 training run. Our datasets have labels for evaluation, but if you don't have labels you can return something like '0' or 'np.nan'.

Look at this file to see what you need for a minimal dataset: https://github.com/facebookresearch/dinov2/blob/main/dinov2/data/datasets/extended.py

To make a new dataset without labels, you would have to write a class that inherits from ExtendedVisionDataset and implement get_image_data (should return an image as an array), get_target (in your case this should return 0 or nan), and __len__ (should return the length of the dataset).

You can take inspiration from the ImageNet dataset class in https://github.com/facebookresearch/dinov2/blob/main/dinov2/data/datasets/image_net.py

But keep in mind that you don't need to copy all functions in the ImageNet dataset.
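For readers following along, here is a minimal sketch of the kind of subclass described above. It is only an illustration under stated assumptions: `ExtendedVisionDatasetStandIn` is a hypothetical stand-in for the real `ExtendedVisionDataset` so that the snippet runs without the dinov2 package installed, and `UnlabeledFolderDataset` and `IMAGE_SUFFIXES` are names invented for this example. One caveat: the comment above describes `get_image_data` as returning an array, whereas my reading of the upstream extended.py is that it returns the encoded file bytes, which the base class then decodes. The sketch follows the bytes reading, so please check extended.py before relying on either.

```python
from pathlib import Path
from typing import Any, List

# Hypothetical set of extensions treated as images in this sketch.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


class ExtendedVisionDatasetStandIn:
    """Hypothetical stand-in mirroring the three methods named in the thread."""

    def get_image_data(self, index: int) -> bytes:
        raise NotImplementedError

    def get_target(self, index: int) -> Any:
        raise NotImplementedError

    def __len__(self) -> int:
        raise NotImplementedError


class UnlabeledFolderDataset(ExtendedVisionDatasetStandIn):
    """Collects every image under `root`, ignoring folder names and label files."""

    def __init__(self, root: str) -> None:
        # Sort the paths so that indices are stable across runs.
        self.paths: List[Path] = sorted(
            p
            for p in Path(root).rglob("*")
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )

    def get_image_data(self, index: int) -> bytes:
        # Assumption: return the raw encoded file contents (see caveat above).
        return self.paths[index].read_bytes()

    def get_target(self, index: int) -> int:
        # No labels are available, so return a constant dummy target.
        return 0

    def __len__(self) -> int:
        return len(self.paths)
```

In the real repository you would inherit from ExtendedVisionDataset instead of the stand-in and forward the root and transforms to its constructor. Making the training config pick up the new class is a separate step that is not shown here.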

ddstone commented 11 months ago

I appreciate your reply. I've launched training on a dataset without labels by mimicking image_net.py. Thank you!

HDL-YD commented 10 months ago

I've also recently trained a backbone network on my own dataset. Could you share your dataset structure and your modified dataset-loading code?

csaroff commented 8 months ago

I have a fork that trains a dinov2 model with an arbitrary unlabeled dataset using skypilot.

https://github.com/csaroff/dinov2/tree/main/sky

It should be easy enough to edit the config to pull down your dataset instead of mine.

The only caveat is that resuming training from a checkpoint isn't working (re-running training overwrites old checkpoints instead of resuming), so either you can't use spot instances or you'll have to debug that particular issue.
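For anyone who does want to debug the resume problem, a common first step is a helper that looks for the most recent checkpoint before training starts. The snippet below is a generic sketch of that idea and is not taken from the fork or from dinov2. The file naming scheme `checkpoint_<step>.pth` and the function `find_latest_checkpoint` are hypothetical, so adapt the pattern to whatever your training run actually writes.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: checkpoint_<step>.pth
CHECKPOINT_PATTERN = re.compile(r"^checkpoint_(\d+)\.pth$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if absent."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return None

    best_step = -1
    best_path: Optional[Path] = None
    for path in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match is None:
            continue  # Skip files that are not checkpoints.
        step = int(match.group(1))  # Compare numerically, not as strings.
        if step > best_step:
            best_step = step
            best_path = path
    return best_path
```

A training script could call this at startup and load the returned file if there is one, falling back to a fresh start when it returns None.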

shiyongde commented 3 months ago

same problem

ayushnangia commented 2 months ago

Did you figure out how to pretrain starting from the provided model weights? @csaroff @shiyongde

PMRS-lab commented 1 week ago

May I ask how you launched training on a dataset without labels by mimicking image_net.py? Thank you very much!

PMRS-lab commented 1 week ago

May I ask how to launch training on a dataset without labels by mimicking image_net.py? Thank you very much!

csaroff commented 1 week ago

@PMRS-lab I have instructions for launching training in the README.

In my case, I subclassed ExtendedVisionDataset rather than messing with image_net.py directly.

Really hope this helps. Feel free to create an issue if you have any questions about the fork!

PMRS-lab commented 1 week ago

Thank you very much! I replaced your file recursive_image_dataset.py with the original file extended.py and ran the /run/train/train.py file, but still couldn't train. Can you please explain your changes in detail? I really need to fine-tune on my dataset for image feature extraction rather than for recognition or classification tasks. Thank you again.

csaroff commented 1 week ago

> I replaced your file recursive_image_dataset.py with the original file extended.py and ran the /run/train/train.py file, but still couldn't train.

You don't need to replace recursive_image_dataset.py. It's referenced in configs/train/vitl14_ep10.yaml, which in turn is referenced from the skypilot config.

Did you try launching with the skypilot config based on the README instructions?

> Can you please explain your changes in detail?

I subclassed ExtendedVisionDataset based on the instructions above in order to train a model without any labels. I also added a skypilot configuration to run training on A100 nodes.

If you're looking for additional information about my changes that isn't already in the README, you're welcome to create a github issue or start a discussion on my fork. Otherwise we risk creating a lot of noise in this thread 📣😃

Hope this helps!