facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

loss does not converge #143

Open onvungocminh opened 11 months ago

onvungocminh commented 11 months ago

Hi authors, I tried DINO with my dataset of 4,000,000 images of people, but after 30 epochs the loss no longer decreases. Do you have any idea why? Do you think some outlier images (e.g., crops of only part of a person, or overexposed images) could be a problem for DINO to learn from? Thank you in advance.

yyyyyyfs commented 11 months ago

Hi, I am training DINOv2 on my own dataset too! My dataset has 4,000 images, but I ran into the "NaN detected" error. I checked my training log and found that the koleo_loss is always inf. Do you have any ideas or advice? Thanks very much!

EddieAy commented 11 months ago

Yes. My dataset contains 200,000 images of chromosomes. With vit_base, the default config, and 50 epochs, the loss didn't decrease by even 0.1 (the loss at the start is 15.6255 and after 50 epochs it's 15.6254).

EddieAy commented 11 months ago

https://github.com/facebookresearch/dinov2/issues/134#issuecomment-1630981911 By the way, do you guys have this issue?

yyyyyyfs commented 11 months ago

#134 (comment) By the way, do you guys have this issue?

Yes! I have the same issue. When I train my own dataset on 1 node, changing --gpus-per-node doesn't work!

ambipomyan commented 11 months ago

#134 (comment) By the way, do you guys have this issue?

Yes! I have the same issue. When I train my own dataset on 1 node, changing --gpus-per-node doesn't work!

You can use --gres=gpu:4 instead of --gpus-per-node=4

yyyyyyfs commented 11 months ago

--gres=gpu:4

Thanks for your reply! Can you give more details? Where does this parameter come from?

ambipomyan commented 11 months ago

--gres=gpu:4

Thanks for your reply! Can you give more details? Where does this parameter come from?

https://slurm.schedmd.com/gres.html

yyyyyyfs commented 11 months ago

#134 (comment) By the way, do you guys have this issue?

Yes! I have the same issue. When I train my own dataset on 1 node, changing --gpus-per-node doesn't work!

You can use --gres=gpu:4 instead of --gpus-per-node=4

Thanks for your reply again! But in my case I only have a single machine with multiple GPUs, and I cannot use Slurm.

ambipomyan commented 11 months ago

#134 (comment) By the way, do you guys have this issue?

Yes! I have the same issue. When I train my own dataset on 1 node, changing --gpus-per-node doesn't work!

You can use --gres=gpu:4 instead of --gpus-per-node=4

Thanks for your reply again! But in my case I only have a single machine with multiple GPUs, and I cannot use Slurm.

OK, it seems the job gets submitted to Slurm if you run train.py directly... I saw torchrun mentioned in https://github.com/facebookresearch/dinov2/issues/134; hopefully that will work?

yyyyyyfs commented 11 months ago

#134 (comment) By the way, do you guys have this issue?

Yes! I have the same issue. When I train my own dataset on 1 node, changing --gpus-per-node doesn't work!

You can use --gres=gpu:4 instead of --gpus-per-node=4

Thanks for your reply again! But in my case I only have a single machine with multiple GPUs, and I cannot use Slurm.

OK, it seems the job gets submitted to Slurm if you run train.py directly... I saw torchrun mentioned in #134; hopefully that will work?

I see, but for me it doesn't work.

echochoc commented 11 months ago

Hi authors, I tried DINO with my dataset of 4,000,000 images of people, but after 30 epochs the loss no longer decreases. Do you have any idea why? Do you think some outlier images (e.g., crops of only part of a person, or overexposed images) could be a problem for DINO to learn from? Thank you in advance.

I have the same question. During the initial stages of training, the loss converges quickly, but after a few epochs it no longer decreases and may even increase slightly.

[loss plot]

ambipomyan commented 11 months ago

Hi authors, I tried DINO with my dataset of 4,000,000 images of people, but after 30 epochs the loss no longer decreases. Do you have any idea why? Do you think some outlier images (e.g., crops of only part of a person, or overexposed images) could be a problem for DINO to learn from? Thank you in advance.

I also have the same question; in addition, the koleo_loss goes negative. Any suggestions? Thank you!

danphan commented 11 months ago

I also have the same question; in addition, the koleo_loss goes negative. Any suggestions? Thank you!

I don't think negative values of the KoLeo loss indicate that anything is wrong. The KoLeo loss, -ln(d), can be negative or positive, depending on whether the distance d is smaller than 1. Regardless of the value of d, gradient descent will incentivize the model to increase d, which is what we want.
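
To illustrate that sign behavior, here is a toy stand-in (not the repository's exact KoLeoLoss implementation): the term is roughly the negative log of each embedding's distance to its nearest neighbor within the batch.

```python
import torch
import torch.nn.functional as F

# Simplified KoLeo-style term: -log of the nearest-neighbor distance within
# the batch. It is negative whenever that distance exceeds 1 and positive
# when it is below 1; minimizing it pushes embeddings apart either way.
def koleo_like(x, eps=1e-8):
    x = F.normalize(x, p=2, dim=-1)      # unit-norm embeddings
    sims = x @ x.t()
    sims.fill_diagonal_(-2.0)            # exclude self-matches
    nn_idx = sims.argmax(dim=1)          # nearest neighbor by cosine similarity
    d = (x - x[nn_idx]).norm(dim=1)      # Euclidean distance to that neighbor
    return -torch.log(d + eps).mean()

print(koleo_like(torch.randn(16, 8)))    # can come out negative or positive
```

An inf value, as reported earlier in the thread, would instead suggest the nearest-neighbor distance is collapsing to (numerically) zero, which is a different problem from a merely negative value.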

ambipomyan commented 11 months ago

I also have the same question; in addition, the koleo_loss goes negative. Any suggestions? Thank you!

I don't think negative values of the KoLeo loss indicate that anything is wrong. The KoLeo loss, -ln(d), can be negative or positive, depending on whether the distance d is smaller than 1. Regardless of the value of d, gradient descent will incentivize the model to increase d, which is what we want.

Thank you for your reply! That makes sense!

giacomov commented 11 months ago

Same here. I tried decreasing the base learning rate and things improved a bit, but after a few epochs the loss still starts increasing instead of decreasing. I am going to try a few more experiments, but it looks to me like the hyperparameters the authors chose for the "short" training really only apply to ImageNet. Other datasets probably require different hyperparameters.

qasfb commented 11 months ago

So the loss values are not supposed to decrease monotonically in DINOv2 (the EMA teacher is a moving target); the curves plotted above are perfectly valid and look like what we see on our side during training.

However, the knn or linear classification performance should increase over time.
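
For readers wondering why the teacher is a "moving target": the teacher is not trained by gradients, it tracks an exponential moving average of the student's weights. Below is a minimal sketch of that update (simplified; the 0.994 momentum matches the config posted later in this thread, and the actual training code also follows a momentum schedule).

```python
import torch

# EMA ("momentum") teacher update, shown in simplified form. Because the
# teacher keeps moving toward the student, the cross-entropy target changes
# every step, so the total loss need not decrease monotonically.
@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.994):
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
```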

BenSpex commented 11 months ago

@qasfb would you be able to publish some loss curves that you consider perfectly valid?

Did you measure the k-NN and linear classification performance during training to help with hyperparameter tuning? If yes, could you elaborate a little on that?

Thanks a lot

qasfb commented 11 months ago

Sure, here are some curves from an old run that I found; I can't remember what the exact setup and architecture were, so please don't worry about the knn/linear performance values or the loss magnitudes, only the trends:

[screenshot: training curves]

(The momentum and LR do vary smoothly, the step-like effect is from the log parsing.)

Yes knn performance (mainly) and linear (to a lesser extent) are what we used to tune hyperparams.

BenSpex commented 11 months ago

@qasfb that is super helpful.

Just for clarification on the knn/linear part, you had:

  1. A set of test images with known classes, e.g. cats and dogs
  2. You ran these through the current ViT (student and teacher would have the same weights after each epoch/iteration)
  3. The output was your embedding vector × number of patches
  4. This went into k-NN, where X would be the ViT output and y the classes (cats/dogs)
  5. In addition, through a simple linear classification layer
  6. The accuracy metric is what is plotted in the chart above
  7. Hyperparameter tuning was based on how fast and how well the k-NN/linear classification performed

The total loss was not really the metric to optimize, due to the nature of the DINOv2 algorithm.

Is this correct, or did I misunderstand something?

For parameter tuning, ssl_default and the actual ViT config contain a lot of potential nuts and bolts to tune.

Do you have a list, ideally in order of importance, of the hyperparameters you would typically tune?

And are the DINO and iBOT head parameters part of the tuning?

Again thanks a lot

qasfb commented 11 months ago

So knn and linear we evaluate on the ImageNet-1k dataset [train/val]. We evaluate the EMA teacher model only. For knn, we use only the [cls] token and retrieve the nearest neighbors for classification. For linear, we grid over [last cls token, 4 last cls tokens] × [nothing, avgpool of patch tokens] × [grid of learning rates], training many linear classifiers in parallel from a single model inference.

For the rest, I think you are correct. The setup is most sensitive to the learning rate, I would say; we try to stay as high as possible while staying below instability. We did not tweak the dino and ibot head parameters much; we found that >65k prototypes with separate ibot/dino heads worked well. Anecdotally, layer-wise learning rate decay can help in some cases. There is an ablation in the paper that you could check.
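
For anyone who wants a rough stand-in for that k-NN probe on their own labeled validation set, here is a simplified majority-vote sketch (not the repository's own k-NN evaluation code, which is more elaborate):

```python
import torch
import torch.nn.functional as F

# Simple k-NN probe on frozen [CLS] embeddings: classify each validation
# embedding by a majority vote over its k most cosine-similar training
# embeddings, and report top-1 accuracy.
@torch.no_grad()
def knn_accuracy(train_feats, train_labels, val_feats, val_labels, k=20):
    train_feats = F.normalize(train_feats, dim=-1)
    val_feats = F.normalize(val_feats, dim=-1)
    sims = val_feats @ train_feats.t()           # cosine similarity matrix
    nn_idx = sims.topk(k, dim=1).indices         # k nearest training samples
    votes = train_labels[nn_idx]                 # their labels, shape (N_val, k)
    preds = votes.mode(dim=1).values             # majority vote per row
    return (preds == val_labels).float().mean().item()
```

Running this on embeddings from the EMA teacher at successive checkpoints gives the kind of "knn should increase over time" signal described above.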

BenSpex commented 11 months ago

@qasfb great, thanks for the super quick turn around.

That helps clarify a lot. I guess the k-NN part is a lot easier to replicate than the linear part, unless you would be open to including both in the code base at a later stage.

The ablation is a good read, but I could not really find much information in there that would help me adapt this to a different dataset or domain, beyond better understanding the impact of the general design choices, which I would clearly keep. The main takeaway I would carry forward is the rule of thumb: a bigger model for a bigger, more diverse dataset.

Your explanations definitely give me enough confidence and tools to use the model in a different domain, knowing that I essentially just need to watch the learning rate and keep everything else as is, while looking at the k-NN performance to see whether the model is learning and using the other metrics to evaluate the stability of training.

Thanks heaps. If you have a couple more graphs to share, including cases where the training didn't go well, that would be awesome, but this information is already super valuable.

giacomov commented 11 months ago

Thanks! This is indeed useful information. My training curves look similar to the one you posted, although I am seeing that the k-NN performance reaches its maximum when the total loss is at its minimum, and then it more or less stays there. However, I am training the fast version from scratch (the ViT-16 with the fast config), so I might just be hitting the ceiling of what that configuration can do. I will try the larger model.

HollrayChan commented 10 months ago

Hello, I am trying to use the public pedestrian dataset LUPerson (about 2.5 million images) to train a ViT-Base DINOv2 pretrain. Below are my training script, configuration file, and loss curve; I am training on a single machine with 4 GPUs. I found that the total loss stagnates once it drops to about 11. In the follow-up fine-tuning, my own teacher pretrain and torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14') gave similar results, and both were better than supervised learning; using a ViT-Base ImageNet pretrain was even worse. I don't know how to set the lr and batch size for vit_base. Can you give me some suggestions? Looking forward to your reply.

```shell
torchrun \
    --nproc_per_node=4 train.py \
    --config-file config.yaml \
    --output-dir outdir
```

config.yaml:

```yaml
dino:
  head_n_prototypes: 131072
  head_bottleneck_dim: 384
ibot:
  separate_head: true
  head_n_prototypes: 131072
train:
  batch_size_per_gpu: 42
  dataset_path: /storage/LUPerson
  centering: sinkhorn_knopp
student:
  arch: vit_base
  patch_size: 16
  drop_path_rate: 0.4
  ffn_layer: mlp
  block_chunks: 4
teacher:
  momentum_teacher: 0.994
optim:
  epochs: 500
  weight_decay_end: 0.2
  base_lr: 0.4e-04  # learning rate for a batch size of 1024
  warmup_epochs: 80
  layerwise_decay: 1.0
crops:
  global_crops_size:
  - 256
  - 128
  local_crops_size:
  - 128
  - 64
```

HollrayChan commented 10 months ago

@BenSpex Thank you for your reply. I then adjusted the learning rate and used a larger batch size on multiple machines with multiple GPUs. At first the loss drops as quickly as in the result @qasfb provided, but after about iteration 12,500 the loss seems unable to converge, and I encountered "NaN detected", which stopped the training. I would like to know which parameters could cause the loss to surge around iteration 13,000. Below are my training parameters and results.

```yaml
train:
  batch_size_per_gpu: 64
  dataset_path: LUPerson
student:
  arch: vit_base
  patch_size: 16
  ffn_layer: mlp
  block_chunks: 0
optim:
  epochs: 200
  warmup_epochs: 20
  base_lr: 0.002  # learning rate for a batch size of 1024
crops:
  global_crops_size:
  - 256
  - 128
  local_crops_size:
  - 128
  - 64
```

[screenshot: loss curves]

qasfb commented 10 months ago

So I don't have much experience with the ViT-B architecture, but for instabilities I would start with the following:

What happens when you give a list to global_crops_size? I'm not sure that case is covered in our code, so it might give unexpected results. Can you maybe check visually what your global crop and local crop inputs look like? We usually deal only with square crops.

Were you able to evaluate the checkpoint at 12,500 iterations, by any chance, as a sanity check?
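
One quick way to act on the "check visually" suggestion, using plain torchvision as a stand-in for the repo's own augmentation pipeline (the file name and crop scale ranges below are illustrative assumptions):

```python
import torchvision.transforms as T
from torchvision.utils import save_image
from PIL import Image

# Emulate one rectangular global crop (256x128) and one local crop (128x64)
# as in the config above, and write them to disk for visual inspection.
img = Image.open("sample.jpg").convert("RGB")   # any training image (placeholder path)
global_crop = T.Compose([T.RandomResizedCrop((256, 128), scale=(0.32, 1.0)), T.ToTensor()])(img)
local_crop = T.Compose([T.RandomResizedCrop((128, 64), scale=(0.05, 0.32)), T.ToTensor()])(img)
save_image(global_crop, "global_crop.png")
save_image(local_crop, "local_crop.png")
```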

HollrayChan commented 10 months ago

Thank you very much for your reply. Regarding global_crops_size, I will modify the source code to make sure non-square inputs are supported. I will try the hyperparameters you suggested and fine-tune from the 12,500-iteration checkpoint to evaluate the ReID-related metrics.

jaayeon commented 7 months ago

@HollrayChan Hi, I'm experiencing similar issues where the loss doesn't converge. I'm wondering if you've managed to resolve your problems. Could you please share any updates or results? thanks!

HollrayChan commented 7 months ago

@HollrayChan Hi, I'm experiencing similar issues where the loss doesn't converge. I'm wondering if you've managed to resolve your problems. Could you please share any updates or results? thanks!

In fact, I made various changes later, but the results did not improve. I think DINOv2 is not better than DINOv1 on the task I was working on. Maybe DINOv2 is not well suited to fine-tuning on this task: even fine-tuning directly from the official weights did not give a better result.

qgq99 commented 3 months ago

Hi, I am training DINOv2 on my own dataset too! My dataset has 4,000 images, but I ran into the "NaN detected" error. I checked my training log and found that the koleo_loss is always inf. Do you have any ideas or advice? Thanks very much!

I ran into the same problem ("NaN detected"). Have you solved it? Thanks!

roboyul commented 3 months ago

For those receiving "NaN detected" with small batch sizes on custom data, the instability is most likely related to your learning rate. I was also having this issue with a small batch size of 32 using vit_base (single node, single GPU), and I found I was able to stabilize the training process. For me, adjusting the lr scaling rule to scale more gently was the simple trick.

In dinov2/utils/config.py:25, I changed the batch-size reference in the lr scaling to 64, from my original adjustment of 32. I was able to use the config defaults as well.
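
For concreteness, the change being described touches the square-root lr scaling rule (the path and the modified line are as reported in this thread; the surrounding default code is paraphrased from memory and may differ between repo versions):

```python
# Excerpt (approximate) from dinov2/utils/config.py, where `math` and the
# repo's `distributed` helper are already imported. The default rule scales
# base_lr, defined for a global batch of 1024, by sqrt(global_batch / 1024):
#
#   cfg.optim.lr = cfg.optim.base_lr
#   cfg.optim.lr *= math.sqrt(cfg.train.batch_size_per_gpu * distributed.get_global_size() / 1024.0)
#
# roboyul's change replaces the reference with 64.0, so a global batch of 32
# gets a lower effective lr than it would with a 32.0 reference:
cfg.optim.lr *= math.sqrt(cfg.train.batch_size_per_gpu * distributed.get_global_size() / 64.0)
```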

luccachiang commented 3 months ago

For those receiving "NaN detected" with small batch sizes on custom data, the instability is most likely related to your learning rate. I was also having this issue with a small batch size of 32 using vit_base (single node, single GPU), and I found I was able to stabilize the training process. For me, adjusting the lr scaling rule to scale more gently was the simple trick.

In dinov2/utils/config.py:25, I changed the batch-size reference in the lr scaling to 64, from my original adjustment of 32. I was able to use the config defaults as well.

@roboyul What exactly do the 64 and 32 in your second paragraph mean? Do you mean changing that line to cfg.optim.lr *= math.sqrt(cfg.train.batch_size_per_gpu * distributed.get_global_size() / 64.0) and running with an unchanged learning rate and batch size 32? Thanks in advance for your idea!

luccachiang commented 3 months ago

Sure, here are some curves from an old run that I found; I can't remember what the exact setup and architecture were, so please don't worry about the knn/linear performance values or the loss magnitudes, only the trends:

[screenshot: training curves]

(The momentum and LR do vary smoothly, the step-like effect is from the log parsing.)

Yes knn performance (mainly) and linear (to a lesser extent) are what we used to tune hyperparams.

Hi @qasfb, how can I tell whether the model is getting better if I do not have labels to compute k-NN performance? I am training DINOv2 on a custom dataset that contains only unlabeled images. I followed your recommended parameters for ViT-B to avoid the aforementioned NaN issue, but now the loss does not decrease.

One more question: if I use ImageNet, how do I log the k-NN metrics so that I can monitor the model's performance? Thank you in advance for your explanation.

roboyul commented 3 months ago

For those receiving "NaN detected" with small batch sizes on custom data, the instability is most likely related to your learning rate. I was also having this issue with a small batch size of 32 using vit_base (single node, single GPU), and I found I was able to stabilize the training process. For me, adjusting the lr scaling rule to scale more gently was the simple trick. In dinov2/utils/config.py:25, I changed the batch-size reference in the lr scaling to 64, from my original adjustment of 32. I was able to use the config defaults as well.

@roboyul What exactly do the 64 and 32 in your second paragraph mean? Do you mean changing that line to cfg.optim.lr *= math.sqrt(cfg.train.batch_size_per_gpu * distributed.get_global_size() / 64.0) and running with an unchanged learning rate and batch size 32? Thanks in advance for your idea!

Correct. I had originally set this value to mirror my batch size of 32 (since they originally used a batch size of 1024, I thought it could be 1:1); however, I ran into the "NaN" issue as described. After doubling it to 64, I saw a trend similar to some of the other successful examples posted here, and verified it with the evaluations (progressively higher k-NN accuracy, finishing around 88%).

For reference, my total training set is small, roughly 16,000 images. I changed the patch size to 8 and set global_crops_size to 384, as my use case is high-resolution aerial imagery segmentation and it benefited from these settings.
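
To make the effect of that reference value concrete, here is the multiplier the square-root rule applies to base_lr for a global batch of 32 under the three references discussed (the default 1024, then 64 and 32):

```python
import math

# lr multiplier = sqrt(global_batch_size / reference), with global batch = 32.
for ref in (1024.0, 64.0, 32.0):
    print(f"reference {ref}: multiplier {math.sqrt(32 / ref):.3f}")
# -> 1024.0: 0.177, 64.0: 0.707, 32.0: 1.000
```

Moving the reference from 32 to 64 therefore cuts the effective learning rate by roughly 30%, which matches the "scale more gently" fix described above.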

ahmed1996said commented 1 month ago

For those receiving "NaN detected" with small batch sizes on custom data, the instability is most likely related to your learning rate. I was also having this issue with a small batch size of 32 using vit_base (single node, single GPU), and I found I was able to stabilize the training process. For me, adjusting the lr scaling rule to scale more gently was the simple trick. In dinov2/utils/config.py:25, I changed the batch-size reference in the lr scaling to 64, from my original adjustment of 32. I was able to use the config defaults as well.

@roboyul What exactly do the 64 and 32 in your second paragraph mean? Do you mean changing that line to cfg.optim.lr *= math.sqrt(cfg.train.batch_size_per_gpu * distributed.get_global_size() / 64.0) and running with an unchanged learning rate and batch size 32? Thanks in advance for your idea!

Correct. I had originally set this value to mirror my batch size of 32 (since they originally used a batch size of 1024, I thought it could be 1:1); however, I ran into the "NaN" issue as described. After doubling it to 64, I saw a trend similar to some of the other successful examples posted here, and verified it with the evaluations (progressively higher k-NN accuracy, finishing around 88%).

For reference, my total training set is small, roughly 16,000 images. I changed the patch size to 8 and set global_crops_size to 384, as my use case is high-resolution aerial imagery segmentation and it benefited from these settings.

Hi @roboyul, did you have to change the default hard-coded ImageNet mean/std normalization values when training on your own dataset? I wonder if this could be a problem when training on non-ImageNet datasets and lead to unstable loss (assuming training from scratch and not from ImageNet-pretrained weights).