onvungocminh opened this issue 1 year ago
Hi, I'm training my own dataset on DINOv2 too! My dataset has 4000 images, but I hit this "NaN detected" error. I checked my training log and found that the koleo_loss is always inf. Do you have any ideas or advice? Thanks very much!
Yes. My dataset contains 200,000 images of chromosomes. The loss didn't decrease by even 0.1 with vit_base under the default config over 50 epochs (the loss at the start was 15.6255, and after 50 epochs it was 15.6254).
https://github.com/facebookresearch/dinov2/issues/134#issuecomment-1630981911 By the way, do you guys have this issue?
Yes! I have the same issue. When I train my own dataset on 1 node, changing --gpus-per-node has no effect!
You can use --gres=gpu:4 instead of --gpus-per-node=4
Thanks for your reply! Can you give more details? Where does this parameter come from?
Thanks for your reply again! But I only have a single machine with multiple GPUs, and I cannot use Slurm.
OK, it seems the job gets submitted to Slurm if you run train.py directly... I saw torchrun mentioned in https://github.com/facebookresearch/dinov2/issues/134, hopefully that will work?
I see, but for me it doesn't work.
Hi authors, I tried DINO with my dataset of 4,000,000 images of people, but after 30 epochs the loss function does not decrease anymore. Do you have any idea about that? Do you think some outlier images (e.g. capturing only part of a person, or overexposed images) could be a problem for DINO to learn from? Thank you in advance.
I have the same question. During the initial stages of training, the loss converges quickly, but after a few epochs, it no longer decreases and may even show slight increases.
I also have the same question. In addition, the koleo_loss goes negative; any suggestions? Thank you!
I don't think negative values of the KoLeo loss indicate that anything is wrong. The KoLeo loss, -ln(d), can be negative or positive, depending on whether the distance d is smaller than 1. Regardless of the value of d, gradient descent will incentivize the model to increase d, which is what we want.
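For intuition, here is a minimal sketch of a KoLeo-style loss (not the repo's exact implementation): with unit-normalized embeddings, the nearest-neighbor distance d can be above or below 1, so -ln(d) naturally swings negative for well-spread features.

```python
import torch
import torch.nn.functional as F

def koleo_loss_sketch(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Toy KoLeo-style loss: -mean(log d_i), where d_i is the distance
    from sample i to its nearest neighbor within the batch."""
    x = F.normalize(features, dim=-1)   # unit-norm embeddings
    sims = x @ x.t()                    # cosine similarities
    sims.fill_diagonal_(-2.0)           # exclude self-matches
    nn_idx = sims.argmax(dim=1)         # nearest neighbor per sample
    d = (x - x[nn_idx]).norm(dim=1)     # nearest-neighbor distances
    return -torch.log(d + eps).mean()

# Well-spread random embeddings tend to give d > 1, hence a negative loss;
# tightly clustered embeddings give d < 1 and a positive loss. Both are valid values.
print(koleo_loss_sketch(torch.randn(16, 64)))
print(koleo_loss_sketch(torch.randn(16, 64) * 0.01 + torch.ones(64)))
```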
Thank you for your reply! That makes sense!
Same here. I tried decreasing the base learning rate and things improved a bit, but still after a few epochs the loss starts increasing instead of decreasing. I am going to run a few more experiments, but it looks to me that the hyperparameters chosen by the authors for the "short" training really only apply to ImageNet. Other datasets probably require different hyperparameters.
So the loss values are not supposed to decrease monotonically in DINOv2 (as the EMA teacher is a moving target); the curves plotted above are perfectly valid and look like what we see on our side during training.
However the knn or linear classification performance should increase over time.
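(For reference, the "moving target" comes from the teacher being an exponential moving average of the student. A minimal sketch of that update, with an illustrative momentum value, is below; the repo's actual update code may differ.)

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.994) -> None:
    """Sketch of an EMA teacher update: teacher <- m * teacher + (1 - m) * student.
    Because the teacher drifts every step, the student's cross-entropy target keeps
    moving, so the training loss need not decrease monotonically."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)
```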
@qasfb would you be able to publish some loss curves that you consider perfectly valid?
Did you measure knn and linear classification performance during training to aid with hyperparameter tuning? If yes, could you elaborate a little on that?
Thanks a lot
Sure, here are some curves from an old run that I found; I can't remember what was the exact setup and architecture so please don't worry about the knn/linear performance values or the loss magnitudes, but only the trends:
(The momentum and LR do vary smoothly, the step-like effect is from the log parsing.)
Yes knn performance (mainly) and linear (to a lesser extent) are what we used to tune hyperparams.
@qasfb that is super helpful.
Just for clarification on knn/linear: you had a separate labeled knn/linear evaluation running alongside pretraining, and the total loss was not really the metric to optimize, due to the nature of the DINOv2 algorithm.
Is this correct, or did I misunderstand something?
For parameter tuning, ssl_default and the actual ViT config contain a lot of potential nuts and bolts to tune.
Do you have a list, ideally in order of importance, of hyperparameters you would typically tune?
And are the DINO and iBOT head parameters part of the tuning?
Again, thanks a lot
So knn and linear we evaluate on the imagenet-1k dataset [train / val]. We evaluate the teacher EMA model only. For knn, we use only the [cls] token and retrieve the nearest neighbors for classification. For linear we grid over [last cls token, 4 last cls tokens] x [nothing, avgpool of patch tokens] x [grid of learning rates], by training many linear classifiers in parallel using a single model inference.
The rest I think you are correct. The setup is most sensitive to the learning rate, I would say; we try to stay as high as possible, but below instability. We did not tweak the dino and ibot head parameters much; we found that >65k prototypes with separate ibot/dino heads worked well. Anecdotally, layer-wise learning rate decay can help in some cases. There is an ablation in the paper that maybe you can check.
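(For readers wanting to reproduce the knn check outside the repo's eval tooling, a minimal sketch is below. It uses a plain majority vote rather than the distance-weighted voting from the DINO paper, and it assumes `model(images)` returns the [CLS] embedding.)

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(model, train_loader, val_loader, k: int = 20, device: str = "cuda") -> float:
    """Sketch of a [CLS]-token k-NN evaluation on a labeled train/val split."""
    def extract(loader):
        feats, labels = [], []
        for images, targets in loader:
            cls = model(images.to(device))            # assumed to return the [CLS] embedding
            feats.append(F.normalize(cls, dim=-1).cpu())
            labels.append(targets)
        return torch.cat(feats), torch.cat(labels)

    train_x, train_y = extract(train_loader)
    val_x, val_y = extract(val_loader)

    sims = val_x @ train_x.t()                        # cosine similarity to the train bank
    nn_idx = sims.topk(k, dim=1).indices              # k nearest neighbors per val sample
    preds = train_y[nn_idx].mode(dim=1).values        # majority vote among neighbors
    return (preds == val_y).float().mean().item()
```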
@qasfb great, thanks for the super quick turn around.
That helps clarify a lot. I guess knn is a lot easier to replicate than the linear part, unless you would be open to including both in the codebase at a later stage.
The ablation is a good read, but I could not really find much in there that would help me adapt this to a different dataset or domain, beyond better understanding the impact of the general design choices (which I would clearly keep). The one rule of thumb I would take forward is: a bigger model for a bigger, more diverse dataset.
Your explanations definitely give me enough confidence and tools to use the model for a different domain, knowing that I essentially just need to look out for the learning rate, keep everything else as is, watch the knn performance to see whether the model is learning, and use the other metrics to judge the stability of training.
Thanks heaps. If you have a couple more graphs to share, especially cases where the training didn't go well, that would be awesome, but this information is already super valuable.
Thanks! This is indeed useful information. My training curves look similar to the ones you posted, although I am seeing that the kNN performance reaches its maximum when the total loss is at its minimum, and then it kind of stays there. However, I am training the fast version (the ViT-16 with the fast config) from scratch, so I might just be hitting the ceiling of what that configuration can do. I will try the larger model.
Hello, I am trying to pretrain a ViT-Base DINOv2 model on LUPerson, a public pedestrian dataset of about 2.5 million images. Below are my training script, configuration file, and loss curves; I am using a single node with 4 GPUs. I found that the total loss stagnates once it drops to about 11. In the subsequent finetuning, my teacher pretrained checkpoint and torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14') gave similar results, and both were better than supervised learning, while using a ViT-Base ImageNet checkpoint as pretraining was even worse. I don't know how to set the lr and batch size for vit_base. Can you give me some suggestions? Looking forward to your reply.
```shell
torchrun \
    --nproc_per_node=4 train.py \
    --config-file config.yaml \
    --output-dir outdir
```
config:

```yaml
dino:
  head_n_prototypes: 131072
  head_bottleneck_dim: 384
ibot:
  separate_head: true
  head_n_prototypes: 131072
train:
  batch_size_per_gpu: 42
  dataset_path: /storage/LUPerson
  centering: sinkhorn_knopp
student:
  arch: vit_base
  patch_size: 16
  drop_path_rate: 0.4
  ffn_layer: mlp
  block_chunks: 4
teacher:
  momentum_teacher: 0.994
optim:
  epochs: 500
  weight_decay_end: 0.2
  base_lr: 0.4e-04  # learning rate for a batch size of 1024
  warmup_epochs: 80
  layerwise_decay: 1.0
crops:
  global_crops_size:
  - 256
  - 128
  local_crops_size:
  - 128
  - 64
```
@BenSpex Thank you for your reply. I then adjusted the learning rate and used a larger batch in a multi-node, multi-GPU setup. At first the loss drops as quickly as in the result provided by @qasfb, but after about 12,500 iterations the loss seems unable to converge, and I encountered "NaN detected", which stopped the training. I want to know which parameters could cause the loss to surge around iteration 13,000. The following are my training parameters and results.
```yaml
train:
  batch_size_per_gpu: 64
  dataset_path: LUPerson
student:
  arch: vit_base
  patch_size: 16
  ffn_layer: mlp
  block_chunks: 0
optim:
  epochs: 200
  warmup_epochs: 20
  base_lr: 0.002  # learning rate for a batch size of 1024
crops:
  global_crops_size:
  - 256
  - 128
  local_crops_size:
  - 128
  - 64
```
So I don't have much experience with the ViT-B architecture, but for instabilities I would start by adjusting the hyperparameters.
What happens when you give a list to global_crops_size? I'm not sure that case is covered in our code, so it might give unexpected results. Can you maybe check visually what your global crop and local crop inputs look like? We usually deal only with square crops.
Were you able to evaluate the checkpoint at 12,500 iterations, by any chance, as a sanity check?
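(If it helps, a quick way to eyeball a non-square crop is below. This just applies torchvision's RandomResizedCrop directly with the sizes from the config above; it is not the repo's actual augmentation pipeline, and the image path is a placeholder.)

```python
# Rough visual check of non-square crop sizes (illustrative only; the DINOv2
# augmentation pipeline adds flips, color jitter, blur, etc. on top of the crop).
from PIL import Image
from torchvision import transforms

img = Image.open("sample.jpg")  # placeholder path to one training image

global_crop = transforms.RandomResizedCrop((256, 128))(img)  # (height, width) as in the config
local_crop = transforms.RandomResizedCrop((128, 64))(img)

global_crop.save("global_crop.jpg")
local_crop.save("local_crop.jpg")
```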
Thank you very much for your reply. Regarding global_crops_size, I will modify the source code accordingly to make sure it supports non-square inputs. I will also try the hyperparameters you provided and finetune from the 12,500-iteration checkpoint to evaluate the ReID-related metrics.
@HollrayChan Hi, I'm experiencing similar issues where the loss doesn't converge. I'm wondering if you've managed to resolve your problems. Could you please share any updates or results? thanks!
In fact, I made various changes later, but the results did not improve. I think DINOv2 is not better than DINOv1 on my task; maybe DINOv2 is not well suited for finetuning on this task, since even finetuning directly from the official weights at the beginning did not give better results.
I ran into the same problem ("NaN detected"). Have you solved it? Thanks!
For those receiving "NaN detected" with small batch sizes on custom data, the instability is most likely related to your learning rate. I was also having this issue with a small batch size of 32 using vit_base (single node, single GPU), and I found I was able to stabilize the training process. For me, adjusting the lr scaling to scale more slowly was the simple trick. In dinov2/utils/config.py:25, I changed this value to 64 from my original adjustment of 32. I was able to use the config defaults as well.
@roboyul What do the 64 and 32 in your second paragraph mean exactly? Do you mean changing that line to `cfg.optim.lr *= math.sqrt(cfg.train.batch_size_per_gpu * distributed.get_global_size() / 64.0)` and running with an unchanged learning rate and a batch size of 32? Thanks for your idea in advance!
Hi @qasfb, how can I tell whether the model is getting better if I do not have labels to compute knn performance? I am using a custom dataset of unlabeled images to train DINOv2. I followed your recommended parameters for ViT-B to avoid the aforementioned NaN issue, but now the loss does not decrease.
One more question: if I use ImageNet, how do I log the knn metrics so that I can monitor the model's performance? Thank you in advance for your explanation.
Correct. I had originally attempted to mirror my batch size of 32 with this value (since they originally used a batch size of 1024, I thought it could be a 1:1 mapping), but I ran into the "NaN" issue as described. After doubling it to 64, I saw a trend similar to some of the other successful examples posted here, verified with the evaluations (progressively higher k-NN accuracy, finishing around 88%).
For reference, my total training set is small, roughly 16,000 images. I changed the patch size to 8 and set global_crops_size to 384, as my use case is high-resolution aerial imagery segmentation and benefited from these settings.
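(For context, the scaling rule being discussed around dinov2/utils/config.py:25 is a square-root rule that rescales base_lr, which the config defines "for a batch size of 1024", to the actual global batch size. The sketch below is illustrative rather than the repo's exact code; the reference batch size in the denominator is the value roboyul changed from 32 to 64, and the base_lr number is made up.)

```python
import math

def scaled_lr(base_lr: float, batch_size_per_gpu: int, num_gpus: int,
              reference_batch_size: float = 1024.0) -> float:
    """Square-root LR scaling: lr = base_lr * sqrt(global_batch / reference_batch)."""
    global_batch_size = batch_size_per_gpu * num_gpus
    return base_lr * math.sqrt(global_batch_size / reference_batch_size)

base_lr = 4e-3  # hypothetical value, just to show the effect of the reference batch size
print(scaled_lr(base_lr, 32, 1))                               # 1024 reference -> much smaller lr
print(scaled_lr(base_lr, 32, 1, reference_batch_size=32.0))    # 1:1 reference -> full base_lr (unstable here)
print(scaled_lr(base_lr, 32, 1, reference_batch_size=64.0))    # roboyul's choice -> base_lr / sqrt(2)
```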
Hi @roboyul, did you have to change the default hard-coded ImageNet mean/std normalization values when training on your own dataset? I wonder if this could be a problem when training on non-ImageNet datasets and lead to unstable loss (assuming training from scratch and not ImageNet pretrained weights)
Although this issue has been open for a long time, I think the stuck loss might be caused by a high teacher temperature. There is a sentence about the constant loss in Appendix D of the DINO paper:
"When the temperature is higher than 0.06, the training loss consistently converges to ln(K)."
The loss here seems to sit around 10-11, which is close to ln(65536) ≈ 11.09, so I recommend reducing the teacher temperature.
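(The ln(K) value is just the cross-entropy you get when the prediction over K prototypes is uniform, so a loss pinned there suggests the outputs have collapsed to uniform. A quick check against the head sizes mentioned in this thread:)

```python
import math

# Loss plateau expected if predictions over K prototypes collapse to uniform.
for k in (65536, 131072):            # 65536 from the comment above; 131072 from the config posted earlier
    print(k, round(math.log(k), 4))  # 65536 -> 11.0904, 131072 -> 11.7835
```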
Hi, did you use your own dataset or ImageNet for the fast-config pretraining you mentioned above? Thank you so much!
I changed all fp16 to fp32, and the NaN problem disappeared.
@roboyul Hi, I'm working on a similar task to the one you shared. I would really appreciate it if you could share the hyperparameters and hardware specifications (especially the GPU) you used for training DINOv2. Thank you!