About difference between the epoch and iters

IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.

https://detrex.readthedocs.io/en/latest/

Apache License 2.0

About difference between the epoch and iters #126

Closed WesleyZCheng closed 1 year ago

WesleyZCheng commented 1 year ago

Hi, thanks for releasing this repo. I have been using the original DINO, DAB-DETR, DN-DETR, and other DETR-like models, and I really like the work your team has done.

However, I have some questions about how epochs are handled here. In the original DINO code we can set the number of epochs to 12, 24, or 36, which is easy to understand, and we can set batch_size to 1, 2, or more. For instance, if my dataset has 1000 images and batch_size is 2, the GPU processes 2 images per step, so one epoch takes 500 iterations. That is easy for me to understand.

detrex inherits part of its implementation from detectron2, and its configs are expressed in iterations through max_iter:

train.max_iter = 90000
train.eval_period = 5000
train.log_period = 20

It is hard for me to understand what values like 90000 and 5000 stand for. More precisely: first, what is the relationship between max_iter and epochs? Second, what does one iteration mean?

If I have 1000 images and want to train for 12 epochs with batch_size 2, how should I set max_iter?

rentainhe commented 1 year ago

Our configuration system can be confusing at first. detrex is built on detectron2, which only supports an iteration-based trainer, so our configs differ a little from the original DINO repo.

dataloader.train.total_batch_size is the total training batch size across all GPUs. For example, with total_batch_size=16 and --num-gpus 2, each GPU gets 16 / 2 = 8 samples per iteration.

train.max_iter is the total number of training iterations. For example, train.max_iter=90000 with dataloader.train.total_batch_size=16 means you train for 90k iterations with a batch size of 16 per iteration, so the model sees 90k * 16 samples in total.
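
The batch-size arithmetic above can be checked with a small sketch. The helper names `per_gpu_batch_size` and `samples_seen` are hypothetical and are not part of detrex or detectron2; the divisibility check is an assumption about how the total batch is split evenly across GPUs.

```python
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    """Split the total batch evenly across GPUs (assumes an even split is required)."""
    if total_batch_size % num_gpus != 0:
        raise ValueError("total_batch_size must be divisible by num_gpus")
    return total_batch_size // num_gpus


def samples_seen(max_iter: int, total_batch_size: int) -> int:
    """Total number of samples processed over the whole run."""
    return max_iter * total_batch_size


print(per_gpu_batch_size(16, 2))   # 8 samples on each of the 2 GPUs
print(samples_seen(90000, 16))     # 1440000 samples in total
```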

If you want to train on 1k images for 12 epochs with batch_size=16, there are 1000 / 16 = 62.5 iterations per epoch, and the total number of training iterations is 12 * 1000 / 16 = 750. You can modify the scheduler as follows:

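from detectron2.config import LazyCall as L
from detectron2.solver import WarmupParamScheduler
from fvcore.common.param_scheduler import MultiStepParamScheduler

# 1000 images / total batch size 16 = 62.5 iterations per epoch
ITERS_PER_EPOCH = 1000 / 16


def default_coco_scheduler(epochs=12, decay_epochs=11, warmup_epochs=0):
    # Convert epoch counts into iteration counts for the iteration-based trainer.
    total_steps_16bs = int(epochs * ITERS_PER_EPOCH)
    decay_steps = int(decay_epochs * ITERS_PER_EPOCH)
    warmup_steps = int(warmup_epochs * ITERS_PER_EPOCH)
    # Multiply the learning rate by 0.1 once decay_steps is reached.
    scheduler = L(MultiStepParamScheduler)(
        values=[1.0, 0.1],
        milestones=[decay_steps, total_steps_16bs],
    )
    return L(WarmupParamScheduler)(
        scheduler=scheduler,
        warmup_length=warmup_steps / total_steps_16bs,
        warmup_method="linear",
        warmup_factor=0.001,
    )


lr_multiplier_12ep = default_coco_scheduler()

# set train iters to 750 (`train` is the training config already defined in your config file)
train.max_iter = 750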
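
To answer the original case (batch size 2) as well, here is a minimal sketch that derives the same numbers from the dataset size instead of hard-coding 62.5. `schedule_steps` is a hypothetical helper, not a detrex API; rounding `max_iter` up so the last partial batch is still covered is an assumption.

```python
import math


def schedule_steps(num_images, total_batch_size, epochs=12, decay_epochs=11, warmup_epochs=0):
    """Translate an epoch-based schedule into iteration counts."""
    iters_per_epoch = num_images / total_batch_size
    return {
        "max_iter": math.ceil(epochs * iters_per_epoch),
        "decay_steps": int(decay_epochs * iters_per_epoch),
        "warmup_steps": int(warmup_epochs * iters_per_epoch),
    }


# 1000 images, 12 epochs, total batch size 16 (the case in the reply above)
print(schedule_steps(1000, 16))  # {'max_iter': 750, 'decay_steps': 687, 'warmup_steps': 0}

# 1000 images, 12 epochs, total batch size 2 (the case in the question)
print(schedule_steps(1000, 2))   # {'max_iter': 6000, 'decay_steps': 5500, 'warmup_steps': 0}
```

With batch size 2, one epoch is 500 iterations, so 12 epochs correspond to train.max_iter = 6000.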

rentainhe commented 1 year ago

We will add more tutorials about this~ @WesleyZCheng

WesleyZCheng commented 1 year ago

Thanks so much for replying to my question. I will do my best to follow your approach and read coco_schedule.py carefully.

I'm looking forward to the tutorials on this topic from you and your team. Keep up the excellent work on DETR.

rentainhe commented 1 year ago

Thanks, we will keep making detrex easier to use~

WesleyZCheng commented 1 year ago

I have more questions, this time about the model weight files. In the original DETR-style code there are two weight files: one is the CNN backbone weight file (ResNet or similar), and the other is the checkpoint we can download from the Model Zoo in the GitHub repo, for example:

parser.add_argument('--pretrain_model_path', default='J:/code/DINO-PTH/checkpoint0031_5scale.pth', help='load from other checkpoint')

In detrex, I can see that the weight files and model zoo are published at https://detrex.readthedocs.io/, and I found how to change the CNN weights:

train.init_checkpoint = "detectron2://ImageNetPretrained/torchvision/R-50.pkl"

But I can't find where to set the transformer weight file from the Model Zoo, for example dino_swin_large_384_4scale_36ep.pth.

rentainhe commented 1 year ago

Both of these can be done by setting train.init_checkpoint.

This config tells the model which pretrained weights to load.

Setting it to detectron2://ImageNetPretrained/torchvision/R-50.pkl loads the pretrained R-50 weights into DINO: the detectron2 checkpointer automatically finds the keys in the checkpoint that match the model and loads them. Keys that exist in the model but not in the checkpoint are not loaded. Therefore, if you want to use a pretrained DINO model in detrex, just set train.init_checkpoint=path/to/dino_swin_large_384_4scale_36ep.pth

You can download the pretrained weights from the detrex Model Zoo and point train.init_checkpoint at the downloaded file~ @WesleyZCheng
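
The key-matching behaviour described above can be illustrated with plain Python sets. This is a simplified sketch, not detectron2's actual checkpointer (which may also apply name conversion and shape checks); `match_checkpoint_keys` and every key name below are hypothetical.

```python
def match_checkpoint_keys(model_keys, checkpoint_keys):
    """Report which parameters would be loaded when keys are matched by name."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    return {
        # present in both: these weights are loaded
        "loaded": sorted(model_keys & checkpoint_keys),
        # in the model only: left at their initial values
        "missing_in_checkpoint": sorted(model_keys - checkpoint_keys),
        # in the checkpoint only: ignored
        "unused_in_checkpoint": sorted(checkpoint_keys - model_keys),
    }


# Hypothetical key names for a detector and a backbone-only checkpoint.
model_keys = ["backbone.conv1.weight", "transformer.encoder.weight", "class_embed.weight"]
backbone_only = ["backbone.conv1.weight", "fc.weight"]

report = match_checkpoint_keys(model_keys, backbone_only)
for name, keys in report.items():
    print(name, keys)
```

With a backbone-only checkpoint, only the backbone key lands in "loaded"; a full detector checkpoint would cover the transformer keys too.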

WesleyZCheng commented 1 year ago

So detectron2 matches the keys the model needs, and if a key exists in the checkpoint but not in the model being trained, detectron2 ignores it. But if I set train.init_checkpoint twice in the same config file, won't the later assignment overwrite the earlier one?

train.init_checkpoint=detectron2://ImageNetPretrained/torchvision/R-50.pkl
train.init_checkpoint=path/to/dino_swin_large_384_4scale_36ep.pth

If I set it like this, it will probably load only the .pth file, whereas the original repo may load both the ResNet weights and the .pth file.

WesleyZCheng commented 1 year ago

Sorry about that, I just found that the weights released in the Model Zoo at https://detrex.readthedocs.io already include both the ResNet weights and the Transformer weights.

So my question is probably solved. Thank you for answering so patiently. I will get in touch again if I have further questions.

rentainhe commented 1 year ago

Yes, the later assignment overwrites the earlier one.

In detrex, the config file is evaluated before training, and it is plain Python, so if you set train.init_checkpoint=path/to/dino_swin_large_384_4scale_36ep.pth after train.init_checkpoint=detectron2://ImageNetPretrained/torchvision/R-50.pkl, init_checkpoint ends up holding only the latter.
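
Because the config is ordinary Python, the overwrite is just attribute reassignment. A minimal sketch, using `types.SimpleNamespace` as a stand-in for the real `train` config object (an assumption made only so the snippet runs without detrex installed):

```python
from types import SimpleNamespace

# Stand-in for the `train` config node; the real object comes from the detrex config.
train = SimpleNamespace(init_checkpoint="")

train.init_checkpoint = "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
train.init_checkpoint = "path/to/dino_swin_large_384_4scale_36ep.pth"

# Only the last assignment survives; the R-50 path is gone.
print(train.init_checkpoint)  # path/to/dino_swin_large_384_4scale_36ep.pth
```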

rentainhe commented 1 year ago

Yes, the pretrained DINO weights include the ResNet backbone.

rentainhe commented 1 year ago

If you have any other questions, feel free to open a new issue in detrex~

WesleyZCheng commented 1 year ago

Sure. I hope you and your team will release more exciting work on detrex. I will continue my research with detrex.

rentainhe commented 1 year ago

Thank you~

todesti2 commented 9 months ago

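Hello, there seems to be a problem with total_steps_16bs = int(epochs * 62.5). Here 62.5 = 1000/16, which is the number of training images divided by batch_size, but what I see in the configuration file is something else. Isn't that different from what you described?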
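
One possible reading, offered as an assumption rather than a statement about any specific config file: the per-epoch constant depends on the dataset size, so a config written for a larger dataset will carry a different number than 62.5. The sketch below compares the 1000-image example from this thread with a dataset of 118,287 images (the commonly quoted size of COCO train2017, used here only for illustration); `iters_per_epoch` is a hypothetical helper.

```python
def iters_per_epoch(num_images: int, total_batch_size: int) -> float:
    """Iterations needed to pass over the dataset once."""
    return num_images / total_batch_size


custom = iters_per_epoch(1000, 16)       # the 1000-image example from this thread
larger = iters_per_epoch(118_287, 16)    # assumed dataset size, for comparison only

print(custom)  # 62.5
print(larger)  # 7392.9375
```

So the formula is the same in both cases; only the dataset size plugged into it changes.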