is this loss normal on 1 GPU?
[03/11 14:38:25 c2.utils.events]: eta: 3 days, 15:40:16 iter: 22020 total_loss: 6.084 loss_cls: 4.101 loss_box_wh: 1.746 loss_center_reg: 0.247 time: 0.4291 data_time: 0.0049 lr: 0.002500 max_mem: 4794M
And here are my solver settings:
```python
SOLVER=dict(
    OPTIMIZER=dict(
        NAME="SGD",
        BASE_LR=0.0025,
        WEIGHT_DECAY=8e-4,
    ),
    LR_SCHEDULER=dict(
        GAMMA=0.1,
        STEPS=(81000, 108000),
        MAX_ITER=826000,
        WARMUP_ITERS=1000,
    ),
    IMS_PER_BATCH=8,
)
```
So I have to train for 3 days to get a converged result?
This loss value seems abnormal, and I have never tried training CenterNet on 1 GPU.
20k iterations with batch size 8 equal roughly 1.2k iterations on 8 GPUs with batch size 128, and a loss of 6 at 1.2k iterations might be correct.
By the way, since you use 1 GPU rather than 8 and a small batch size, your MAX_ITER and STEPS should be multiplied by 16.
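For illustration, a minimal sketch of that scaling, assuming the STEPS above come from the 8-GPU, batch-size-128 reference schedule (the reference batch size is an assumption, not stated in this thread):

```python
# Linear-scaling sketch: stretch the 8-GPU schedule for a 1-GPU run.
# Only the (81000, 108000) steps come from the config above; the
# reference batch size of 128 (8 GPUs x 16 images) is an assumption.
reference_batch = 128
actual_batch = 8
scale = reference_batch // actual_batch  # 16

reference_steps = (81000, 108000)
scaled_steps = tuple(s * scale for s in reference_steps)
print(scaled_steps)  # (1296000, 1728000)
# MAX_ITER from the reference schedule would be multiplied by the
# same factor: scaled_max_iter = reference_max_iter * scale.
```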
Yes, batch size 16 OOMs on my 12 GB card, so I am using 8. But does BASE_LR affect the loss curve too? And does WARMUP_ITERS?
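For what it's worth, here is a minimal sketch of how BASE_LR, WARMUP_ITERS, STEPS, and GAMMA typically interact in a warmup multi-step schedule; the linear warmup factor is an assumption based on common detectron2-style schedulers, not taken from this repo:

```python
def lr_at(iteration, base_lr=0.0025, warmup_iters=1000,
          steps=(81000, 108000), gamma=0.1):
    # Linear warmup ramps the LR from 0 up to base_lr over warmup_iters.
    warmup = min(1.0, iteration / warmup_iters)
    # After each milestone in steps, the LR is multiplied by gamma.
    decay = gamma ** sum(iteration >= s for s in steps)
    return base_lr * warmup * decay

# BASE_LR scales the whole curve; WARMUP_ITERS only shapes the start.
for it in (100, 1000, 22020, 90000, 120000):
    print(it, lr_at(it))
```

Under this assumption, both settings affect the loss curve, but in different regimes: warmup only matters for the first WARMUP_ITERS iterations, while BASE_LR scales the learning rate everywhere.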
Since the original CenterNet uses an epoch-based learning-rate schedule, I experimented and found this version converges much more slowly than the original CenterNet: the original can reach a reasonable result overnight on my single GPU, but this one, after a night and half a day, got results like this at 89999 iters:
The localization seems normal, but the classes are totally wrong.
I tested your pretrained model and it's correct. I just don't know how much time it will take to get a reasonable result.
On a machine with 8 2080Ti GPUs, ResNet-18 requires about 20h or less, ResNet-50 about 1 day, and ResNet-101 nearly 1.5 days.
Anyway, I will try 4-GPU and 1-GPU versions; if they succeed, I will release them.
I currently have some seemingly right LR params (ResNet-50) that can bring the total loss down to 2.0. Results seem reasonable so far:
It has cost 2 days already and needs more iterations, I think.
Could you please share your parameters for the ResNet-50 backbone? I am struggling with a loss of 6.5.