FateScript / CenterNet-better

An easy-to-understand and better-performing version of CenterNet
Apache License 2.0
548 stars · 104 forks

How long did you train the model to get a reasonable result? #8

Closed lucasjinreal closed 4 years ago

lucasjinreal commented 4 years ago

Is this loss normal on 1 GPU?

[03/11 14:38:25 c2.utils.events]: eta: 3 days, 15:40:16  iter: 22020  total_loss: 6.084  loss_cls: 4.101  loss_box_wh: 1.746  loss_center_reg: 0.247  time: 0.4291  data_time: 0.0049  lr: 0.002500  max_mem: 4794M            

And here are my solver settings:

SOLVER=dict(
    OPTIMIZER=dict(
        NAME="SGD",
        BASE_LR=0.0025,
        WEIGHT_DECAY=8e-4,
    ),
    LR_SCHEDULER=dict(
        GAMMA=0.1,
        STEPS=(81000, 108000),
        MAX_ITER=826000,
        WARMUP_ITERS=1000,
    ),
    IMS_PER_BATCH=8,
),

So does it have to train for 3 days to get a converged result?

FateScript commented 4 years ago

This loss value seems abnormal, and I have never tried training CenterNet on 1 GPU.

20k iterations with batch size 8 see the same number of images as about 1.2k iterations on 8 GPUs with batch size 128, and a loss of 6 at roughly 1,200 (8-GPU-equivalent) iterations might be correct.

By the way, since you are using 1 GPU instead of 8 and a much smaller batch size, your MAX_ITER and STEPS should be multiplied by 16.
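
For illustration, here is a rough sketch of that bookkeeping in Python. The batch sizes (8 GPUs with 128 images per step vs. 1 GPU with 8) and the schedule values come from this thread; the snippet itself is only illustrative and is not part of the repo's config system.

```python
# Rough sketch of the batch-size bookkeeping discussed above (illustrative only).
REF_BATCH = 128   # effective batch size the default 8-GPU schedule assumes
MY_BATCH = 8      # 1 GPU with IMS_PER_BATCH=8
SCALE = REF_BATCH // MY_BATCH   # = 16

# Equivalence: 20k iterations at batch 8 cover the same number of images
# as about 1.25k iterations at batch 128.
print(20000 * MY_BATCH / REF_BATCH)   # 1250.0

# Scaling the posted schedule by 16 so the model sees the same number of images:
STEPS = tuple(s * SCALE for s in (81000, 108000))   # (1296000, 1728000)
MAX_ITER = 826000 * SCALE                            # posted MAX_ITER, scaled the same way
```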

lucasjinreal commented 4 years ago

Yes, a batch size of 16 OOMs on my 12 GB card, so I am using 8. But does BASE_LR affect the loss curve too? And what about WARMUP_ITERS?

Since the original CenterNet uses an epoch-based learning-rate schedule, in my experiments this version converges much more slowly than the original CenterNet. The original can achieve a reasonable result overnight on my single GPU, but this one, after one night and half a day (89,999 iterations), got results like this:

[image: detection result]

The localization looks normal, but the classes are totally wrong.

I tested your pretrained model and it is correct. I just don't know how much time it takes to get a reasonable result.

FateScript commented 4 years ago

On a machine with 8 2080Ti GPUs, resnet18 requires about 20 h or less, resnet50 about 1 day, and resnet101 nearly 1.5 days.

Anyway, I will try 4-GPU and 1-GPU versions; if they succeed, I will release them.
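
As a very rough back-of-the-envelope reading of those timings (my own extrapolation, assuming near-linear scaling from 8 GPUs down to 1, which real training will not match exactly):

```python
# Hypothetical estimate only: assumes wall time scales ~linearly with GPU count.
eight_gpu_hours = {"resnet18": 20, "resnet50": 24, "resnet101": 36}
for backbone, hours in eight_gpu_hours.items():
    print(f"{backbone}: roughly {hours * 8 / 24:.1f} days on a single GPU")
# resnet18: ~6.7 days, resnet50: ~8.0 days, resnet101: ~12.0 days
```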

lucasjinreal commented 4 years ago

I currently have some seemingly right LR params (resnet50) that can bring the total loss down to 2.0. Results seem reasonable so far: [image: detection result]

It has already cost 2 days, and I think it needs more iterations.

Bovey0809 commented 4 years ago

Could you please share your parameters for the resnet50 backbone? I am struggling with a loss of 6.5.