M3DV / AlignShift

[MICCAI'21 & MICCAI'20] A Codebase for Universal Lesion Detection (DeepLesion SOTA)

Training on my computer but the result is almost 0 #4

Closed studabyd closed 3 years ago

studabyd commented 3 years ago

Hello, I followed your code and tested with your trained model; the results are as good as you reported. However, when I train the model myself, the results are very bad, almost 0. (Your model is 291M, while mine is 581M because it also stores state_dict, optimizer, and meta info.) I have uploaded the result pictures from the two models, plus the training log, which starts from epoch 1 because the server stopped once. Is there something wrong in my training process? Thank you for your answer.

2020-12-09 15-45-33屏幕截图 20201101_230002.log

NotF404 commented 3 years ago

Hello, it seems that your batch size is not big enough. It is 8 in our case.

studabyd commented 3 years ago

With limited GPU resources (2 Titan GPUs), I trained with a total batch size of 4 (2 per GPU). I expected the results to be somewhat worse, but not this bad; as you can see, the detection result is almost 0%. So I think there must be another reason, and I look forward to your help.

kaimingkuang commented 3 years ago

When you run the model on 2 GPUs with a total batch size of 4, each BatchNorm layer only computes statistics over the data on its own GPU, i.e. the mean and std of just 2 samples at a time. This makes the BatchNorm statistics highly volatile and prevents the model from converging. You might want to use gradient checkpointing and AMP to scale up the batch size.
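
For reference, a minimal PyTorch sketch of the two suggestions above (SyncBatchNorm to share statistics across GPUs, plus AMP to save memory); the model, optimizer, and loss here are placeholders, not code from this repo:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the actual detector.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# SyncBatchNorm shares BN statistics across GPUs, so the effective BN batch
# becomes the global batch (4) instead of the per-GPU batch (2).
# Note: this only takes effect under torch.nn.parallel.DistributedDataParallel.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Mixed precision roughly halves activation memory, leaving room for a
# larger per-GPU batch size.
scaler = torch.cuda.amp.GradScaler()

def train_step(model, images, targets, optimizer, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(images), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```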

NotF404 commented 3 years ago

If your Titans have 24 GB of memory each, you can try a batch size of 4 per GPU. I can't say more with the limited info you provide.

studabyd commented 3 years ago

Thank you for your answer. Maybe the small batch size is the reason the loss doesn't decrease. With 2 GPUs of 12 GB each, what could I try to train the model better? I could use gradient accumulation, i.e. only step the optimizer after every 2 iterations, but then BatchNorm is still applied to only 2 samples per GPU. Is there any way to reduce the influence of the small batch on BatchNorm, or some other way to train it better?
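
A minimal sketch of the gradient-accumulation idea described above, with toy placeholders for the model, optimizer, and data; it illustrates why accumulation alone does not solve the BatchNorm problem:

```python
import torch
import torch.nn as nn

# Toy stand-ins; in practice these are the detector, its optimizer, and the
# DeepLesion data loader.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(4)]  # micro-batches of 2

accum_steps = 2  # step the optimizer every 2 iterations -> effective batch of 4

optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()       # scale so accumulated grads average
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Caveat: any BatchNorm layer (none in this toy model) would still normalize
# each micro-batch of 2, so accumulation by itself does not fix the noisy
# BN statistics.
```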

studabyd commented 3 years ago

My Titans are 12 GB each, 24 GB in total. I am trying to find out whether the model can be trained with such limited GPUs. Thank you!

kaimingkuang commented 3 years ago

You could try LayerNorm, InstanceNorm, or other normalization layers that perform better in small-batch settings.
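
A minimal sketch of swapping BatchNorm2d for GroupNorm, which normalizes within each sample and is therefore independent of batch size; the helper name is illustrative, not part of this codebase, and note that a pretrained checkpoint's BN running statistics will not carry over:

```python
import torch.nn as nn

def replace_bn_with_gn(module, num_groups=32):
    """Recursively swap BatchNorm2d for GroupNorm (illustrative helper)."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            c = child.num_features
            # num_groups must divide the channel count; otherwise fall back to
            # one group per channel (equivalent to InstanceNorm with affine).
            g = num_groups if c % num_groups == 0 else c
            setattr(module, name, nn.GroupNorm(num_groups=g, num_channels=c))
        else:
            replace_bn_with_gn(child, num_groups)
    return module

# Example usage on a toy block:
block = nn.Sequential(nn.Conv2d(3, 64, 3), nn.BatchNorm2d(64), nn.ReLU())
block = replace_bn_with_gn(block)
```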

studabyd commented 3 years ago

Ok, I will try it. Thank you.

duducheng commented 3 years ago

When you have limited computational resources, I suggest you take care of the batch size, checkpointing, normalization (SyncBN, GroupNorm, etc.), model size (use lightweight models), and input size (random crop during training). You should also use a pretrained network.
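
For reference, a minimal sketch of the checkpointing suggestion using torch.utils.checkpoint (the toy backbone below is a placeholder for the actual network):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Gradient checkpointing recomputes each segment's activations during the
# backward pass instead of storing them, trading compute for memory and
# leaving room for a larger batch size.
backbone = nn.Sequential(*[nn.Conv2d(16, 16, 3, padding=1) for _ in range(8)])

x = torch.randn(2, 16, 64, 64, requires_grad=True)
out = checkpoint_sequential(backbone, 4, x)  # split into 4 checkpointed segments
out.sum().backward()
```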