andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
http://andrewowens.com/multisensory/
Apache License 2.0
220 stars 60 forks

Why acc doesn't change when shift_model training? #31

Open ruizewang opened 4 years ago

ruizewang commented 4 years ago

Hello, When I train shift_lowfps model, the loss decreases slowly, but acc doesn't change (0.500). Could you give me some advice?

[grad norm:][0.0125109516]
Iteration 5500, lr = 1e-03, total:loss: 1.246 reg: 0.041 loss:label: 0.705 acc:label: 0.500, time: 2.978
Iteration 5510, lr = 1e-03, total:loss: 1.244 reg: 0.040 loss:label: 0.704 acc:label: 0.500, time: 2.974
Iteration 5520, lr = 1e-03, total:loss: 1.241 reg: 0.039 loss:label: 0.703 acc:label: 0.500, time: 2.953
Iteration 5530, lr = 1e-03, total:loss: 1.239 reg: 0.037 loss:label: 0.702 acc:label: 0.500, time: 2.960
Iteration 5540, lr = 1e-03, total:loss: 1.238 reg: 0.036 loss:label: 0.701 acc:label: 0.500, time: 2.971
Iteration 5550, lr = 1e-03, total:loss: 1.236 reg: 0.035 loss:label: 0.700 acc:label: 0.500, time: 2.965
Iteration 5560, lr = 1e-03, total:loss: 1.234 reg: 0.034 loss:label: 0.700 acc:label: 0.500, time: 2.961
Iteration 5570, lr = 1e-03, total:loss: 1.232 reg: 0.033 loss:label: 0.699 acc:label: 0.500, time: 2.957
Iteration 5580, lr = 1e-03, total:loss: 1.231 reg: 0.032 loss:label: 0.699 acc:label: 0.500, time: 2.952
Iteration 5590, lr = 1e-03, total:loss: 1.229 reg: 0.031 loss:label: 0.698 acc:label: 0.500, time: 2.967
[grad norm:][0.00501754601]
Iteration 5600, lr = 1e-03, total:loss: 1.228 reg: 0.030 loss:label: 0.698 acc:label: 0.500, time: 2.968
Iteration 5610, lr = 1e-03, total:loss: 1.227 reg: 0.030 loss:label: 0.697 acc:label: 0.500, time: 2.960
Iteration 5620, lr = 1e-03, total:loss: 1.225 reg: 0.029 loss:label: 0.697 acc:label: 0.500, time: 2.951
Iteration 5630, lr = 1e-03, total:loss: 1.224 reg: 0.028 loss:label: 0.696 acc:label: 0.500, time: 2.977
Iteration 5640, lr = 1e-03, total:loss: 1.223 reg: 0.027 loss:label: 0.696 acc:label: 0.500, time: 2.973
Iteration 5650, lr = 1e-03, total:loss: 1.222 reg: 0.026 loss:label: 0.696 acc:label: 0.500, time: 2.981

andrewowens commented 4 years ago

Yes, this is a common failure mode! The model also takes a long time to get better-than-chance performance, which can make it look like it's stuck.

  • What batch size are you using? Are you training on AudioSet? Note that I trained that model with 3 GPUs, so the effective batch size was 45.
  • The loss values you should probably be looking at are "loss:label", the cross-entropy loss, and "acc", the overall accuracy. Here, chance performance would be acc = 0.5 and loss:label = -ln(0.5) ≈ 0.693. So it looks like the model has not yet reached chance performance.
  • In my experiments, the model took something like 2K iterations to reach chance performance (loss:label = 0.693), and 11K iterations to do better than chance (loss:label = 0.692). So for a long time it looked like the model was stuck at chance.
  • Did you decrease the learning rate? I trained with lr = 1e-2 at the beginning. This might explain why your model is still doing worse than chance at 5K iterations.

ruizewang commented 4 years ago

Thank you very much for your explanation. Now it suddenly makes sense to me.
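The chance-level numbers in Andrew's explanation are easy to verify: for a balanced two-class problem, a classifier that always outputs probability 0.5 has cross-entropy -ln(0.5) = ln(2) ≈ 0.693 and accuracy 0.5. A minimal arithmetic check, independent of the repo's code:

```python
import math

# Cross-entropy of a binary classifier that predicts p = 0.5 for every
# example: -ln(0.5) = ln(2), the "chance" value of loss:label.
chance_loss = -math.log(0.5)

print(round(chance_loss, 3))  # 0.693
```

So a loss:label above 0.693 (as in the log output above) means the model is still doing worse than chance.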

ruizewang commented 4 years ago

Thanks a lot, Andrew. It is really helpful. šŸ˜ƒ

ruizewang commented 4 years ago

Sorry to bother you, I am here again. Once a shift model (e.g., 'net.tf-30000') has been trained, how do I use it for testing? Is it enough to set "is_training" to False and run shift_net.train? I suspect there is something else I should do.

class Model:
    def __init__(self, pr, sess, gpus, is_training=False, pr_test=None):
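For context on why an "is_training" flag matters at test time: such flags typically switch off stochastic layers like dropout and freeze batch-norm statistics. A generic illustration of the flag's effect (a toy sketch, not the repo's actual Model class):

```python
import numpy as np

class Dropout:
    """Toy dropout layer: stochastic in training, identity at test time."""

    def __init__(self, rate=0.5, seed=0):
        self.rate = rate
        self.rng = np.random.default_rng(seed)

    def __call__(self, x, is_training):
        if not is_training:
            # Test mode: dropout does nothing, so outputs are deterministic.
            return x
        # Training mode: randomly zero units and rescale the survivors.
        mask = self.rng.random(x.shape) >= self.rate
        return x * mask / (1.0 - self.rate)

layer = Dropout()
x = np.ones(8)
print(layer(x, is_training=False))  # input passed through unchanged
```

Running the same layer with is_training=True would instead zero out roughly half of the units, which is why evaluating with the training-mode graph gives noisy, misleading accuracy numbers.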
ruizewang commented 4 years ago

Hello Andrew.

andrewowens commented 4 years ago
  • Please refer to shift_example.py for an example of testing a trained network.
  • I think the loss is going up when you fine-tune because you are using a higher learning rate and (especially) a smaller batch size. The model starts out better than chance, but the parameters become worse because it's taking large steps (high learning rate) in not-so-great gradient directions (low batch size).
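The batch-size point can be illustrated with a toy simulation (an idealized sketch, not the actual training dynamics): the spread of a batch-averaged gradient shrinks roughly like 1/sqrt(batch size), so a small batch gives much noisier update directions than the effective batch size of 45 used for pretraining:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each example yields a noisy estimate of the true gradient 1.0.
per_example_grads = 1.0 + rng.standard_normal(90_000)

for batch_size in (5, 45):
    # Average per-example gradients into batches and measure the spread
    # of the batch means: larger batches give steadier step directions.
    means = per_example_grads.reshape(-1, batch_size).mean(axis=1)
    print(batch_size, round(float(means.std()), 3))
```

With a high learning rate on top of that, each noisy step moves the parameters a long way, which is consistent with the loss initially going up during fine-tuning.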
ruizewang commented 4 years ago

There is an example of generating a CAM in shift_example.py. But you reported accuracy in the paper: "Task performance. We found that the model obtained 59.9% accuracy on held-out videos for its alignment task (chance = 50%)." Actually, I want to test the model and get its accuracy on the test dataset. Do I need to rewrite this part of the code?

ruizewang commented 4 years ago

This problem is solved. Following your suggestion, I added a "test_accuracy" function to "class NetClf". Thanks again, Andrew. šŸ˜€

vuthede commented 3 years ago

Hi @ruizewang, would you mind sharing the code you used to create the data files for training? I would really appreciate it.