facebookresearch / AttentiveNAS

Code for "AttentiveNAS: Improving Neural Architecture Search via Attentive Sampling"

The supernet appears to be reinitialized during the training process #5

Closed liwei109 closed 2 years ago

liwei109 commented 2 years ago

The supernet appears to be reinitialized during the training process. I ran into this issue while training AlphaNet. The log is as follows:

Example 1:

```
[10/09 16:00:53]: Epoch: [4][ 50/312] Time 2.075 ( 2.485) Data 0.000 ( 0.273) Loss 4.9844e+00 (4.9407e+00) Acc@1 17.43 ( 16.29) Acc@5 37.01 ( 35.80)
[10/09 16:01:15]: Epoch: [4][ 60/312] Time 2.258 ( 2.431) Data 0.000 ( 0.228) Loss 4.9118e+00 (4.9424e+00) Acc@1 15.50 ( 16.19) Acc@5 34.94 ( 35.68)
[10/09 16:01:37]: Epoch: [4][ 70/312] Time 2.368 ( 2.400) Data 0.000 ( 0.196) Loss 6.8941e+00 (5.1301e+00) Acc@1  0.10 ( 14.50) Acc@5  0.81 ( 32.05)
[10/09 16:01:59]: Epoch: [4][ 80/312] Time 1.940 ( 2.374) Data 0.000 ( 0.172) Loss 6.8695e+00 (5.3466e+00) Acc@1  0.10 ( 12.73) Acc@5  0.76 ( 28.20)
```

Example 2:

```
[10/11 08:46:30]: Epoch: [169][170/312] Time 2.279 ( 2.272) Data 0.000 ( 0.082) Loss 3.7633e+00 (3.6145e+00) Acc@1 41.94 ( 43.52) Acc@5 64.28 ( 67.07)
[10/11 08:46:53]: Epoch: [169][180/312] Time 2.159 ( 2.270) Data 0.000 ( 0.077) Loss 3.7879e+00 (3.6247e+00) Acc@1 39.58 ( 43.30) Acc@5 63.65 ( 66.86)
[10/11 08:47:15]: Epoch: [169][190/312] Time 2.206 ( 2.266) Data 0.000 ( 0.073) Loss 6.7652e+00 (3.6773e+00) Acc@1  0.22 ( 42.50) Acc@5  0.68 ( 65.76)
[10/11 08:47:37]: Epoch: [169][200/312] Time 2.339 ( 2.262) Data 0.000 ( 0.069) Loss 6.8340e+00 (3.8188e+00) Acc@1  0.07 ( 40.39) Acc@5  0.44 ( 62.51)
```

After such a re-initialization, the supernet gradually fits again if training continues. Could this be caused by the sandwich rule?
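For context, the sandwich rule trains the largest subnet on the hard labels and distills the smallest and a few randomly sampled subnets from its soft predictions in every step, accumulating all gradients into one shared-weight update. Below is a minimal sketch of one such step; the `sample_*_subnet` helpers are hypothetical placeholders, not the actual AttentiveNAS API:

```python
import torch
import torch.nn.functional as F

def sandwich_step(supernet, images, targets, criterion, optimizer, num_random=2):
    """One training step under the sandwich rule (largest + smallest + random subnets)."""
    optimizer.zero_grad()

    # 1) Largest subnet: trained against the ground-truth labels.
    supernet.sample_max_subnet()            # hypothetical helper
    logits_max = supernet(images)
    loss = criterion(logits_max, targets)
    loss.backward()

    # 2) Smallest and random subnets: distilled from the largest subnet's
    #    soft predictions (in-place distillation).
    with torch.no_grad():
        soft_targets = torch.softmax(logits_max, dim=-1)

    samplers = [supernet.sample_min_subnet] + [supernet.sample_random_subnet] * num_random
    for sample in samplers:                 # hypothetical helpers
        sample()
        logits = supernet(images)
        # cross_entropy with probability targets requires PyTorch >= 1.10
        F.cross_entropy(logits, soft_targets).backward()

    # All subnets share one set of weights, so a single divergent batch
    # in any sampled subnet perturbs every other subnet as well.
    optimizer.step()
    return loss.item()
```

Because every sampled subnet updates the same shared weights, one divergent batch can knock the whole supernet back, which would match the sudden accuracy collapse in the logs above.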

dilinwang820 commented 2 years ago

Hi @liwei9719, I noticed you're using a slightly larger batch size; maybe try the default setting, i.e., a total batch size of 2048?

The training instability might be due to weight-sharing, though I have no concrete answer for you. One hack to unblock you: monitor the training curve closely, and if there's a sudden drop, resume from the previous checkpoint with a different random seed.
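A minimal sketch of that workaround, assuming a simple checkpoint layout; `last_good.pth`, `spike_factor`, and the seeding details are illustrative choices, not part of this repo:

```python
import random
import torch

def maybe_recover(model, optimizer, avg_loss, best_loss,
                  save_path="last_good.pth", spike_factor=1.5):
    """Return True if a loss spike was detected and the last good checkpoint restored."""
    if avg_loss > spike_factor * best_loss:
        # Sudden drop detected: roll back to the last healthy state.
        ckpt = torch.load(save_path)
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        # Reseed so the subnet-sampling sequence diverges from the run
        # that led to the collapse.
        new_seed = random.randint(0, 2**31 - 1)
        torch.manual_seed(new_seed)
        random.seed(new_seed)
        return True
    # Loss looks healthy: record this state as the new fallback point.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, save_path)
    return False
```

Calling this once per logging interval with the running average loss would automate the "watch the curve, roll back, reseed" loop described above.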