JaminFong / FNA

Fast Neural Network Adaptation via Parameter Remapping and Architecture Search (ICLR2020 & TPAMI)
Apache License 2.0

Architecture is collapsing during search #15

Closed Hrayo712 closed 3 years ago

Hrayo712 commented 3 years ago

Hello @JaminFong !

I am currently running the architecture adaptation for RetinaNet. However, after the 8 training epochs, every time the model info is reported (every 1000 iterations), I see that the majority of the architecture becomes skip connections.

See below:

2021-01-03 12:27:23,628 - Epoch(train)[9][1000/58625] lr: 0.00200, eta: 3 days, 14:12:43, time: 0.416, data_time: 0.233, memory: 5887, loss_cls: 1.1989, loss_reg: 0.5154, loss: 1.7143, sub_obj: 4.7872
2021-01-03 12:27:23,884 - alpha_weights
tensor([[0.1771, 0.1634, 0.1746, 0.1596, 0.1710, 0.1543]], device='cuda:0', grad_fn=)
tensor([[0.1483, 0.1398, 0.1457, 0.1356, 0.1422, 0.1300, 0.1585],
        [0.1483, 0.1398, 0.1457, 0.1356, 0.1422, 0.1300, 0.1585],
        [0.1483, 0.1398, 0.1457, 0.1356, 0.1421, 0.1300, 0.1585]], device='cuda:0', grad_fn=)
tensor([[0.1715, 0.1646, 0.1706, 0.1631, 0.1694, 0.1609]], device='cuda:0', grad_fn=)
tensor([[0.1450, 0.1414, 0.1442, 0.1399, 0.1430, 0.1378, 0.1488],
        [0.1450, 0.1414, 0.1442, 0.1399, 0.1430, 0.1378, 0.1488],
        [0.1450, 0.1414, 0.1442, 0.1399, 0.1430, 0.1378, 0.1488]], device='cuda:0', grad_fn=)
tensor([[0.1688, 0.1655, 0.1685, 0.1649, 0.1681, 0.1642]], device='cuda:0', grad_fn=)
tensor([[0.1443, 0.1410, 0.1439, 0.1403, 0.1433, 0.1392, 0.1479],
        [0.1443, 0.1410, 0.1439, 0.1403, 0.1433, 0.1392, 0.1479],
        [0.1443, 0.1410, 0.1439, 0.1403, 0.1433, 0.1392, 0.1479]], device='cuda:0', grad_fn=)
tensor([[0.1704, 0.1647, 0.1698, 0.1637, 0.1690, 0.1623]], device='cuda:0', grad_fn=)
tensor([[0.1455, 0.1384, 0.1449, 0.1373, 0.1440, 0.1358, 0.1540],
        [0.1455, 0.1384, 0.1449, 0.1373, 0.1440, 0.1358, 0.1540],
        [0.1455, 0.1384, 0.1449, 0.1373, 0.1440, 0.1358, 0.1540]], device='cuda:0', grad_fn=)
tensor([[0.1704, 0.1636, 0.1702, 0.1632, 0.1699, 0.1626]], device='cuda:0', grad_fn=)
tensor([[0.1445, 0.1396, 0.1443, 0.1392, 0.1439, 0.1385, 0.1500],
        [0.1445, 0.1396, 0.1443, 0.1392, 0.1439, 0.1385, 0.1500],
        [0.1445, 0.1396, 0.1443, 0.1392, 0.1439, 0.1385, 0.1500]], device='cuda:0', grad_fn=)
tensor([[0.1723, 0.1621, 0.1720, 0.1615, 0.1714, 0.1606]], device='cuda:0', grad_fn=)
tensor([], device='cuda:0', size=(0, 7), grad_fn=)
2021-01-03 12:27:23,891 - [[32, 16], ['k3_e1'], 1]| [[16, 24], ['k3_e3', 'skip', 'skip', 'skip'], 2]| [[24, 32], ['k3_e3', 'skip', 'skip', 'skip'], 2]| [[32, 64], ['k3_e3', 'skip', 'skip', 'skip'], 2]| [[64, 96], ['k3_e3', 'skip', 'skip', 'skip'], 1]| [[96, 160], ['k3_e3', 'skip', 'skip', 'skip'], 2]| [[160, 320], ['k3_e3'], 1]
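To make the "collapse" visible in the log above: in a DARTS-style search, each candidate op in a layer carries a softmax weight (alpha), and the derived network keeps the op with the largest weight. A minimal illustrative sketch (the names here are mine, not the repo's API; the op order is an assumption about the MBConv search space, with 'skip' as the last entry):

```python
import torch

# Assumed candidate-op order; the 7th entry is the skip connection.
OPS = ['k3_e3', 'k3_e6', 'k5_e3', 'k5_e6', 'k7_e3', 'k7_e6', 'skip']

def derive_ops(weights):
    """Pick the highest-weight candidate op for each searchable layer."""
    return [OPS[i] for i in weights.argmax(dim=-1).tolist()]

# One of the 7-column rows from the log: the last column ('skip') is the
# largest, so this layer is derived as a skip connection.
row = torch.tensor([[0.1483, 0.1398, 0.1457, 0.1356, 0.1422, 0.1300, 0.1585]])
print(derive_ops(row))  # -> ['skip']
```

In the log, every 7-column row peaks on its last column, which is why all searchable layers except the first of each stage come out as 'skip'.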

I am running with the default configuration as it is in the repo. As the seed network for architecture adaptation, I am using the seed network provided in the model zoo (https://drive.google.com/drive/folders/1XW0NxkLckKQ68s6V7nf7vF4qe1WsL3GE) (seed_mbv2.pt). Furthermore, I am using the COCO 2017 dataset as provided on the official website.

I am also running the code with PyTorch 1.1, mmdet 0.6.0 (53c647e), mmcv 0.2.10, and Python 3.6.8.

Do you know what might be causing this behavior, or what I can do to fix it?

Thanks in advance for your help!

JaminFong commented 3 years ago

Hi, this phenomenon is probably due to not running in distributed mode. Please follow the suggestion in https://github.com/JaminFong/FNA/issues/14 if needed.
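For reference, mmdetection-based repos are typically launched in distributed mode through PyTorch's distributed launcher. The invocation below is only a rough sketch; the script path, config name, and GPU count are illustrative, so check the repo's README for the exact command:

```shell
# Illustrative only: run the search with one process per GPU via
# torch.distributed.launch (paths and configs vary per repo).
python -m torch.distributed.launch --nproc_per_node=8 \
    train.py my_retinanet_search_config.py --launcher pytorch
```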

Hrayo712 commented 3 years ago

Ok. I'll give it a go, and let you know. Thanks!

Hrayo712 commented 3 years ago

Hey @JaminFong

It is working now. I had also misconfigured the batch size, so the weights were not sufficiently trained by the time architecture optimization started.

I finished the 14 epochs, but I did not find the same architecture as you did. I suppose variations in the setup might explain this. For instance, I am searching on a single RTX 8000 GPU with batch size 8.

One more question: I see that the current architecture is reported every 1000 iterations. Is the architecture you report in the paper the last one found by the script after the 14 epochs are done, or did you evaluate several candidates? If so, how many, and based on what criteria?

Thanks!

JaminFong commented 3 years ago

Hi, it is better to run the experiment with the same batch size as in the paper; too small a batch size will hurt performance. The NAS algorithm may not find the same architecture in each independent run, as there are many non-unique solutions in the search space, but the performance of the searched architecture should be similar. By the way, we always evaluate the last architecture found by the script. You could also evaluate more candidates if needed. I hope this answer helps!

Hrayo712 commented 3 years ago

Thanks, it really helped. I managed to get a similar architecture with similar performance: 33.8 mAP at 133.07 MAdds.

Keep up the great work! Looking forward to FNA++ 👍