analogdevicesinc / ai8x-training

Model Training for ADI's MAX78000 and MAX78002 Edge AI Devices
Apache License 2.0

Using --save-sample during training causes confusion matrix to disappear as well as other output #135

Closed · jmenges closed 2 years ago

jmenges commented 2 years ago

Hello,

When using the --save-sample parameter with the training script, the console output appears to be incomplete.

I use the following parameters in my script within the scripts folder:

#!/bin/sh
python train.py --epochs 200 --save-sample 15 --confusion --optimizer Adam --lr 0.001 --deterministic --compress schedule_kws20.yaml --model ai85kwslednet --dataset KWS_LED --device MAX78000 "$@"
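For reference, --save-sample 15 writes the validation input at index 15 to sample_kws_led.npy (the file name comes from the "==> Saving sample" line in the log below). A minimal sketch for sanity-checking the saved file, assuming it is a regular NumPy array:

import numpy as np

# Load the sample written by --save-sample; the file name is taken from the
# training log below.
sample = np.load("sample_kws_led.npy")
print(sample.shape, sample.dtype)  # expect the model's input tensor shape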

This results in the following output during training:

...
Epoch: [0][ 1000/ 1016]    Overall Loss 0.957286    Objective Loss 0.957286                                        LR 0.001000    Time 0.848247    
Epoch: [0][ 1010/ 1016]    Overall Loss 0.954101    Objective Loss 0.954101                                        LR 0.001000    Time 0.848627    
Epoch: [0][ 1016/ 1016]    Overall Loss 0.952487    Objective Loss 0.952487    Top1 76.271186    LR 0.001000    Time 0.848052    
--- validate (epoch=0)-----------
28888 samples (256 per mini-batch)
==> Saving sample at index 15 to sample_kws_led.npy
==> Best [Top1: 0.000   Sparsity:0.00   Params: 1280 on epoch: 0]
Saving checkpoint to: logs/2022.03.18-164941/checkpoint.pth.tar

Training epoch: 259997 samples (256 per mini-batch)
Epoch: [1][   10/ 1016]    Overall Loss 0.661458    Objective Loss 0.661458                                        LR 0.001000    Time 1.023104    
Epoch: [1][   20/ 1016]    Overall Loss 0.648343    Objective Loss 0.648343                                        LR 0.001000    Time 0.911043
...

Compare this to the output of the same training script when the --save-sample argument is omitted:

...
Epoch: [0][ 1010/ 1016]    Overall Loss 0.954101    Objective Loss 0.954101                                        LR 0.001000    Time 0.836035    
Epoch: [0][ 1016/ 1016]    Overall Loss 0.952487    Objective Loss 0.952487    Top1 76.271186    LR 0.001000    Time 0.835643    
--- validate (epoch=0)-----------
28888 samples (256 per mini-batch)
Epoch: [0][   10/  113]    Loss 0.664605    Top1 70.507812    
Epoch: [0][   20/  113]    Loss 0.684539    Top1 71.464844    
Epoch: [0][   30/  113]    Loss 0.696856    Top1 71.315104    
Epoch: [0][   40/  113]    Loss 0.694776    Top1 71.152344    
Epoch: [0][   50/  113]    Loss 0.690818    Top1 70.812500    
Epoch: [0][   60/  113]    Loss 0.698156    Top1 70.957031    
Epoch: [0][   70/  113]    Loss 0.693342    Top1 70.987723    
Epoch: [0][   80/  113]    Loss 0.685269    Top1 71.127930    
Epoch: [0][   90/  113]    Loss 0.684183    Top1 71.219618    
Epoch: [0][  100/  113]    Loss 0.685887    Top1 71.183594    
Epoch: [0][  110/  113]    Loss 0.681640    Top1 71.214489    
Epoch: [0][  113/  113]    Loss 0.677707    Top1 71.209499    
==> Top1: 71.209    Loss: 0.678

==> Confusion:
[[  900    50    43    17    44]
 [   84   591   336    10    54]
 [   28   193   756     5    78]
 [   34     5     2   873   123]
 [ 1339   719   710  4443 17451]]

==> Best [Top1: 71.209   Sparsity:0.00   Params: 1280 on epoch: 0]
Saving checkpoint to: logs/2022.03.18-171610/checkpoint.pth.tar

Training epoch: 259997 samples (256 per mini-batch)
Epoch: [1][   10/ 1016]    Overall Loss 0.661458    Objective Loss 0.661458                                        LR 0.001000    Time 1.138955    
Epoch: [1][   20/ 1016]    Overall Loss 0.648343    Objective Loss 0.648343                                        LR 0.001000    Time 0.996905    
Epoch: [1][   30/ 1016]    Overall Loss 0.657448    Objective Loss 0.657448                                        LR 0.001000    Time 0.947590    
Epoch: [1][   40/ 1016]    Overall Loss 0.665805    Objective Loss 0.665805                                        LR 0.001000    Time 0.921748    
...

With --save-sample, the per-batch validation lines, the Top1 summary, and the confusion matrix are all missing, and the best Top1 is reported as 0.000. Is this the expected behaviour?

BR, Jonas

Jake-Carter commented 2 years ago

Hi @jmenges, thank you for reporting. This is unexpected behavior.

I've just opened PR #136; pending approval from my more senior colleagues, the fix will be merged into the develop branch.
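For anyone reading along on an older checkout: the symptom is consistent with the sample-saving branch returning out of the validation loop before any statistics are accumulated, which would explain both the missing confusion matrix and the Best [Top1: 0.000 ...] line. Below is a minimal, self-contained sketch of that failure mode and of the fix; validate_buggy, validate_fixed, and save_sample are illustrative names, not the actual train.py code:

import numpy as np

def save_sample(inputs, index, path="sample_kws_led.npy"):
    # Persist a single validation input, as --save-sample does.
    np.save(path, inputs[index])

def validate_buggy(batches, sample_index=None):
    correct = total = 0
    for batch_idx, (inputs, labels, preds) in enumerate(batches):
        if sample_index is not None and batch_idx == 0:
            save_sample(inputs, sample_index)
            return None  # BUG: bails out before Top1/confusion are computed,
                         # so "Best [Top1: 0.000 ...]" is reported
        correct += int(np.sum(preds == labels))
        total += len(labels)
    return correct / total

def validate_fixed(batches, sample_index=None):
    correct = total = 0
    for batch_idx, (inputs, labels, preds) in enumerate(batches):
        if sample_index is not None and batch_idx == 0:
            save_sample(inputs, sample_index)  # save, then keep validating
        correct += int(np.sum(preds == labels))
        total += len(labels)
    return correct / total

# Toy usage: random "batches" of (inputs, labels, predictions)
rng = np.random.default_rng(0)
batches = [(rng.normal(size=(16, 128)), rng.integers(0, 5, 16),
            rng.integers(0, 5, 16)) for _ in range(4)]
print(validate_buggy(batches, sample_index=15))  # None: all metrics lost
print(validate_fixed(batches, sample_index=15))  # a real Top1 fraction

The fix is simply to fall through after saving instead of returning, so the rest of the epoch's validation statistics are still collected.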

jmenges commented 2 years ago

Hello @Jake-Carter,

thank you for the quick fix.

BR

ryalberti commented 1 year ago

Hello,

I'm training an object-detection model and I'm experiencing a related issue. When I add --save-sample, my mAP goes to 0.0000. Attached (ex 1, ex 2) are some of my tests. Additionally, I tried the changed train file from PR #136 and I am still seeing a lower mAP than in previous runs (ex 3). Is object detection handled differently here?

Thanks for the help thus far!

sample_examples.zip
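Until the object-detection path is confirmed fixed, one possible workaround is to keep --save-sample out of training runs entirely and generate the sample afterwards from a saved checkpoint in evaluate mode, in the spirit of the repo's evaluate scripts. The sketch below reuses the KWS flags from the command above for concreteness; the checkpoint path is taken from the training log above and should be adapted to your run:

#!/bin/sh
# Sketch: write the sample from a finished checkpoint instead of during
# training, so the training log's validation metrics stay intact.
# Adapt the model, dataset, and checkpoint path to your own run.
python train.py --model ai85kwslednet --dataset KWS_LED --confusion --evaluate \
  --exp-load-weights-from logs/2022.03.18-164941/checkpoint.pth.tar \
  --save-sample 15 --device MAX78000 "$@"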