Why we choose 'UNKNOWN' weight's 0.06

analogdevicesinc / ai8x-training

Model Training for ADI's MAX78000 and MAX78002 Edge AI Devices

Apache License 2.0


Why we choose 'UNKNOWN' weight's 0.06 #193

Closed Doruk-Dilmen closed 1 year ago

Doruk-Dilmen commented 1 year ago

Hello,

I am working on KWS. I need to ask something about training.

First, how is this value (the 0.06 weight for 'UNKNOWN') chosen, and does it have any meaning? Also, in KWS the input shape is (514,64); why is it not (128,128)? [Screenshot from 2022-11-24 11-27-41]

Second, when I use the script scripts/train_kws20_v3.sh, training proceeds as shown below.

But when I use these arguments:

"args": [
    "--epochs", "50",
    "--optimizer", "Adam",
    "--lr", "0.005",
    "--deterministic",
    "--compress", "policies/schedule_kws20.yaml",
    "--model", "ai85kws20netv3",
    "--dataset", "KWS_20",
    "--confusion",
    "--device", "MAX78000",
    "--enable-tensorboard"
]

no training takes place.

The only difference is --wd: the script passes --wd and my arguments do not. Why does that matter? Can you explain, please?

aniktash commented 1 year ago

The weight values balance the effect of different numbers of samples per class when training with cross-entropy loss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#crossentropyloss). In particular, the unknown category has many more samples than the other classes, and its weight scales that category's contribution down. To be more precise, you can make each weight inversely proportional to the number of samples in its class.
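As a rough illustration of the inverse-proportional weighting described above, here is a minimal sketch in plain Python. The function name and the sample counts are hypothetical and are not taken from the KWS dataset or the training code; the weights are rescaled so the smallest class gets 1.0.

```python
def inverse_frequency_weights(class_counts):
    """Return one weight per class, inversely proportional to its sample count.

    The weights are rescaled so that the smallest class has weight 1.0.
    """
    if any(count <= 0 for count in class_counts):
        raise ValueError("every class needs at least one sample")
    smallest = min(class_counts)
    return [smallest / count for count in class_counts]


# Hypothetical counts: three keyword classes plus a much larger "unknown" class.
counts = [2000, 2000, 2000, 32000]
weights = inverse_frequency_weights(counts)
print(weights)
```

A list like this can be wrapped in torch.tensor(...) and passed as the weight argument of torch.nn.CrossEntropyLoss, which is the mechanism the linked PyTorch page documents.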
Initially, KWS (6 keywords) was trained on the MFCC of the audio data rather than on raw audio, and the MFCC output was 512x64. That also required implementing the MFCC on the ARM, which consumes more power and adds delay. KWS-20 is trained directly on audio samples with a 128x128 input shape.

aniktash commented 1 year ago

--wd is the weight decay parameter for the SGD optimizer (https://pytorch.org/docs/stable/generated/torch.optim.SGD.html). The PyTorch default for weight decay is 0. However, the training script defaults to 1e-4, which does not seem suitable for training KWS. You may want to set it to 0.
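To make the effect of weight decay concrete, here is a minimal plain-Python sketch of a single SGD update in which the decay term is added to the gradient, which is how PyTorch's SGD applies weight_decay when momentum is not used. The function name and the numbers are hypothetical; this is not the training script's code.

```python
def sgd_step(weights, grads, lr, weight_decay=0.0):
    """One plain SGD update: w <- w - lr * (grad + weight_decay * w)."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


# Hypothetical parameters with zero gradient, so only the decay term acts.
weights = [0.5, -0.25]
grads = [0.0, 0.0]

no_decay = sgd_step(weights, grads, lr=0.005, weight_decay=0.0)
with_decay = sgd_step(weights, grads, lr=0.005, weight_decay=1e-4)

print(no_decay)    # weights are left unchanged
print(with_decay)  # each weight is pulled slightly toward zero
```

With weight_decay set to 0 the update depends only on the loss gradient; any nonzero value adds a steady pull toward zero on every step. In PyTorch the same setting corresponds to the weight_decay argument of torch.optim.SGD.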
