deepinx / deep-face-alignment

The MXNet Implementation of Stacked Hourglass and Stacked SAT for Robust 2D and 3D Face Alignment

spikes #1

Open notconvergingwtf opened 5 years ago

notconvergingwtf commented 5 years ago

Hi, do you have any suggestions on the following problem? While training sdu (nadam, lr=0.00025), this is the loss on the validation set: [image: validation loss plot]. A different model trained on the same data was fine. Also, while training, loss value = nan starts to appear.

deepinx commented 5 years ago

I just set network.sdu.net_coherent = True and revised line 579 of sym_heatmap.py to coherent_weight = 0.001; this seems to solve the nan problem.
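
For anyone else hitting this: here is a minimal sketch of what that change amounts to, as I read it. The function below is illustrative only, not the actual code in sym_heatmap.py; it just shows why a small coherent_weight helps.

# Hypothetical illustration (not the repo's code): coherent_weight scales the
# coherence term before it is added to the heatmap loss. With a small weight
# (0.001) the term stays a mild regularizer and cannot dominate the gradients,
# which is what appears to drive the loss to nan.
def total_loss(heatmap_loss, coherent_loss, coherent_weight=0.001):
    return heatmap_loss + coherent_weight * coherent_loss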

notconvergingwtf commented 5 years ago

Okay, thanks. Sorry, but how did you manage to figure this out? It seems that network.sdu.net_coherent = True means keeping only the image transformations that don't affect the heatmap? How does this affect accuracy?

deepinx commented 5 years ago

I did this following the guidance of the original paper, which says: "Therefore, we employ the CE loss for L_{p-g} and the MSE loss for L_{p-p}, respectively. λ is empirically set as 0.001 to guarantee convergence."
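
To spell out my reading of that quote (this is my interpretation, not code from this repo): cross-entropy is used for the prediction-to-ground-truth term L_{p-g}, MSE for the prediction-to-prediction term L_{p-p}, and λ = 0.001 weights the second term, i.e. the same role coherent_weight plays above. A sketch under those assumptions:

# Sketch of the combined objective as I understand the paper; the binary-CE
# form, tensor shapes, and normalisation are assumptions, not the repo's code.
import mxnet.ndarray as nd

def sat_loss(pred_heatmap, gt_heatmap, pred_a, pred_b, lam=0.001):
    # CE between predicted and ground-truth heatmaps (L_{p-g}),
    # assuming pred_heatmap is already in (0, 1)
    ce = -nd.mean(gt_heatmap * nd.log(pred_heatmap + 1e-12)
                  + (1.0 - gt_heatmap) * nd.log(1.0 - pred_heatmap + 1e-12))
    # MSE between two predictions of the same landmarks (L_{p-p})
    mse = nd.mean(nd.square(pred_a - pred_b))
    return ce + lam * mse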

notconvergingwtf commented 5 years ago

Big thanks

notconvergingwtf commented 5 years ago

Hi, it's me again. After some training time, here is what I have: [image: loss curve with spikes]. It doesn't look like overfitting on the training set; maybe there is some problem with convergence. Have you run into the same issue?

deepinx commented 5 years ago

What batch size and lr do you use? You can try a different batch size or lr; perhaps that will solve your problem.

notconvergingwtf commented 5 years ago

Batch size is 16. The lr's are 1e-10 and 2e-6 (on the screenshot). Well, as you can see, decreasing the lr only delays the time until the spikes appear.

deepinx commented 5 years ago

I used batch size 16 and lr 0.00002 for the first several epochs, and the spikes did not appear. You can try the following commands:

NETWORK='sdu'
MODELDIR='./model_2d'
mkdir -p "$MODELDIR"
PREFIX="$MODELDIR/$NETWORK"
LOGFILE="$MODELDIR/log_$NETWORK"

CUDA_VISIBLE_DEVICES='0' python -u train.py --network $NETWORK --prefix "$PREFIX" --per-batch-size 16 --lr 0.00002 --lr-step '16000,24000,30000' > "$LOGFILE" 2>&1 &
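
In case it helps to see what those flags imply: below is my reading of the --lr-step schedule, just to show where the lr drops land. The 0.1 decay factor is an assumption on my part, not taken from train.py, so please check the actual scheduler there.

# Hypothetical illustration of the step schedule implied by --lr-step above;
# the 0.1 decay factor per boundary is a guess, not the repo's confirmed value.
def lr_at(step, base_lr=0.00002, boundaries=(16000, 24000, 30000), factor=0.1):
    lr = base_lr
    for b in boundaries:
        if step >= b:
            lr *= factor
    return lr

print(lr_at(15000))  # 2e-05  (before the first boundary)
print(lr_at(20000))  # 2e-06  (after step 16000)
print(lr_at(31000))  # 2e-08  (after all three boundaries)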

If this problem still appears, you may check the network parameters in config.py.