notconvergingwtf opened this issue 5 years ago
I just set network.sdu.net_coherent = True and revised line 579 of sym_heatmap.py to coherent_weight = 0.001; it seems this solves the NaN problem.
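For reference, the two changes look roughly like this (the surrounding structure is a sketch assuming the usual easydict-style config.py in this repo; only the two values come from my change above):

# config.py -- enable the coherent branch (layout assumed, not verbatim)
from easydict import EasyDict as edict

network = edict()
network.sdu = edict()
network.sdu.net_coherent = True      # turn on the coherent (consistency) loss branch

# sym_heatmap.py, around line 579 -- down-weight the coherent loss term
coherent_weight = 0.001              # small weight so the extra term cannot blow up the loss
# e.g. total_loss = heatmap_loss + coherent_weight * coherent_loss   (illustrative)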
Okay, thanks
Sorry, but how did you manage to figure this out? It seems that network.sdu.net_coherent = True means keeping only those image transformations that don't affect the heatmap? How does this affect accuracy?
I did this following the guidance of the original paper, which says: "Therefore, we employ the CE loss for L_{p-g} and the MSE loss for L_{p-p}, respectively. λ is empirically set as 0.001 to guarantee convergence."
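In other words, the total objective combines a cross-entropy term and an MSE term, with one of them scaled by λ = 0.001 (presumably what coherent_weight implements). A minimal NumPy sketch of that weighting; the function name, tensor shapes, and the choice of which term λ multiplies are my assumptions, not the repo's code:

import numpy as np

def total_loss(logits, labels, pred_heatmap, gt_heatmap, lam=0.001):
    # CE loss for L_{p-g}: softmax cross-entropy over the class dimension
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # MSE loss for L_{p-p}: pixel-wise error between predicted and target heatmaps
    mse = np.mean((pred_heatmap - gt_heatmap) ** 2)
    # lambda = 0.001 keeps the weighted term small, which helps convergence
    return mse + lam * ce   # assumption: lambda weights the CE (coherent) term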
Big thanks
Hi, it's me again. After some training time, here is what I have (screenshot attached): it doesn't look like overfitting on the training set, so maybe there are some problems with convergence. Have you met the same problem?
What batch size and lr do you use? You can try a different batch size or lr; perhaps that will solve your problem.
Batch size is 16. The learning rates are 1e-10 and 2e-6 (on the screenshot). Well, as you can see, decreasing the lr only delays the point at which the spikes appear.
I used batch size 16 and lr 0.00002 for the first several epochs. The spikes did not appear. You can try the following commands:
NETWORK='sdu'
MODELDIR='./model_2d'
mkdir -p "$MODELDIR"
PREFIX="$MODELDIR/$NETWORK"
LOGFILE="$MODELDIR/log_$NETWORK"
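# single-GPU run, batch size 16, initial lr 2e-5, lr decayed at the given steps; output logged and job backgrounded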
CUDA_VISIBLE_DEVICES='0' python -u train.py --network $NETWORK --prefix "$PREFIX" --per-batch-size 16 --lr 0.00002 --lr-step '16000,24000,30000' > "$LOGFILE" 2>&1 &
If this problem still appears, you may check the network parameters in config.py.
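For example, the values worth cross-checking against what was discussed above (attribute names are my guess at the config.py layout, not verbatim; the numbers are the ones from this thread):

# config.py sanity check -- a sketch, names illustrative, values from this thread
from easydict import EasyDict as edict

config = edict()
config.per_batch_size = 16                 # batch size that trained here without spikes
config.lr = 0.00002                        # initial learning rate for the first epochs
config.lr_steps = [16000, 24000, 30000]    # should match --lr-step in the command above

network = edict()
network.sdu = edict()
network.sdu.net_coherent = True            # coherent branch enabled, as discussed above
# plus coherent_weight = 0.001 around line 579 of sym_heatmap.py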
Hi, do you have any suggestions on the next problem? While training sdu (nadam, lr=0.00025), this is the loss on the validation set (see screenshot). A different model on the same training data was fine. Also, while training, lossvalue=nan starts to appear.