THUFutureLab / gluon-face

An unofficial Gluon FR Toolkit for face recognition. https://gluon-face.readthedocs.io
MIT License

Inf in label and predictions using float16 #28

Closed · wms2537 closed this issue 4 years ago

wms2537 commented 5 years ago

The reason float16 yields NaN is that the number of classes exceeds the range of float16. Thus, some of the labels become inf and cause the network to diverge. Some of the predictions are also inf. Any solutions?
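
For illustration (a minimal sketch using the same ~80000-class setup discussed below): float16 can represent values only up to 65504, so label indices near 80000 overflow to inf as soon as they are cast down.

```python
from mxnet import nd

label_fp32 = nd.array([79998.0, 79997.0], dtype="float32")
print(label_fp32)                     # 79998 and 79997, exact in float32
print(label_fp32.astype("float16"))   # both entries become inf: 79998 > 65504 (float16 max)
```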

haoxintong commented 5 years ago

Thank you for sharing this. To avoid inf values in the label, there are two ways:

  1. Use a one-hot label to compute the loss (a fuller sketch follows this list):

    from mxnet import nd

    label = nd.array([79998.0, 79997.0], dtype="float32")
    label_oh = nd.one_hot(label, depth=80000, on_value=1, off_value=0, dtype="float16")
  2. Multi-precision training. I'm not sure about this; I will give it a try.
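
For option 1, here is a fuller sketch of how the float16 one-hot label can be fed to a loss that accepts dense labels. The `SoftmaxCrossEntropyLoss(sparse_label=False)` below is only a stand-in to show the shapes and dtypes, and `pred` stands in for the network output; in this project's case the arcloss would sit here instead, which is what the rest of this thread works through.

```python
from mxnet import nd, gluon

num_classes = 80000
label = nd.array([79998.0, 79997.0], dtype="float32")            # sparse labels stay float32
label_oh = nd.one_hot(label, depth=num_classes,
                      on_value=1, off_value=0, dtype="float16")  # dense float16 one-hot

pred = nd.random.normal(shape=(2, num_classes), dtype="float16")  # stand-in for network output
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss(sparse_label=False)  # accepts one-hot labels
loss = loss_fn(pred, label_oh)
```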

If you have any advice, you are welcome to share it with us. Thanks again!

wms2537 commented 5 years ago

So the arcloss computation can be done by one-hot encoding the float32 labels and then computing the loss in float16? Do I need to cast any other values to float32?

haoxintong commented 5 years ago

I think that is OK for computing the loss. But there are still some other problems with mixed-precision training; I can't give exact details, as I have never trained an fp16 face recognition model before.

For best practices, you can check this blog from the NVIDIA developer site.
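
As a rough sketch of what multi-precision training looks like in Gluon (the backbone below is just a model-zoo stand-in, not this project's network, and the hyperparameters are illustrative): the forward and backward passes run in float16, while `multi_precision=True` keeps a float32 master copy of the weights in the optimizer.

```python
import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0)
net = gluon.model_zoo.vision.resnet50_v2(classes=80000)  # stand-in backbone
net.initialize(mx.init.Xavier(), ctx=ctx)
net.cast("float16")                                      # run forward/backward in float16

trainer = gluon.Trainer(
    net.collect_params(), "sgd",
    {"learning_rate": 0.1, "momentum": 0.9, "wd": 5e-4,
     "multi_precision": True},                           # float32 master weights
)
```

The NVIDIA post mentioned above also recommends loss scaling so that small gradients do not flush to zero in float16.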

wms2537 commented 5 years ago

Following your recommendation, I one-hot encoded the labels first and cast them to float16. However, in this line in arcloss: `cos_t = F.pick(pred, label, axis=1)`, the labels need to be 1-D, right? I changed it to `cos_t = F.pick(pred, F.argmax(label, axis=-1), axis=1)`, but after the argmax operation the result becomes inf again. What I then tried is `cos_t = F.pick(pred, F.argmax(label.astype('float32', copy=False), axis=-1), axis=1).astype(self._dtype, copy=False)`, but the loss and network output become NaN after a few iterations. Is there any problem with this? I am training on 1 GPU with a batch size of 1024 and a learning rate of 0.1.
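
Written out as a standalone snippet (a sketch of the same workaround outside the loss class, with `pred` standing in for the fp16 network output): the float32 cast matters because `argmax` on a float16 input returns float16 indices, so an index near 80000 overflows back to inf.

```python
from mxnet import nd

num_classes = 80000
label_oh = nd.one_hot(nd.array([79998.0, 79997.0]), depth=num_classes, dtype="float16")
pred = nd.random.normal(shape=(2, num_classes), dtype="float16")   # stand-in for fp16 logits

# cast the one-hot label to float32 before argmax so the recovered index stays finite,
# then pick the target logit and cast back to the working dtype
index = nd.argmax(label_oh.astype("float32", copy=False), axis=-1)
cos_t = nd.pick(pred, index, axis=1).astype("float16", copy=False)
```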

wms2537 commented 5 years ago

I tried reducing the learning rate to 1e-3; the training loss decreases, but the training accuracy stays at 0.

haoxintong commented 5 years ago

I am not sure if it is caused by precision. Even when training arcloss in float32, the network output can be NaN in the beginning epochs; this loss is not stable. Now I always do some warm-up training before switching the loss to arcloss, and this switch usually happens once the network has reached 0.97~0.99 on LFW. So maybe you can use softmax first and check whether it works well, or whether it behaves just the same as arcloss.
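
A rough sketch of that schedule, assuming a partially trained `net`, gluon-face's `ArcLoss(classes, m, s)` signature, and hypothetical `train_one_epoch`/`evaluate_lfw` helpers (the threshold and epoch count are illustrative):

```python
from mxnet import gluon
from gluonfr.loss import ArcLoss  # gluon-face loss; exact signature assumed here

num_classes, num_epochs = 80000, 30
softmax_loss = gluon.loss.SoftmaxCrossEntropyLoss()  # plain softmax for warm-up
arc_loss = ArcLoss(num_classes, m=0.5, s=64)

loss_fn = softmax_loss
for epoch in range(num_epochs):
    train_one_epoch(net, loss_fn)                    # hypothetical training step
    # switch to arcloss only once the model is already strong on LFW
    if loss_fn is softmax_loss and evaluate_lfw(net) >= 0.97:
        loss_fn = arc_loss
```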

wms2537 commented 5 years ago

I tried, but the l2softmax loss also diverges to NaN.

wms2537 commented 5 years ago

Just for your information: I lowered alpha for l2softmax and margin_s for arcloss to 32; the network trains as usual, but accuracy reaches only about 98 percent. With l2softmax alpha increased to 48 the network still trains normally, but increasing margin_s of arcloss results in immediate divergence after a few epochs. Any idea whether tuning these two values is a solution?

I'm training on the deep_glint dataset, as I cannot find the emore dataset.

haoxintong commented 5 years ago

The emore dataset has been renamed MS1M-ArcFace. For arcloss, I was also troubled by divergence when switching the loss function. The model often reached a result of only 0.98, while this was never a problem in PistonY's training.

I think the reason lies in the training details. The arcloss in this project was adapted from an early version of insightface; they have since simplified the loss by using the arccos function and removing some hyperparameters. Besides that, there are other differences such as the network architecture, warm-up strategy, data loading pipeline, etc. It is not easy to find the key factor for training a SOTA model.
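
For reference, a sketch of that simplified margin in plain `nd` ops (not this project's implementation): take arccos of the target logit, add the angular margin, take the cosine again, and scale.

```python
from mxnet import nd

m, s, num_classes = 0.5, 64.0, 80000
label = nd.array([79998, 79997])
pred = nd.random.uniform(-1, 1, shape=(2, num_classes))  # stand-in for cosine logits

cos_t = nd.pick(pred, label, axis=1)                 # cos(theta) for the target class
theta_m = nd.arccos(nd.clip(cos_t, -1.0, 1.0)) + m   # add the angular margin to theta
target_logit = nd.cos(theta_m)
one_hot = nd.one_hot(label, depth=num_classes)
pred = (pred + (target_logit - cos_t).expand_dims(1) * one_hot) * s  # replace target logit, scale
```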

I can hardly help right now due to my job change, as I don't have enough time or computing resources. So I recommend checking the related issues of the insightface project; there is more discussion about the training details there. If you find the best practice, a pull request is welcome. :D

wms2537 commented 5 years ago

I've successfully trained arcface to 0.994667±0.003559 on LFW by slowly increasing the margin `m` of arcloss, with 10 epochs of l2softmax pretraining. The margin_s I used is 16.
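
A sketch of that recipe, assuming gluon-face's `L2Softmax(classes, alpha)` and `ArcLoss(classes, m, s)` signatures, a partially trained `net`, and a hypothetical `train_one_epoch` helper; the ramp values are only illustrative.

```python
from gluonfr.loss import ArcLoss, L2Softmax  # exact signatures assumed here

num_classes, num_epochs = 80000, 30
margin_ramp = {10: 0.1, 15: 0.2, 20: 0.3, 25: 0.4, 28: 0.5}  # epoch -> m, illustrative

loss_fn = L2Softmax(num_classes, alpha=32)   # 10 epochs of l2softmax pretraining
for epoch in range(num_epochs):
    if epoch in margin_ramp:
        loss_fn = ArcLoss(num_classes, m=margin_ramp[epoch], s=16)  # s=16 as reported above
    train_one_epoch(net, loss_fn)            # hypothetical training step
```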

haoxintong commented 5 years ago

Glad to hear that. Thanks for sharing it with us!

PistonY commented 4 years ago

@wms2537 Did you successfully train this with FP16? Please discuss this at #38. I'll close this issue.