rose-jinyang closed this issue 2 years ago
Hi, I tried several times, changing the batch size. With a batch size of 16, I got the following memory allocation error. I think there is an issue in calculating the AdaCos loss. Please let me know how to fix this. Thanks
Hi. When the number of classes is 8631, I found that the theta_class tensor has the following shape: theta_class: Tensor("ada_cos_1/theta_class:0", shape=(?, 8631, 8631), dtype=float32). Is that right? If so, with a batch size of 16 the size of this tensor alone is greater than 4.7 GB.
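For reference, a quick back-of-the-envelope check of that tensor's size, assuming float32 (4 bytes per element) and the shapes from the log above:

```python
# Rough size of a float32 tensor of shape (batch, C, C),
# with batch=16 and C=8631 classes as reported in the log.
batch, num_classes, bytes_per_float = 16, 8631, 4
size_gb = batch * num_classes ** 2 * bytes_per_float / 1e9
print(f"{size_gb:.2f} GB")  # about 4.77 GB, matching the ">4.7 GB" estimate
```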
Hi! Thank you for asking. I hope you're all right.
I ran it on Google Colab; the TensorFlow version at the time was 1.15.
Looking at your logs, it certainly looks like you have enough GPU memory, so I also don't immediately understand why the OOM happens.
In response to your last question: the shape of the tensor you asked about is correct. I'm sorry I can't say for sure, but it also seems like the batch size may still be too large.
Thanks for your quick reply. You did not use the margin parameter m in the AdaCos layer. I implemented the AdaCos loss as follows:
def call(self, inputs):
    x, y = inputs
    # normalize features
    x = tf.nn.l2_normalize(x, axis=1)
    # normalize weights
    W = tf.nn.l2_normalize(self.W, axis=0)
    # dot product of normalized vectors = cosine similarity
    logits = x @ W
    # clip logits to prevent zero division in the backward pass
    theta = tf.acos(K.clip(logits, -1.0 + K.epsilon(), 1.0 - K.epsilon()))
    # add the angular margin m to the target-class angles
    target_logits = tf.cos(theta + self.m)
    output = logits * (1 - y) + target_logits * y
    # B_avg: batch mean of the summed exp(s * cos(theta)) over non-target classes
    B_avg = tf.where(y < 1, tf.exp(self.s * logits), tf.zeros_like(logits))
    B_avg = tf.reduce_mean(tf.reduce_sum(B_avg, axis=1), name='B_avg')
    # median angle over the batch
    theta_med = tfp.stats.percentile(theta, 50.0, interpolation='midpoint')
    # adaptive scale update
    with tf.control_dependencies([theta_med, B_avg]):
        self.s = tf.math.log(B_avg) / tf.cos(tf.minimum(math.pi / 4, theta_med))
        output *= self.s
    output = tf.nn.softmax(output)
    return output
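To make the scale update above easier to check, here is a minimal NumPy sketch of the same computation on a toy batch. The shapes and random data are hypothetical, just for illustration; note that the AdaCos paper computes the median over the target-class angles only, whereas the snippet above takes the percentile over all angles.

```python
import numpy as np

# Toy batch: x_norm (batch, dim) features and W_norm (dim, C) class
# weights, both L2-normalized; y is a one-hot label matrix (batch, C).
rng = np.random.default_rng(0)
batch, dim, C = 4, 8, 10
x = rng.normal(size=(batch, dim)); x /= np.linalg.norm(x, axis=1, keepdims=True)
W = rng.normal(size=(dim, C)); W /= np.linalg.norm(W, axis=0, keepdims=True)
y = np.eye(C)[rng.integers(0, C, batch)]

logits = x @ W                        # cos(theta), shape (batch, C)
theta = np.arccos(np.clip(logits, -1 + 1e-7, 1 - 1e-7))
s = np.sqrt(2) * np.log(C - 1)        # the paper's fixed initial scale

# B_avg: batch mean of the summed exp(s * cos) over non-target classes
B_avg = np.where(y < 1, np.exp(s * logits), 0.0).sum(axis=1).mean()
# median angle of the target classes (per the paper)
theta_med = np.median(theta[y == 1])
s_new = np.log(B_avg) / np.cos(min(np.pi / 4, theta_med))
print(s_new)
```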
Please let me know as soon as possible if there is any issue. Thanks
Hi, I have a question. Did you train the AdaCos model on multiple GPUs? I ran into an issue when training on multi-GPU. Please let me know how to fix this. Thanks
Hi, thank you for suggesting the addition of the margin parameter. Yes, your code is correct.
I couldn't find how to choose the margin parameter m in the original paper, and another implementation also doesn't set this parameter; that's why I didn't add it to the uploaded code. However, I know it is an important parameter in the pursuit of greater accuracy.
And I have never trained this model on multiple GPUs. But, as you may know, the official Keras site describes how to do it: https://keras.io/utils/#multi_gpu_model The fix itself doesn't seem too difficult, but I'm not sure where a trap might be lurking. I currently don't have a multi-GPU environment, but I'd like to try it some day :)
Thanks for your reply. I hope you will test the Keras AdaCos version in a multi-GPU environment and update the GitHub source. I'll be waiting for good results from you. Let's complete this project together. Thanks
I have the same problem: a memory allocation error occurs even with a batch size of 8. I use two 2080 Ti GPUs with 11 GB of VRAM each, and about 9000 classes. I don't understand why it uses so much memory. When I tried ArcFace, it ran successfully with a batch size of 200. I think tf.gather has a problem, but I don't know how to fix it. Please let me know if you find a solution. Thanks
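On the memory question: the (batch, C, C) tensor reported earlier suggests the per-class angles are being broadcast or gathered across all classes at once. A hypothetical NumPy illustration of the cheaper alternative, indexing each example's target-class angle directly so memory stays at O(batch x C) rather than O(batch x C^2):

```python
import numpy as np

# Hypothetical shapes matching the thread: theta is (batch, C) angles,
# labels is (batch,) integer class ids. Fancy indexing picks one angle
# per example without materializing any (batch, C, C) intermediate.
batch, C = 16, 8631
rng = np.random.default_rng(1)
theta = rng.uniform(0, np.pi, size=(batch, C)).astype(np.float32)
labels = rng.integers(0, C, batch)

target_theta = theta[np.arange(batch), labels]  # shape (batch,)
print(target_theta.shape)  # (16,)
```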
@rose-jinyang Hello, is your AdaCos implementation working in Keras? I'm having some problems with the ArcFace implementations I found on GitHub (I had to tinker with the code because I was getting NaNs, and the performance still seems very different from the PyTorch implementations). I can't get models to converge with a large number of classes. I'd like to try AdaCos if the code for the layer is available. Thanks!
Edit: I posted an adaptation in another issue. I haven't tried the multi-GPU setup yet, but on a single GPU I'm not getting any memory issue with a batch size of 64 (10,500 classes, 130x130x3 images).
Hello, thanks for contributing this project. I got a core dump error during training. I am using tensorflow_gpu 1.14 and Keras 2.3.1, and my GPUs are four Tesla V100 16GB cards. What is your TensorFlow version? Please let me know the likely cause. Thanks