rose-jinyang closed this issue 2 years ago
Hi, I tried several times, changing the batch size. With a batch size of 16, I got the following memory allocation error. I think there is an issue in calculating the AdaCos loss. Please let me know how to fix this. Thanks
Hi. When the number of classes is 8631, I found that the theta_class tensor has the following shape: theta_class: Tensor("ada_cos_1/theta_class:0", shape=(?, 8631, 8631), dtype=float32). Is that right? If so, with a batch size of 16 the size of this tensor alone is greater than 4.7 GB.
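For reference, a quick back-of-the-envelope check of that tensor's size, assuming float32 (4 bytes per element) and the shapes from the log above:

```python
# Rough size of a float32 tensor of shape (batch, C, C),
# with batch=16 and C=8631 classes as reported in the log.
batch, num_classes, bytes_per_float = 16, 8631, 4
size_gb = batch * num_classes ** 2 * bytes_per_float / 1e9
print(f"{size_gb:.2f} GB")  # about 4.77 GB, matching the ">4.7 GB" estimate
```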
Hi! Thank you for asking. I hope you're all right.
I ran it on Google Colab; the TensorFlow version at the time was 1.15.
Looking at your logs, it certainly looks like you have enough GPU memory, so I also don't immediately understand why the OOM happens.
In response to your last question: the shape of the tensor you asked about is correct. I'm sorry I can't say for sure, but it also seems like the batch size may still be too large.
Thanks for your quick reply. You did not use the margin parameter m in the AdaCos layer. I implemented the AdaCos loss as follows:
def call(self, inputs):
    x, y = inputs
    # normalize features
    x = tf.nn.l2_normalize(x, axis=1)
    # normalize weights
    W = tf.nn.l2_normalize(self.W, axis=0)
    # dot product of normalized vectors = cosine similarity
    logits = x @ W
    # clip logits to prevent zero division in the backward pass
    theta = tf.acos(K.clip(logits, -1.0 + K.epsilon(), 1.0 - K.epsilon()))
    # add the angular margin m to the target-class angles
    target_logits = tf.cos(theta + self.m)
    output = logits * (1 - y) + target_logits * y
    # B_avg: batch mean of the summed exp(s * cos(theta)) over non-target classes
    B_avg = tf.where(y < 1, tf.exp(self.s * logits), tf.zeros_like(logits))
    B_avg = tf.reduce_mean(tf.reduce_sum(B_avg, axis=1), name='B_avg')
    # median angle over the batch
    theta_med = tfp.stats.percentile(theta, 50.0, interpolation='midpoint')
    # adaptive scale update
    with tf.control_dependencies([theta_med, B_avg]):
        self.s = tf.math.log(B_avg) / tf.cos(tf.minimum(math.pi / 4, theta_med))
        output *= self.s
    output = tf.nn.softmax(output)
    return output
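To make the scale update above easier to check, here is a minimal NumPy sketch of the same computation on a toy batch. The shapes and random data are hypothetical, just for illustration; note that the AdaCos paper computes the median over the target-class angles only, whereas the snippet above takes the percentile over all angles.

```python
import numpy as np

# Toy batch: x_norm (batch, dim) features and W_norm (dim, C) class
# weights, both L2-normalized; y is a one-hot label matrix (batch, C).
rng = np.random.default_rng(0)
batch, dim, C = 4, 8, 10
x = rng.normal(size=(batch, dim)); x /= np.linalg.norm(x, axis=1, keepdims=True)
W = rng.normal(size=(dim, C)); W /= np.linalg.norm(W, axis=0, keepdims=True)
y = np.eye(C)[rng.integers(0, C, batch)]

logits = x @ W                        # cos(theta), shape (batch, C)
theta = np.arccos(np.clip(logits, -1 + 1e-7, 1 - 1e-7))
s = np.sqrt(2) * np.log(C - 1)        # the paper's fixed initial scale

# B_avg: batch mean of the summed exp(s * cos) over non-target classes
B_avg = np.where(y < 1, np.exp(s * logits), 0.0).sum(axis=1).mean()
# median angle of the target classes (per the paper)
theta_med = np.median(theta[y == 1])
s_new = np.log(B_avg) / np.cos(min(np.pi / 4, theta_med))
print(s_new)
```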
Please let me know as soon as possible if there is any issue. Thanks
Hi, I have a question. Did you train the AdaCos model on multiple GPUs? I ran into an issue when training on multi-GPU. Please let me know how to fix this. Thanks
Hi, thank you for suggesting the addition of the margin parameter. Yes, your code is correct.
I couldn't find how to choose the margin parameter m in the original paper, and another implementation also doesn't set this parameter; that's why I didn't add it to the uploaded code. However, I know it is an important parameter in the pursuit of greater accuracy.
And I have never trained this model on multiple GPUs. But, as you may know, the official Keras site describes how to do it: https://keras.io/utils/#multi_gpu_model The fix itself doesn't seem too difficult, but I'm not sure where a trap might be lurking. I currently don't have a multi-GPU environment, but I'd like to try it some day :)
Thanks for your reply. I hope you will test the Keras AdaCos version in a multi-GPU environment and update the GitHub source. I'll be waiting for good results from you. Let's complete this project together. Thanks
I have the same problem: a memory allocation error occurs even with a batch size of 8. I use two 2080 Ti GPUs with 11 GB of VRAM each, and about 9000 classes. I don't understand why it uses so much memory. When I tried ArcFace, it ran successfully with a batch size of 200. I think tf.gather has a problem, but I don't know how to fix it. Please let me know if you find a solution. Thanks
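On the memory question: the (batch, C, C) tensor reported earlier suggests the per-class angles are being broadcast or gathered across all classes at once. A hypothetical NumPy illustration of the cheaper alternative, indexing each example's target-class angle directly so memory stays at O(batch x C) rather than O(batch x C^2):

```python
import numpy as np

# Hypothetical shapes matching the thread: theta is (batch, C) angles,
# labels is (batch,) integer class ids. Fancy indexing picks one angle
# per example without materializing any (batch, C, C) intermediate.
batch, C = 16, 8631
rng = np.random.default_rng(1)
theta = rng.uniform(0, np.pi, size=(batch, C)).astype(np.float32)
labels = rng.integers(0, C, batch)

target_theta = theta[np.arange(batch), labels]  # shape (batch,)
print(target_theta.shape)  # (16,)
```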
@rose-jinyang Hello, is your AdaCos implementation working in Keras? I'm having some problems with the ArcFace implementations I found on GitHub (I had to tinker with the code because I was getting NaNs, and the performance still seems very different from the PyTorch implementations). I can't get models to converge with a large number of classes. I'd like to try AdaCos if the code for the layer is available. Thanks!
Edit: I posted an adaptation in another issue. I haven't tried the multi-GPU setup yet, but on a single GPU I'm not getting any memory issue with a batch size of 64 (10,500 classes, 130x130x3 images).
Hello, thanks for contributing this project. I got a core dump error during training. I am using tensorflow_gpu 1.14 and Keras 2.3.1, and my GPUs are four Tesla V100 16GB cards. What is your TensorFlow version? Please let me know the likely cause. Thanks