I recently ran into a problem where training becomes much slower once the vocabulary size gets extremely large. TensorFlow prints a warning saying "Converting sparse IndexedSlices to a dense Tensor with 145017088 elements. This may consume a large amount of memory."
I guess TensorFlow is using a dense gradient update on the embedding matrix. Does anyone have any ideas about this?
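For context, here is a minimal sketch of the kind of setup I mean (the sizes, variable names, and GradientTape style are just for illustration, not my actual code): the gradient of an embedding lookup comes back as an IndexedSlices, and the warning seems to appear once something converts that to a dense tensor of the full vocabulary size.

```python
import tensorflow as tf

# Illustrative sizes only; the real vocabulary is much larger.
vocab_size = 1_000_000
embed_dim = 128

embedding = tf.Variable(tf.random.uniform([vocab_size, embed_dim]))
token_ids = tf.constant([[1, 5, 42], [7, 7, 9]])

with tf.GradientTape() as tape:
    # The gradient of the lookup w.r.t. `embedding` only touches the
    # looked-up rows, so TensorFlow returns it as an IndexedSlices.
    vectors = tf.nn.embedding_lookup(embedding, token_ids)
    loss = tf.reduce_sum(vectors ** 2)

(grad,) = tape.gradient(loss, [embedding])
print(type(grad))  # tf.IndexedSlices -- the sparse form

# Explicitly densifying the sparse gradient reproduces the warning; the
# same conversion happens implicitly whenever something in the update
# path calls convert_to_tensor on the IndexedSlices.
dense_grad = tf.convert_to_tensor(grad)
print(dense_grad.shape)  # (vocab_size, embed_dim) elements, mostly zeros
```

Is there a way to keep the gradient in its sparse IndexedSlices form all the way through the optimizer update?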
Thanks