Open GiovanniFaldani opened 1 year ago
Hi, how did you set the dimensions for the embedding layer and how many words are in your vocabulary file (including a word for out-of-vocabulary words)?
Hello,
I fixed it by manually clipping the input tensor so that any value greater than or equal to the vocabulary size is set to 0 (the ID of the <unk> symbol).
All I did was change the embedding_layer function in custom_layers.py as follows:
def embedding_layer(inputs, vocab_size, embedding_dim, initializer):
    """Looks up embedding vectors for each k-mer."""
    # Clip the tensor so that all values >= vocab_size are set to 0
    inputs = tf.where(tf.math.greater_equal(inputs, tf.constant([vocab_size], dtype=tf.int64)),
                      tf.zeros_like(inputs), inputs)
    embedding_weights = tf.compat.v1.get_variable(name="token_embedding_weights",
                                                  shape=[vocab_size, embedding_dim],
                                                  initializer=initializer, trainable=True)
    return tf.compat.v1.nn.embedding_lookup(embedding_weights, inputs)
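If it helps, here is a quick standalone check of what the clipping does (toy values of my own, not the DeepMicrobes pipeline): every ID at or above vocab_size is remapped to 0, the <unk> ID, before the lookup.

import tensorflow as tf

# Toy check of the clipping, run eagerly in TF2; the sizes and IDs are made up
vocab_size = 5
inputs = tf.constant([[1, 4, 5, 7]], dtype=tf.int64)
clipped = tf.where(
    tf.math.greater_equal(inputs, tf.constant([vocab_size], dtype=tf.int64)),
    tf.zeros_like(inputs),
    inputs)
print(clipped.numpy())  # [[1 4 0 0]]

Mapping the out-of-range IDs to <unk> rather than dropping them keeps the input shape unchanged, so nothing else in the model has to be modified.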
Nice work! I have no idea why the ID could exceed the vocabulary size. Sorry about that.
Hello, I am writing because I am trying to train a custom model for DeepMicrobes, and I keep getting the same error whenever I try to train the model on the TFRecord I have created.
The stack trace I get is very long, but I believe the key issue is this:
526337 happens to be exactly the size of the vocabulary file I am using, and the embedding lookup is somehow going out of bounds. How could the embedding of a DNA read use an ID that is not in the vocabulary?
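For context, here is a toy sketch (my own, not the DeepMicrobes code) of what the error boils down to: with an embedding table of vocab_size rows, valid IDs run from 0 to vocab_size - 1, so an ID equal to vocab_size (526337 in my run) falls one past the end of the table.

import tensorflow as tf

# Toy illustration of the failure mode; the sizes are made up, not the actual DeepMicrobes graph
vocab_size, embedding_dim = 8, 4
weights = tf.random.uniform([vocab_size, embedding_dim])
bad_ids = tf.constant([3, 8], dtype=tf.int64)   # 8 == vocab_size, one past the last valid row
tf.nn.embedding_lookup(weights, bad_ids)        # raises InvalidArgumentError on CPU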
I have tried this both with the properly installed version of DeepMicrobes on TensorFlow 1.9 and with my own port of the code to TensorFlow 2, but both versions produce the same error; the only thing that changes between runs is the indices[xx, yy] location at which the lookup goes out of bounds.
Are there any reasons why this might be happening?