MicrobeLab / DeepMicrobes

DeepMicrobes: taxonomic classification for metagenomics with deep learning
https://doi.org/10.1093/nargab/lqaa009
Apache License 2.0

Error training a custom model #27

Open · GiovanniFaldani opened this issue 1 year ago

GiovanniFaldani commented 1 year ago

Hello, I am writing because I am trying to train a custom model for DeepMicrobes, and I keep getting the same error whenever I try to train on the TFRecord I have created.

The stack trace I get is very long, but I believe the key issue is this:

Traceback (most recent call last):

  File ~\anaconda3\lib\site-packages\tensorflow\python\client\session.py:1378 in _do_call
    return fn(*args)

  File ~\anaconda3\lib\site-packages\tensorflow\python\client\session.py:1361 in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,

  File ~\anaconda3\lib\site-packages\tensorflow\python\client\session.py:1454 in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,

InvalidArgumentError: indices[12,53] = 526337 is not in [0, 526337)
     [[{{node token_embedding/embedding_lookup}}]]

526337 happens to be exactly the size of the vocabulary file I am using, so the lookup is failing on an index exactly one past the end of the embedding table. How could the embedding of a DNA read use an index that is not in the vocabulary?
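To double-check that the data really contains an index equal to the vocabulary size, this is roughly how I would scan the TFRecord (a TF2 eager-mode sketch; the feature key "read" is an assumption and may not match the key your converter actually wrote):

# Diagnostic sketch (not DeepMicrobes code): report the largest k-mer index in a
# TFRecord so it can be compared against the vocabulary size.
import tensorflow as tf

def max_token_index(tfrecord_path):
    # "read" is an assumed feature key for the tokenized k-mer IDs; adjust as needed.
    feature_spec = {"read": tf.io.VarLenFeature(tf.int64)}
    max_id = -1
    for raw in tf.data.TFRecordDataset(tfrecord_path):
        example = tf.io.parse_single_example(raw, feature_spec)
        ids = tf.sparse.to_dense(example["read"])
        max_id = max(max_id, int(tf.reduce_max(ids)))
    return max_id

print(max_token_index("train.tfrec"))  # 526337 here would mean the table is one row too small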

I have tried this both with the properly installed version of DeepMicrobes on TensorFlow 1.9 and with my own port of the code to TensorFlow 2. Both versions produce the same error; the only thing that changes between runs is the indices[xx, yy] location at which the lookup goes out of bounds.

Is there any reason why this might be happening?

MicrobeLab commented 1 year ago

Hi, how did you set the dimensions of the embedding layer, and how many words are in your vocabulary file (including the entry for out-of-vocabulary words)?
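For context, the lookup only accepts indices in [0, vocab_size), so the table needs one row per vocabulary word including <unk>. A toy sketch of that constraint (all numbers made up for illustration):

# Toy illustration only: the largest index the lookup accepts is vocab_size - 1.
import tensorflow as tf

vocab_size = 526338      # e.g. 526337 k-mers plus one row for <unk> (illustrative)
embedding_dim = 100      # illustrative

table = tf.random.normal([vocab_size, embedding_dim])
ids = tf.constant([[0, 5, 526337]], dtype=tf.int64)   # 526337 is valid only with 526338 rows
print(tf.nn.embedding_lookup(table, ids).shape)       # (1, 3, 100)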

GiovanniFaldani commented 1 year ago

Hello, I fixed it by manually remapping the input tensor so that any index greater than or equal to the vocabulary size is set to 0 (the ID of the <unk> symbol).

All I did was change the embedding_layer function in custom_layers.py as follows:

def embedding_layer(inputs, vocab_size, embedding_dim, initializer):
    """Looks up embedding vectors for each k-mer."""

    # Remap any index >= vocab_size to 0 (the <unk> ID) so the lookup stays in range.
    inputs = tf.where(
        tf.math.greater_equal(inputs, tf.constant([vocab_size], dtype=tf.int64)),
        tf.zeros_like(inputs),
        inputs)

    embedding_weights = tf.compat.v1.get_variable(
        name="token_embedding_weights",
        shape=[vocab_size, embedding_dim],
        initializer=initializer,
        trainable=True)
    return tf.compat.v1.nn.embedding_lookup(embedding_weights, inputs)
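As a quick sanity check (a sketch using the tf.compat.v1 graph-mode setup and illustrative sizes), the previously failing index now maps to row 0, i.e. <unk>:

# Sanity check of the patched layer; sizes are illustrative.
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

inputs = tf.constant([[1, 2, 526337]], dtype=tf.int64)   # 526337 was the failing index
emb = embedding_layer(inputs, vocab_size=526337, embedding_dim=100,
                      initializer=tf.compat.v1.random_uniform_initializer())

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    print(sess.run(emb).shape)   # (1, 3, 100), no InvalidArgumentError
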
MicrobeLab commented 1 year ago

Nice work! I have no idea why the ID could exceed the vocabulary size. Sorry about that.