Could you double check whether you are running tensorflow or tensorflow-gpu? If you are running tensorflow, or tensorflow-gpu in CPU mode, the row-major/column-major format changes.
If this is the reason for the issue (I suspect so, but I'm not sure), then please find line 91 in src/tagger/config.py; it should say

use_cpu = False

Just flip this line over to True.
Sadly, I didn't know of a reliable way to detect whether I was running in CPU mode, so it's on the user to flip the flag in the config!
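(In TF 1.x you could, in principle, check at runtime; a rough, untested sketch, not something the codebase currently does:

import tensorflow as tf

# Hypothetical auto-detection: tf.test.is_gpu_available() returns False
# for CPU-only tensorflow builds and for tensorflow-gpu when no usable
# CUDA device is visible.
use_cpu = not tf.test.is_gpu_available(cuda_only=True)

I was never confident this covers all setups, hence the manual flag.)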
Hope this helps! -Profir
Thanks for replying so quickly! I was indeed using tensorflow (1.14 to be exact).
Sadly the config change doesn't quite do the trick. The original error vanishes, but there still seems to be some kind of dimension mismatch.
input> This is a test a nd a call to test()
Traceback (most recent call last):
File "C:\Users\cake\Anaconda3\envs\posit\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
return fn(*args)
File "C:\Users\cake\Anaconda3\envs\posit\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\cake\Anaconda3\envs\posit\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [10,1], In[1]: [8,4]
[[{{node feature/bidirectional_rnn/bw/bw/while/lstm_cell_3/BiasAdd_1}}]]
During handling of the above exception, another exception occurred:
...
I will try setting things up on colab (maybe this is a problem with my machine, who knows) and report back if anything different happens there.
Reading up on this error, it seems the bias term in the LSTM cell doesn't match. I'm not particularly sure what the root cause is.
For the saved model the config should be correct; however, I did change the config on the branch (and the code there is significantly changed). Could you sanity check that you are using the code from master? I cannot validate the old model on the branch code before mid/late next week.
If you want to check data integrity:
results.zip: (SHA-256) 597764810B66E13F099216470E0BF4877A6EABBA2A24EDC41B5A6D9ABA771942
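If you prefer to script the check, a minimal Python sketch (hashlib is in the standard library; reading the whole zip into memory is fine at this size):

import hashlib

# Compare the downloaded archive against the SHA-256 above.
with open("results.zip", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest().upper()
print(digest == "597764810B66E13F099216470E0BF4877A6EABBA2A24EDC41B5A6D9ABA771942")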
Also, not sure if my GH LFS covers enough data download, so just in case, here's the corpora: https://1drv.ms/u/s!AnfFX0y_EVFM1IwNFJDRK5EBV3UK4g?e=jQZie6 (OneDrive is easiest on my end to provide).
For colab, the caveat about the CPU applies: make sure the flag is the right way around depending on whether you use a GPU or not.
I also sanity checked on my end (maybe something broke :) ) with the example line:
$ python src/evaluate.py ./results/test/SO_Freq_Id/7dec5e7f-9c9b-4e7b-a52a-cdb6183f83de_with_crf_with_chars_with_features_epochs_30_dropout_0.500_batch_16_opt_rmsprop_lr_0.0100_lrdecay_0.9500/model.weights
(On Windows 10x64 B.2004 with Python 3.7 and Tf-gpu 1.14)
as well as training. Both work on the master branch. I cannot give guarantees on my branch, as the config format there has changed significantly.
I'll keep an eye here in case there are issues on colab.
Quick update after investigating some things on my end. The model was trained before I modeled/allowed UNK in chars. So even once you sort out the earlier errors, you will hit an index[m,n,l] = -1 error. I am pushing an update to revert this on master (I still want it on my branch, though).
You can also fix it manually on your end by editing src/tagger/data_utils.py:L166 to:

char_ids = vocab_chars.doc2idx(chars_, unknown_word_index=0)

The change is from UNK mapping to -1 to UNK mapping to index 0.
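For context, assuming vocab_chars is a gensim Dictionary (the doc2idx signature suggests so), the behavioural difference looks like this:

from gensim.corpora import Dictionary

# Toy character vocabulary; 'z' is out of vocabulary.
vocab_chars = Dictionary([list("abc")])
chars_ = list("abz")

print(vocab_chars.doc2idx(chars_))                         # default: unknown -> -1
print(vocab_chars.doc2idx(chars_, unknown_word_index=0))   # unknown -> index 0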
Hope this helps, and I will keep an eye on the thread if you need other help.
-Profir
It works! I set my project up on colab and it works with the changes you mentioned (with tensorflow 1.15.2)! Now I just have to figure out what's wrong on my machine (maybe Anaconda messed up some installs?).
But thank you so much for the help. Much appreciated!
Hi. I ran evaluate.py with the sample command and got a similar issue to the one mentioned in #3. I tried changing unknown_word_index but it didn't work. I wonder if there is anything I did wrong.
I am using the saved model and the code on the master branch with some modifications: changing the CPU flag to true in config.py as well as in the config.json of the saved model, and inserting a sys.path entry in evaluate.py to avoid a "no module" error (sketched below). This is on Windows 10 with Python 3.7 and TF 1.15.2 (1.14.0 and 1.15.0 were also tried).
Thank you in advance.
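The sys.path insertion looked roughly like this (the exact relative path is from memory, so treat it as a sketch):

import os
import sys

# Make the repository root importable so the src.tagger modules resolve.
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))

The error is as follows: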
Traceback (most recent call last):
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
target_list, run_metadata)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [496,1], In[1]: [8,4]
[[{{node feature/bidirectional_rnn/fw/fw/while/lstm_cell_2/BiasAdd_3}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/evaluate.py", line 80, in <module>
main()
File "src/evaluate.py", line 75, in main
model.evaluate(test)
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\base_model.py", line 151, in evaluate
metrics = self.run_evaluate(test)
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 474, in run_evaluate
labels_pred, labels_pred_l, sequence_lengths = self.predict_batch(words)
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 391, in predict_batch
feed_dict=fd)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [496,1], In[1]: [8,4]
[[node feature/bidirectional_rnn/fw/fw/while/lstm_cell_2/BiasAdd_3 (defined at D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
Original stack trace for 'feature/bidirectional_rnn/fw/fw/while/lstm_cell_2/BiasAdd_3':
File "src/evaluate.py", line 80, in <module>
main()
File "src/evaluate.py", line 68, in main
model = restore_model(config)
File "src/evaluate.py", line 59, in restore_model
model.build()
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 367, in build
self.add_word_embeddings_op()
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 229, in add_word_embeddings_op
sequence_length=feature_sizes, dtype=tf.float32)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 464, in bidirectional_dynamic_rnn
scope=fw_scope)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 707, in dynamic_rnn
dtype=dtype)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 916, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2753, in while_loop
return_same_structure)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2245, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2170, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2705, in <lambda>
body = lambda i, lv: (i + 1, orig_body(*lv))
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 882, in _time_step
skip_conditionals=True)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 283, in _rnn_step
new_output, new_state = call_cell()
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 870, in <lambda>
call_cell = lambda: cell(input_t, state)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 854, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\keras\layers\recurrent.py", line 2241, in call
x_o = K.bias_add(x_o, b_o)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\keras\backend.py", line 5442, in bias_add
x = nn.bias_add(x, bias)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2718, in bias_add
return gen_nn_ops.bias_add(value, bias, data_format=data_format, name=name)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 760, in bias_add
"BiasAdd", value=value, bias=bias, data_format=data_format, name=name)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Hey,
Sadly, a GPU-trained model will not work on the CPU and vice versa. So this: "changing the CPU flag to true in config.py as well as config.json in the saved model" changes the code path taken, while the actual parameters are still GPU-shaped.
The error is exactly what I expect to see in such a scenario.
I assume you need to run on the CPU and don't have time to train a model. I am uncertain how soon I can train a model on the master branch, but if you need a CPU model, I could get to that by the end of March (have other engagements that are saturating my compute and time).
If you can run the GPU model on the GPU, I would suggest going that route instead as that might be quicker than waiting on me to train a fresh CPU model.
Sorry if this answer is a bit disappointing.
For a slightly more technical explanation of what is happening: within the biLSTM library code, some tensors/matrices have a different encoding (row-major/column-major) depending on CPU/GPU execution. This is due to what each device expects to see, so a model trained on one device has parameter data that looks scrambled to the other. Technically, you could try to convert all parameters from row- to column-major and rerun the model, but I have not worked that low-level in tensorflow and cannot advise how to do that, nor do I think that's a useful way to waste time ;)
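To make "scrambled" concrete, here is a toy NumPy illustration of the layout idea (not the actual TF internals):

import numpy as np

# The same six floats in memory, interpreted under two layouts:
buf = np.arange(6, dtype=np.float32)
print(buf.reshape(2, 3, order="C"))  # row-major:    [[0. 1. 2.] [3. 4. 5.]]
print(buf.reshape(2, 3, order="F"))  # column-major: [[0. 2. 4.] [1. 3. 5.]]

Weights written under one convention and read back under the other get garbled in exactly this way.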
Regards, Profir
Thank you so much for your help! I'll try training a CPU model on my machine. I cannot use the GPU because TF 1.15 does not support CUDA 11 or RTX 30-series GPUs.
Let me know if there are other issues or you get stuck and I'll try to help! Good luck!
Hey, I am getting the following error when trying to evaluate the pretrained model weights. Are there any pointers you could give me as to what is going on here?