Could you double check whether you are running tensorflow or tensorflow-gpu? If you are running tensorflow, or tensorflow-gpu in CPU mode, the row-major/column-major format changes.
If this is the reason for the issue (I suspect so, but I'm not sure), then please find line 91 in src/tagger/config.py; it should say

use_cpu = False

Just flip this line over to True.
Sadly, I didn't know of a reliable way to detect whether I was running in CPU mode, so it's on the user to flip the flag in the config!
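(In TF 1.x you could, in principle, check at runtime; a rough, untested sketch, not something the codebase currently does:

import tensorflow as tf

# Hypothetical auto-detection: tf.test.is_gpu_available() returns False
# for CPU-only tensorflow builds and for tensorflow-gpu when no usable
# CUDA device is visible.
use_cpu = not tf.test.is_gpu_available(cuda_only=True)

I was never confident this covers all setups, hence the manual flag.)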
Hope this helps! -Profir
Thanks for replying so quickly! I was indeed using tensorflow (1.14 to be exact).
Sadly the config change doesn't quite do the trick. The original error vanishes, but there still seems to be some kind of dimension mismatch.
input> This is a test a nd a call to test()
Traceback (most recent call last):
File "C:\Users\cake\Anaconda3\envs\posit\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
return fn(*args)
File "C:\Users\cake\Anaconda3\envs\posit\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\cake\Anaconda3\envs\posit\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [10,1], In[1]: [8,4]
[[{{node feature/bidirectional_rnn/bw/bw/while/lstm_cell_3/BiasAdd_1}}]]
During handling of the above exception, another exception occurred:
...
I will try setting things up on colab (maybe this is a problem with my machine, who knows) and report back if anything different happens there.
Reading up on this error, it seems the bias term in the LSTM cell doesn't match. I'm not particularly sure what the root cause is.
For the saved model the config should be correct; however, I did change the config on the branch (and the code there is significantly changed). Could you sanity check that you are using the code from master? I cannot validate the old model on the branch code before mid/late next week.
If you want to check data integrity:
results.zip: (SHA-256) 597764810B66E13F099216470E0BF4877A6EABBA2A24EDC41B5A6D9ABA771942
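If you prefer to script the check, a minimal Python sketch (hashlib is in the standard library; reading the whole zip into memory is fine at this size):

import hashlib

# Compare the downloaded archive against the SHA-256 above.
with open("results.zip", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest().upper()
print(digest == "597764810B66E13F099216470E0BF4877A6EABBA2A24EDC41B5A6D9ABA771942")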
Also, not sure if my GH LFS covers enough data download, so just in case, here's the corpora: https://1drv.ms/u/s!AnfFX0y_EVFM1IwNFJDRK5EBV3UK4g?e=jQZie6 (OneDrive is easiest on my end to provide).
For colab, the caveat about the CPU applies: make sure the flag is the right way around depending on whether you use a GPU or not.
I also sanity checked on my end (maybe something broke :) ) with the example line:
$ python src/evaluate.py ./results/test/SO_Freq_Id/7dec5e7f-9c9b-4e7b-a52a-cdb6183f83de_with_crf_with_chars_with_features_epochs_30_dropout_0.500_batch_16_opt_rmsprop_lr_0.0100_lrdecay_0.9500/model.weights
(On Windows 10x64 B.2004 with Python 3.7 and Tf-gpu 1.14)
as well as training. Both work on the master branch. I cannot give guarantees on my branch, as the config format there has changed significantly.
I'll keep an eye here in case there are issues on colab.
Quick update after investigating some things on my end. The model was trained before I modeled/allowed UNK in chars. So even once you sort out the earlier errors, you will hit an index[m,n,l] = -1 error. I am pushing an update to revert this on master (I still want it on my branch, though).
You can also fix it manually on your end by editing src/tagger/data_utils.py:L166 to:

char_ids = vocab_chars.doc2idx(chars_, unknown_word_index=0)

The change is from UNK mapping to -1 to UNK mapping to index 0.
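For context, assuming vocab_chars is a gensim Dictionary (the doc2idx signature suggests so), the behavioural difference looks like this:

from gensim.corpora import Dictionary

# Toy character vocabulary; 'z' is out of vocabulary.
vocab_chars = Dictionary([list("abc")])
chars_ = list("abz")

print(vocab_chars.doc2idx(chars_))                         # default: unknown -> -1
print(vocab_chars.doc2idx(chars_, unknown_word_index=0))   # unknown -> index 0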
Hope this helps, and I will keep an eye on the thread if you need other help.
-Profir
It works! I set my project up on colab and it works with the changes you mentioned (with tensorflow 1.15.2)! Now I just have to figure out what's wrong on my machine (maybe Anaconda messed up some installs?).
But thank you so much for the help. Much appreciated!
Hi. I ran evaluate.py with the sample command and got a similar issue to the one mentioned in #3. I tried changing unknown_word_index but it didn't work. I wonder if there is anything I did wrong.
I am using the saved model and the code on the master branch with some modifications: changing the CPU flag to true in config.py as well as in the config.json of the saved model, and inserting a sys.path entry in evaluate.py to avoid a "no module" error (sketched below). This is on Windows 10 with Python 3.7 and TF 1.15.2 (1.14.0 and 1.15.0 were also tried).
Thank you in advance.
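The sys.path insertion looked roughly like this (the exact relative path is from memory, so treat it as a sketch):

import os
import sys

# Make the repository root importable so the src.tagger modules resolve.
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))

The error is as follows: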
Traceback (most recent call last):
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
target_list, run_metadata)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [496,1], In[1]: [8,4]
[[{{node feature/bidirectional_rnn/fw/fw/while/lstm_cell_2/BiasAdd_3}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/evaluate.py", line 80, in <module>
main()
File "src/evaluate.py", line 75, in main
model.evaluate(test)
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\base_model.py", line 151, in evaluate
metrics = self.run_evaluate(test)
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 474, in run_evaluate
labels_pred, labels_pred_l, sequence_lengths = self.predict_batch(words)
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 391, in predict_batch
feed_dict=fd)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [496,1], In[1]: [8,4]
[[node feature/bidirectional_rnn/fw/fw/while/lstm_cell_2/BiasAdd_3 (defined at D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
Original stack trace for 'feature/bidirectional_rnn/fw/fw/while/lstm_cell_2/BiasAdd_3':
File "src/evaluate.py", line 80, in <module>
main()
File "src/evaluate.py", line 68, in main
model = restore_model(config)
File "src/evaluate.py", line 59, in restore_model
model.build()
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 367, in build
self.add_word_embeddings_op()
File "D:\GitHubWorkspaces\POSIT\src/..\src\tagger\model.py", line 229, in add_word_embeddings_op
sequence_length=feature_sizes, dtype=tf.float32)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 464, in bidirectional_dynamic_rnn
scope=fw_scope)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 707, in dynamic_rnn
dtype=dtype)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 916, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2753, in while_loop
return_same_structure)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2245, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2170, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2705, in <lambda>
body = lambda i, lv: (i + 1, orig_body(*lv))
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 882, in _time_step
skip_conditionals=True)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 283, in _rnn_step
new_output, new_state = call_cell()
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\rnn.py", line 870, in <lambda>
call_cell = lambda: cell(input_t, state)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 854, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\keras\layers\recurrent.py", line 2241, in call
x_o = K.bias_add(x_o, b_o)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\keras\backend.py", line 5442, in bias_add
x = nn.bias_add(x, bias)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2718, in bias_add
return gen_nn_ops.bias_add(value, bias, data_format=data_format, name=name)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 760, in bias_add
"BiasAdd", value=value, bias=bias, data_format=data_format, name=name)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "D:\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Hey,
Sadly, a GPU-trained model will not work on the CPU and vice versa. So this: "changing the CPU flag to true in config.py as well as config.json in the saved model" changes the code path taken, while the actual parameters are still GPU-shaped.
The error is exactly what I expect to see in such a scenario.
I assume you need to run on the CPU and don't have time to train a model. I am uncertain how soon I can train a model on the master branch, but if you need a CPU model, I could get to that by the end of March (have other engagements that are saturating my compute and time).
If you can run the GPU model on the GPU, I would suggest going that route instead as that might be quicker than waiting on me to train a fresh CPU model.
Sorry if this answer is a bit disappointing.
For a slightly more technical explanation of what is happening: within the biLSTM library code, some tensors/matrices have a different encoding (row-major/column-major) depending on CPU/GPU execution. This is due to what each device expects to see, so a model trained on one device has parameter data that looks scrambled to the other. Technically, you could try to convert all parameters from row- to column-major and rerun the model, but I have not worked that low-level in tensorflow and cannot advise how to do that, nor do I think that's a useful way to waste time ;)
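To make "scrambled" concrete, here is a toy NumPy illustration of the layout idea (not the actual TF internals):

import numpy as np

# The same six floats in memory, interpreted under two layouts:
buf = np.arange(6, dtype=np.float32)
print(buf.reshape(2, 3, order="C"))  # row-major:    [[0. 1. 2.] [3. 4. 5.]]
print(buf.reshape(2, 3, order="F"))  # column-major: [[0. 2. 4.] [1. 3. 5.]]

Weights written under one convention and read back under the other get garbled in exactly this way.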
Regards, Profir
Thank you so much for your help! I'll try training a CPU model on my machine. I cannot use the GPU because TF 1.15 does not support CUDA 11 or RTX 30-series GPUs.
Let me know if there are other issues or you get stuck and I'll try to help! Good luck!
Hey, I am getting the following error when trying to evaluate the pretrained model weights. Are there any pointers you could give me as to what is going on here?