Closed daz10000 closed 1 year ago
Hmm, I started working through all the GPU errors and I think they might actually be the root cause. At least after installing cuDNN, it's failing in a different way, though it might just be failing now at GPU setup and not getting to the error above.
Hi, sorry for the confusion. This error occurs because the RNN implementation is incomplete (it is expected to be completed in #1081). Please give us some time to complete that work. Thank you for your patience. :)
Edit: looks like the PR merged. I'll start by pulling the code and seeing if my example runs now! - thanks.
And apologies for the slow reply at my end, but thanks for your quick answer, it at least stopped me going crazy short term. Do you have any advice here - should I test the branch, can I help with testing or completing anything? Would it be wise to switch to a different library if I need to get this project done near term :( - I love the overall library, and would be happy to be helpful, with the caveat that I do pretty much exclusively F# nowadays. I did think some more examples for the library and / or tests that exercise the basics would be a nice addition, both to help people starting with it, and to verify the basics. I'll take a look at the PR anyway and see how it's coming along. Thanks again - Darren
Sorry, the RNN work hasn't been completed yet. The current situation: SimpleRNN, StackedRnnCell, and SimpleRnncell are done, and RNN in eager mode is done, but there's a small problem in graph mode. You can follow the rnn-dev branch to keep up with the latest developments. And thank you for your interest in TensorFlow.NET.
Without wanting to slow you down, I am trying the rnn-dev branch and building against the current code. It seems like you're making steady progress, and again let me know if I can help with anything. The errors are at least changing. I'm currently hitting this one (I removed GPU support from my code to simplify troubleshooting). I'll search to see if I'm doing anything dumb on my end, but in case this is helpful, I wanted to share. I'm using the LSTM model, which I'm guessing is probably still not finished.
Edit: (ignore the last error, that's a library version problem). Currently bumping up against this. Feels like progress! I'm also trying to move to WSL2 / GPU support - probably something dumb I'm doing here, but thanks for listening. As best I can tell, the error below is due to how I'm setting up the model - it's not ready for eager execution despite my efforts. I'm not sure I can get it into that form, so I might just have to wait till you get the graph mode kinks worked out. Again let me know if I can help,
Darren
Unhandled exception. Tensorflow.RuntimeError: Attempting to capture an EagerTensor without building a function.
at Tensorflow.ops.convert_to_tensor(Object value, TF_DataType dtype, String name, Boolean as_ref, TF_DataType preferred_dtype, Context ctx) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\ops.cs:line 142
at Tensorflow.OpDefLibrary._apply_op_helper(String op_type_name, String name, Dictionary`2 keywords) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Operations\OpDefLibrary.cs:line 165
at Tensorflow.gen_math_ops.mat_mul(Tensor a, Tensor b, Boolean transpose_a, Boolean transpose_b, String name) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Operations\gen_math_ops.cs:line 4941
at Tensorflow.math_ops.<>c__DisplayClass65_0.<matmul>b__0(NameScope scope) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Operations\math_ops.cs:line 807
at Tensorflow.Binding.tf_with[T](T py, Action`1 action) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Binding.Util.cs:line 199
at Tensorflow.math_ops.matmul(Tensor a, Tensor b, Boolean transpose_a, Boolean transpose_b, Boolean adjoint_a, Boolean adjoint_b, Boolean a_is_sparse, Boolean b_is_sparse, String name) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Operations\math_ops.cs:line 786
at Tensorflow.Keras.Layers.Rnn.LSTMCell.Call(Tensors inputs, Tensors states, Nullable`1 training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Layers\Rnn\LSTMCell.cs:line 165
at Tensorflow.Keras.Engine.Layer.Apply(Tensors inputs, Tensors states, Boolean training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Layer.Apply.cs:line 34
at Tensorflow.Keras.Layers.Rnn.LSTM.<>c__DisplayClass6_0.<Call>b__0(Tensors inputs, Tensors states) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Layers\Rnn\LSTM.cs:line 80
at Tensorflow.Keras.BackendImpl.rnn(Func`3 step_function, Tensors inputs, Tensors initial_states, Boolean go_backwards, Tensor mask, Tensors constants, Boolean unroll, Tensors input_length, Boolean time_major, Boolean zero_output_for_mask, Boolean return_all_outputs) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\BackendImpl.cs:line 731
at Tensorflow.Keras.Layers.Rnn.LSTM.Call(Tensors inputs, Tensors initial_state, Nullable`1 training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Layers\Rnn\LSTM.cs:line 85
at Tensorflow.Keras.Engine.Layer.Apply(Tensors inputs, Tensors states, Boolean training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Layer.Apply.cs:line 34
at Tensorflow.Keras.Layers.Rnn.RNN.Apply(Tensors inputs, Tensors initial_states, Boolean training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Layers\Rnn\RNN.cs:line 408
at Tensorflow.Keras.Engine.Functional.Call(Tensors inputs, Tensors state, Nullable`1 training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Functional.cs:line 352
at Tensorflow.Keras.Engine.Layer.Apply(Tensors inputs, Tensors states, Boolean training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Layer.Apply.cs:line 34
at Tensorflow.Keras.Engine.Model.train_step(DataHandler data_handler, Tensors x, Tensors y) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Model.Train.cs:line 38
at Tensorflow.Keras.Engine.Model.train_step_function(DataHandler data_handler, OwnedIterator iterator) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Model.Train.cs:line 15
at Tensorflow.Keras.Engine.Model.FitInternal(DataHandler data_handler, Int32 epochs, Int32 verbose, List`1 callbackList, Nullable`1 validation_data, Func`3 train_step_func) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Model.Fit.cs:line 259
at Tensorflow.Keras.Engine.Model.fit(NDArray x, NDArray y, Int32 batch_size, Int32 epochs, Int32 verbose, List`1 callbacks, Single validation_split, Nullable`1 validation_data, Boolean shuffle, Int32 initial_epoch, Int32 max_queue_size, Int32 workers, Boolean use_multiprocessing) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Model.Fit.cs:line 72
at simple1.Main.main(String[] argv) in C:\XXX\Program.fs:line 98
I'm not familiar with F#, and I tested in rnn-dev using the C# code below and it worked well:
var vocab = 21;
var input = keras.layers.Input(vocab);
var embedding = tf.keras.layers.Embedding(vocab, output_dim: 16).Apply(input);
var lstm = tf.keras.layers.LSTM(8).Apply(embedding);
var model = keras.Model(input, lstm);
model.compile(optimizer: keras.optimizers.RMSprop(1e-3f),
    loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true));
Are you using the latest rnn-dev branch? If so, I will test the F# code again to see whether there is a bug. BTW, this error is most often caused by incorrect dimensions passed to mat_mul, either from inside the library or from outside code. Thank you for your attention to TensorFlow.NET and your enthusiasm and willingness to help. If you would like to help us, you can try to complete the implementation of the GRU, or refactor some redundant code and some classes that are difficult to use.
Thanks again for your patience and quick replies. The recent problems have been during the data fitting stage (once I joined the rnn-dev branch, the null pointer errors went away).
I can confirm the code above works (for what it's worth, here is the F# equivalent - it's almost identical). I will try to build a better example that also exercises the model.fit phase. Right now I have been running into the problem mentioned in #916 - something odd happens in the embedding stage during data fitting. The input_length is 15, vocab = 21, batch size = 64, so I have a (batch=64 x input_length=15) tensor going into the embedding layer, but it looks like the embedding layer is expecting something shaped more like vocab_size x embedded output size (see #916 anyway). One question about the above - I might have confused you with my initial example, but should this input line
var input = keras.layers.Input(vocab);
really be
var input = keras.layers.Input(input_length);
i.e. if I have 15 token inputs with each token from a 21 member vocabulary, what shape are you expecting for the input?
module demo1.Main
open type Tensorflow.Binding
open type Tensorflow.KerasApi
let vocab = 21
let input = keras.layers.Input(vocab)
let embedding = tf.keras.layers.Embedding(vocab, output_dim = 16).Apply(input)
let lstm = tf.keras.layers.LSTM(8).Apply(embedding)
let model = keras.Model(input, lstm)
model.compile(
    optimizer = keras.optimizers.RMSprop(1e-3f),
    loss = keras.losses.SparseCategoricalCrossentropy(from_logits = true))
model.summary()
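On the shape question above: in Keras-style APIs the Input shape describes a single sample, and the vocabulary size is the embedding layer's input_dim rather than part of the input shape. The following is a bookkeeping sketch in plain F# that makes no TensorFlow.NET calls; it assumes the usual Keras embedding semantics carry over to TensorFlow.NET, and all names in it are illustrative only.

```fsharp
// Plain F# shape bookkeeping; no TensorFlow.NET calls are made here.
// All names below are hypothetical and exist only for this sketch.

let vocab = 21         // number of distinct tokens
let inputLength = 15   // tokens per sample
let outputDim = 16     // embedding width used in the example above
let batchSize = 64

// Per-sample input shape: one integer token id per position.
let inputShape = [ inputLength ]

// Embedding weight table: one row of width outputDim per vocabulary entry.
let embeddingTableShape = [ vocab; outputDim ]

// Looking up a batch of token ids yields one vector per token.
let embeddingOutputShape = [ batchSize; inputLength; outputDim ]

printfn "input per sample: %A" inputShape
printfn "embedding table:  %A" embeddingTableShape
printfn "embedding output: %A" embeddingOutputShape
```

Under that assumption, an input declared with the sequence length (15) rather than the vocabulary size (21) is the form that matches a (batch x 15) tensor of token ids.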
The new version has been released; you can update your TensorFlow.NET and TensorFlow.Keras packages to use LSTM and RNN. ^_^
Nice! - I can confirm that the model building stages all run smoothly with the latest package. The full example below still blows up on the last line when it tries to fit this toy data, because of the issue in #916, so I can't fully confirm it's all working, but I trust you have it in hand
#r "nuget:FSharp.Data"
#r "nuget:NumSharp"
#r "nuget:SciSharp.TensorFlow.Redist"
#r "nuget:TensorFlow.Keras"
open type Tensorflow.Binding
open type Tensorflow.KerasApi
open Tensorflow
open Tensorflow.NumPy
open Tensorflow.Keras.Layers
let vocab = 21
let messageLength = 15
let layers = LayersApi()
let inputs = layers.Input(messageLength)
let embedding = layers.Embedding(input_dim = vocab, input_length=15,output_dim = 8).Apply(inputs)
let lstm = tf.keras.layers.LSTM(8).Apply(embedding)
let flatten = layers.Flatten().Apply(lstm)
let dense1 = layers.Dense(32, activation = "relu").Apply(flatten)
let dense2 = layers.Dense(1, activation = "sigmoid").Apply(dense1)
let model = keras.Model(inputs, dense2)
model.summary()
model.compile(
    // optimizer = keras.optimizers.Adam(),
    optimizer = keras.optimizers.RMSprop(),
    loss = keras.losses.BinaryCrossentropy(),
    metrics = [|"accuracy"|]
)
let samples = 1000
let rng = System.Random()
let input =
    Array2D.init samples messageLength (fun j i -> rng.Next(vocab))
let output =
    Array.init
        samples
        (fun j ->
            let values = [| for i in 0..messageLength-1 -> input.[j,i] |]
            if (values |> Array.map float32 |> Array.average) >= 10.0f then 1.0 else 0.0
        )
// this line still fails due to the issue with the embedding shapes in #916
model.fit(np.array input,np.array output,epochs=10,batch_size=32)
> model.fit(np.array input,np.array output,epochs=10,batch_size=32);;
Epoch: 001/010
Tensorflow.InvalidArgumentError: Incompatible shapes: [21,8] vs. [480,8]
at Tensorflow.Eager.EagerRunner.TFE_FastPathExecute(FastPathOpExecInfo op_exec_info)
at Tensorflow.Contexts.Context.ExecEagerAction(String OpType, String Name, ExecuteOpArgs args)
at Tensorflow.Contexts.Context.ExecuteOp(String opType, String name, ExecuteOpArgs args)
at Tensorflow.math_ops.add_v2(Tensor x, Tensor y, String name)
at Tensorflow.Tensor.<>c__DisplayClass380_0`2.<BinaryOpWrapper>b__0(NameScope scope)
at Tensorflow.Tensor.BinaryOpWrapper[Tx,Ty](String name, Tx x, Ty y)
at Tensorflow.Tensor.op_Addition(Tensor lhs, Tensor rhs)
at Tensorflow.Keras.Optimizers.RMSprop._resource_apply_dense(IVariableV1 var, Tensor grad, Dictionary`2 _apply_state)
at Tensorflow.Keras.Optimizers.OptimizerV2.apply_grad_to_update_var(IVariableV1 var, Tensor grad, Dictionary`2 apply_state)
at Tensorflow.Keras.Optimizers.OptimizerV2.<>c__DisplayClass25_1.<_distributed_apply>b__1(NameScope <p0>)
at Tensorflow.Binding.tf_with[T](T py, Action`1 action)
at Tensorflow.Keras.Optimizers.OptimizerV2.<>c__DisplayClass25_0.<_distributed_apply>b__0(NameScope <p0>)
at Tensorflow.Binding.tf_with[T](T py, Action`1 action)
at Tensorflow.Keras.Optimizers.OptimizerV2._distributed_apply(IEnumerable`1 grads_and_vars, String name, Dictionary`2 _apply_state)
at Tensorflow.Keras.Optimizers.OptimizerV2.<>c__DisplayClass20_0.<apply_gradients>b__1(NameScope <p0>)
at Tensorflow.Binding.tf_with[TIn,TOut](TIn py, Func`2 action)
at Tensorflow.Keras.Optimizers.OptimizerV2.apply_gradients(IEnumerable`1 grads_and_vars, String name, Boolean experimental_aggregate_gradients)
at Tensorflow.Keras.Engine.Model._minimize(GradientTape tape, IOptimizer optimizer, Tensor loss, List`1 trainable_variables)
at Tensorflow.Keras.Engine.Model.train_step(DataHandler data_handler, Tensors x, Tensors y)
at Tensorflow.Keras.Engine.Model.train_step_function(DataHandler data_handler, OwnedIterator iterator)
at Tensorflow.Keras.Engine.Model.FitInternal(DataHandler data_handler, Int32 epochs, Int32 verbose, List`1 callbackList, Nullable`1 validation_data, Func`3 train_step_func)
at Tensorflow.Keras.Engine.Model.fit(NDArray x, NDArray y, Int32 batch_size, Int32 epochs, Int32 verbose, List`1 callbacks, Single validation_split, Nullable`1 validation_data, Boolean shuffle, Int32 initial_epoch, Int32 max_queue_size, Int32 workers, Boolean use_multiprocessing)
at <StartupCode$FSI_0046>.$FSI_0046.main@() in c:\XXX\demo1\demo1.fsx:line 86
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Stopped due to error
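A note on the numbers in the error above, offered as one possible reading rather than a confirmed diagnosis: 480 equals batch_size (32) times the message length (15), while 21 is the vocabulary size. That would be consistent with the optimizer receiving one gradient row per looked-up token instead of one row per vocabulary entry. The sketch below only reproduces the arithmetic in plain F#; the variable names are illustrative and nothing here calls TensorFlow.NET.

```fsharp
// Arithmetic only; values are taken from the script and error message above.
let vocab = 21
let messageLength = 15
let batchSize = 32
let outputDim = 8

// Shape of the embedding variable: one row per vocabulary entry.
let variableShape = (vocab, outputDim)

// Shape if there were one gradient row per looked-up token in a batch
// (an assumption about where the 480 comes from, not a verified cause).
let perTokenShape = (batchSize * messageLength, outputDim)

printfn "variable: %A, per-token gradient: %A" variableShape perTokenShape
```

The two tuples match the [21,8] and [480,8] shapes reported by InvalidArgumentError.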
Thanks for the fix to #916 , I was able to verify the LSTM model isn't blowing up anymore in my test case. Closing this, and thanks again for the hard work.
Description
RNN.cs line 42 gets a null pointer referencing !cell.Built during model construction. With the preface that I love this library, and it is likely a mistake on my part triggering this problem, a null pointer exception isn't a great user experience, so this can minimally provide better feedback (plus, I would like to know what I'm doing wrong here - the docs for TensorFlow.Keras aren't that extensive unless I missed something).
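To illustrate the kind of feedback requested here, the following is a minimal F# sketch of a guard that fails early with a descriptive message instead of a bare null reference. The Cell type and ensureCell function are hypothetical stand-ins invented for this example; they are not TensorFlow.NET types and do not reflect how RNN.cs is actually written.

```fsharp
open System

// Hypothetical stand-in for an RNN cell; not the TensorFlow.NET type.
type Cell = { Built: bool }

// Return the cell if present, otherwise raise an exception that names
// the missing argument and hints at the likely cause.
let ensureCell (cell: Cell option) : Cell =
    match cell with
    | Some c -> c
    | None ->
        raise (
            ArgumentNullException(
                "cell",
                "RNN layer was constructed without a cell; check the layer arguments."
            )
        )

// With a cell supplied, the Built flag can be read safely.
let built = (ensureCell (Some { Built = true })).Built
printfn "cell built: %b" built
```

Calling ensureCell None raises ArgumentNullException with the message above, which points at the construction step rather than at an internal line number.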
Reproduction Steps
This is a code fragment reproducing the issue. Save the code as example.fsx and run it from the command line as dotnet fsi example.fsx
When run, I see a stream of GPU errors and then the traceback at the bottom.
In order to capture the faulting line number, I built the libraries from the current main branch and ran locally against those to isolate the problem to line 42 of RNN.cs. Specifically, cell appears to be null, so something is going wrong during class setup. Thoughts and help appreciated. P.S. Are there decent examples of using the Keras library with TensorFlow.NET? I am mostly going from Python examples but would love to see a few fully worked cases.
Known Workarounds
None I can find so far.
Configuration and Other Information
I am testing on a Win11 machine, from a bash prompt with a Quadro GPU (I don't think that's relevant) and I may not have the CUDA drivers installed properly (also likely not relevant). Dotnet runtime is 7.0.203. The code breaks against both the most current nuget versions of the libraries shown below and also the current main branch of the github repo