SciSharp / TensorFlow.NET

.NET Standard bindings for Google's TensorFlow for developing, training and deploying Machine Learning models in C# and F#.
https://scisharp.github.io/tensorflow-net-docs
Apache License 2.0

[BUG Report]: LSTM/RNN model throws Null Pointer Exception #1082

Closed daz10000 closed 1 year ago

daz10000 commented 1 year ago

Description

RNN.cs line 42 gets null pointer referencing !cell.Built during model construction.

With the preface that I love this library, and it is likely a mistake on my part triggering this problem, a null pointer exception isn't a great user experience, so this can minimally provide better feedback. (Plus, I would like to know what I'm doing wrong here; the docs for TensorFlow.Keras aren't that extensive, unless I missed something.)

Reproduction Steps

This is a code fragment reproducing the issue. Save the code as example.fsx and run it from command line as dotnet fsi example.fsx

#r "nuget:TensorFlow.Net" 
#r "nuget:TensorFlow.Keras" 
#r "nuget:NumSharp"
#r "nuget:SciSharp.TensorFlow.Redist-Windows-GPU"

open type Tensorflow.KerasApi
open Tensorflow.Keras.Layers

//let test () =
let layers = LayersApi()
let vocab = 21
let inputs = layers.Input(vocab)
let embedding = layers.Embedding(input_dim = vocab, output_dim = 16).Apply(inputs)
// This next step throws NullReferenceException
let lstm = layers.LSTM(units = 8).Apply(embedding)

When run, I see a stream of GPU errors and then the traceback at the bottom.


$ dotnet fsi t.fsx
2023-05-22 18:31:17.709978: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2023-05-22 18:31:17.710099: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-05-22 18:31:17.949514: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-22 18:31:17.991281: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2023-05-22 18:31:17.992939: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cublas64_11.dll'; dlerror: cublas64_11.dll not found
2023-05-22 18:31:17.994341: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cublasLt64_11.dll'; dlerror: cublasLt64_11.dll not found
2023-05-22 18:31:17.995936: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
2023-05-22 18:31:17.997305: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
2023-05-22 18:31:17.998769: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cusolver64_11.dll'; dlerror: cusolver64_11.dll not found
2023-05-22 18:31:18.000300: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cusparse64_11.dll'; dlerror: cusparse64_11.dll not found
2023-05-22 18:31:18.001689: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2023-05-22 18:31:18.001723: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
System.NullReferenceException: Object reference not set to an instance of an object.
   at Tensorflow.Keras.Layers.Rnn.RNN.build(KerasShapesWrapper input_shape)
   at Tensorflow.Keras.Engine.Layer.MaybeBuild(Tensors inputs)
   at Tensorflow.Keras.Engine.Layer.FunctionalConstructionCall(Tensors inputs)
   at Tensorflow.Keras.Engine.Layer.Apply(Tensors inputs, Tensor state, Boolean training)
   at <StartupCode$FSI_0002>.$FSI_0002.main@() in C:\f\BioLoomics\ML\simple1\t.fsx:line 15
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Stopped due to error

In order to capture the faulting line number, I built the libraries from the current main branch and ran locally against those to isolate the problem to line 42 of RNN.cs. Specifically, cell appears to be null, so something is going wrong during class setup.

public override void build(KerasShapesWrapper input_shape)
{
    if (!cell.Built) // Line 42, cell is null
    {
        cell.build(input_shape);
    }
}
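
The request for better feedback than a bare NullReferenceException can be illustrated with a small guard. This is only a sketch in F#: `ToyCell` and `ToyRnn` are hypothetical stand-ins, not TensorFlow.NET types, and this is not necessarily how the library resolved the problem.

```fsharp
open System

// Hypothetical stand-in for an RNN cell; not the TensorFlow.NET type.
[<AllowNullLiteral>]
type ToyCell() =
    member val Built = false with get, set
    member this.Build(inputShape: int list) =
        // A real cell would create its weights here.
        this.Built <- true

// Hypothetical layer wrapper that validates its cell before building.
type ToyRnn(cell: ToyCell) =
    member _.Build(inputShape: int list) =
        if isNull cell then
            // Name the likely cause instead of letting a null dereference escape.
            raise (InvalidOperationException("RNN layer has no cell; the layer was constructed without one."))
        if not cell.Built then
            cell.Build(inputShape)

// Usage: a missing cell now yields a descriptive error message.
let message =
    try
        ToyRnn(null).Build([ 21; 16 ])
        "no error"
    with :? InvalidOperationException as ex -> ex.Message

printfn "%s" message
```

With a check like this, the failure points at the missing cell rather than at an arbitrary line inside `build`.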

Thoughts and help appreciated. P.S. Are there decent examples of using the Keras library with TensorFlow.NET? I am mostly going from Python examples but would love to see a few fully worked cases.

Known Workarounds

None I can find so far.

Configuration and Other Information

I am testing on a Win11 machine, from a bash prompt, with a Quadro GPU (I don't think that's relevant), and I may not have the CUDA drivers installed properly (also likely not relevant). Dotnet runtime is 7.0.203. The code breaks against both the most current nuget versions of the libraries shown below and the current main branch of the GitHub repo.

daz10000 commented 1 year ago

Hmm, I started working through all the GPU errors and I think that might actually be the root cause. At least after installing cuDNN, it's failing in a different way, though it might just be failing now at GPU setup and not getting to the error above.

AsakusaRinne commented 1 year ago

Hi, sorry for the confusion. This error occurs because the RNN implementation is incomplete (it is expected to be completed in #1081). Please give us some time to complete that work. Thank you for your patience. :)

daz10000 commented 1 year ago

Edit: looks like the PR merged. I'll start by pulling the code and seeing if my example runs now! - thanks.

And apologies for the slow reply on my end, but thanks for your quick answer; it at least stopped me going crazy short term. Do you have any advice here: should I test the branch, and can I help with testing or completing anything? Would it be wise to switch to a different library if I need to get this project done near term? :( I love the overall library and would be happy to help, with the caveat that I do pretty much exclusively F# nowadays. I did think some more examples for the library and/or tests that exercise the basics would be a nice addition, both to help people starting with it and to verify the basics. I'll take a look at the PR anyway and see how it's coming along. Thanks again - Darren

Wanglongzhi2001 commented 1 year ago

Sorry, the RNN hasn't been completed yet. The current status: SimpleRNN, StackedRnnCell, and SimpleRnncell are done, and the RNN in eager mode is done, but there is still a small problem in graph mode. You can follow the rnn-dev branch to keep up with the latest developments. And thank you for your interest in TensorFlow.NET.

daz10000 commented 1 year ago

Without wanting to slow you down, I am trying the rnn-dev branch and building against the current code. It seems like you're making steady progress, and again, let me know if I can help with anything. The errors are at least changing. I'm currently hitting this one (I removed GPU support from my code to simplify troubleshooting). I'll search to see if I'm doing anything dumb on my end, but in case this is helpful, I wanted to share. I'm using the LSTM model, which I'm guessing is probably still not finished.

Edit: (ignore the last error, that's a library version problem). Currently bumping up against this. Feels like progress! I'm also trying to move to WSL2 / GPU support; probably something dumb I'm doing here, but thanks for listening. As best I can tell, the error below is due to how I'm setting up the model: it's not ready for eager execution despite my efforts. I'm not sure I can get it into that form, so I might just have to wait till you get the graph mode kinks worked out. Again, let me know if I can help.

Darren

Unhandled exception. Tensorflow.RuntimeError: Attempting to capture an EagerTensor without building a function.
   at Tensorflow.ops.convert_to_tensor(Object value, TF_DataType dtype, String name, Boolean as_ref, TF_DataType preferred_dtype, Context ctx) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\ops.cs:line 142
   at Tensorflow.OpDefLibrary._apply_op_helper(String op_type_name, String name, Dictionary`2 keywords) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Operations\OpDefLibrary.cs:line 165
   at Tensorflow.gen_math_ops.mat_mul(Tensor a, Tensor b, Boolean transpose_a, Boolean transpose_b, String name) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Operations\gen_math_ops.cs:line 4941
   at Tensorflow.math_ops.<>c__DisplayClass65_0.<matmul>b__0(NameScope scope) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Operations\math_ops.cs:line 807
   at Tensorflow.Binding.tf_with[T](T py, Action`1 action) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Binding.Util.cs:line 199
   at Tensorflow.math_ops.matmul(Tensor a, Tensor b, Boolean transpose_a, Boolean transpose_b, Boolean adjoint_a, Boolean adjoint_b, Boolean a_is_sparse, Boolean b_is_sparse, String name) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Core\Operations\math_ops.cs:line 786
   at Tensorflow.Keras.Layers.Rnn.LSTMCell.Call(Tensors inputs, Tensors states, Nullable`1 training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Layers\Rnn\LSTMCell.cs:line 165
   at Tensorflow.Keras.Engine.Layer.Apply(Tensors inputs, Tensors states, Boolean training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Layer.Apply.cs:line 34
   at Tensorflow.Keras.Layers.Rnn.LSTM.<>c__DisplayClass6_0.<Call>b__0(Tensors inputs, Tensors states) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Layers\Rnn\LSTM.cs:line 80
   at Tensorflow.Keras.BackendImpl.rnn(Func`3 step_function, Tensors inputs, Tensors initial_states, Boolean go_backwards, Tensor mask, Tensors constants, Boolean unroll, Tensors input_length, Boolean time_major, Boolean zero_output_for_mask, Boolean return_all_outputs) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\BackendImpl.cs:line 731
   at Tensorflow.Keras.Layers.Rnn.LSTM.Call(Tensors inputs, Tensors initial_state, Nullable`1 training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Layers\Rnn\LSTM.cs:line 85
   at Tensorflow.Keras.Engine.Layer.Apply(Tensors inputs, Tensors states, Boolean training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Layer.Apply.cs:line 34
   at Tensorflow.Keras.Layers.Rnn.RNN.Apply(Tensors inputs, Tensors initial_states, Boolean training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Layers\Rnn\RNN.cs:line 408
   at Tensorflow.Keras.Engine.Functional.Call(Tensors inputs, Tensors state, Nullable`1 training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Functional.cs:line 352
   at Tensorflow.Keras.Engine.Layer.Apply(Tensors inputs, Tensors states, Boolean training, IOptionalArgs optional_args) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Layer.Apply.cs:line 34
   at Tensorflow.Keras.Engine.Model.train_step(DataHandler data_handler, Tensors x, Tensors y) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Model.Train.cs:line 38
   at Tensorflow.Keras.Engine.Model.train_step_function(DataHandler data_handler, OwnedIterator iterator) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Model.Train.cs:line 15
   at Tensorflow.Keras.Engine.Model.FitInternal(DataHandler data_handler, Int32 epochs, Int32 verbose, List`1 callbackList, Nullable`1 validation_data, Func`3 train_step_func) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Model.Fit.cs:line 259
   at Tensorflow.Keras.Engine.Model.fit(NDArray x, NDArray y, Int32 batch_size, Int32 epochs, Int32 verbose, List`1 callbacks, Single validation_split, Nullable`1 validation_data, Boolean shuffle, Int32 initial_epoch, Int32 max_queue_size, Int32 workers, Boolean use_multiprocessing) in C:\XXX\TensorFlow.NET\src\TensorFlowNET.Keras\Engine\Model.Fit.cs:line 72
   at simple1.Main.main(String[] argv) in C:\XXX\Program.fs:line 98

Wanglongzhi2001 commented 1 year ago

I'm not familiar with F#, so I tested on rnn-dev using the C# code below, and it worked well:

var vocab = 21;
var input = keras.layers.Input(vocab);
var embedding = tf.keras.layers.Embedding(vocab, output_dim: 16).Apply(input);
var lstm = tf.keras.layers.LSTM(8).Apply(embedding);

var model = keras.Model(input, lstm);
model.compile(optimizer: keras.optimizers.RMSprop(1e-3f),
    loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true));

Are you using the latest rnn-dev branch? If so, I will test the F# code again to see whether there is a bug. BTW, in most cases this error is due to incorrect dimensions passed to mat_mul, from either internal or user code. Thank you for your attention to TensorFlow.NET and your enthusiasm and willingness to help. If you would like to help us, you can try to complete the implementation of the GRU, or refactor some redundant code and some classes that are difficult to use.

daz10000 commented 1 year ago

Thanks again for your patience and quick replies. The recent problems have been during the data fitting stage (once I joined the rnn-dev branch, the null pointer errors went away).

I can confirm the code above works (for what it's worth, the F# equivalent is at the end of this comment; it's almost identical). I will try to build a better example that also exercises the model.fit phase. Right now I have been running into the problem mentioned in #916: something odd happens in the embedding stage during data fitting. The input_length is 15, vocab = 21, and batch size = 64, so I have a (batch=64 x input_length=15) tensor going into the embedding layer, but it looks like the embedding layer is expecting something shaped more like vocab_size x embedded output size (see #916 anyway). One question about the above, since I might have confused you with my initial example: if I have 15 token inputs, with each token drawn from a 21-member vocabulary, what shape are you expecting for the input? That is, should this input line

var input = keras.layers.Input(vocab);

really be

var input = keras.layers.Input(input_length);
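
For reference on the shape question: in the usual Keras convention, the Input shape describes one sample (here the sequence length), while the embedding's input_dim is the vocabulary size. The sketch below mimics an embedding lookup with plain F# arrays to show the resulting shapes. It assumes that convention and does not call TensorFlow.NET; all names in it are illustrative.

```fsharp
// Toy embedding lookup with plain arrays (no TensorFlow.NET involved).
let vocab = 21        // number of distinct tokens
let inputLength = 15  // tokens per sample
let outputDim = 16    // embedding width
let batch = 64        // samples per batch

// Embedding table: one row of outputDim values per vocabulary entry.
let table = Array2D.init vocab outputDim (fun v d -> float32 (v * outputDim + d))

// A batch of token ids, each in the range [0, vocab).
let tokens = Array2D.init batch inputLength (fun b t -> (b + t) % vocab)

// The lookup replaces every token id with its row from the table.
let embedded =
    Array3D.init batch inputLength outputDim (fun b t d -> table.[tokens.[b, t], d])

printfn "tokens:   [%d; %d]" (Array2D.length1 tokens) (Array2D.length2 tokens)
printfn "embedded: [%d; %d; %d]" (Array3D.length1 embedded) (Array3D.length2 embedded) (Array3D.length3 embedded)
```

Under that convention, the vocabulary size only determines the number of rows in the table, so a (64 x 15) batch of token ids becomes a (64 x 15 x 16) tensor.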

module demo1.Main

open type Tensorflow.Binding
open type Tensorflow.KerasApi

let vocab = 21
let input = keras.layers.Input(vocab)
let embedding = tf.keras.layers.Embedding(vocab, output_dim = 16).Apply(input)
let lstm = tf.keras.layers.LSTM(8).Apply(embedding)

let model = keras.Model(input, lstm)
model.compile(
        optimizer = keras.optimizers.RMSprop(1e-3f), 
        loss = keras.losses.SparseCategoricalCrossentropy(from_logits= true))

model.summary()

Wanglongzhi2001 commented 1 year ago

A new version has been released. You can update your TensorFlow.NET and TensorFlow.Keras packages to use LSTM and RNN. ^_^

daz10000 commented 1 year ago

Nice! I can confirm that the model building stages all run smoothly with the latest package. The full example below still blows up on the last line when it tries to fit this toy data, because of the issue in #916, so I can't fully confirm it's all working, but I trust you have it in hand.

#r "nuget:FSharp.Data"
#r "nuget:NumSharp"
#r "nuget:SciSharp.TensorFlow.Redist"
#r "nuget:TensorFlow.Keras"

open type Tensorflow.Binding
open type Tensorflow.KerasApi
open Tensorflow
open Tensorflow.NumPy
open Tensorflow.Keras.Layers

let vocab = 21
let messageLength = 15
let layers = LayersApi()
let inputs = layers.Input(messageLength)

let embedding = layers.Embedding(input_dim = vocab, input_length=15,output_dim = 8).Apply(inputs)
let lstm = tf.keras.layers.LSTM(8).Apply(embedding)
let flatten = layers.Flatten().Apply(lstm)

let dense1 = layers.Dense(32, activation = "relu").Apply(flatten)
let dense2 = layers.Dense(1, activation = "sigmoid").Apply(dense1)
let model = keras.Model(inputs, dense2)

model.summary()

model.compile(
        // optimizer = keras.optimizers.Adam(),
        optimizer = keras.optimizers.RMSprop(),
        loss = keras.losses.BinaryCrossentropy(),
        metrics = [|"accuracy"|]
)

let samples = 1000
let rng = System.Random()
let input = Array2D.init samples messageLength
                            (fun j i -> rng.Next(vocab))

let output =
    Array.init
        samples
        (fun j ->
            let values = [| for i in 0..messageLength-1 -> input.[j,i]  |]
            if (values |> Array.map float32|> Array.average )>= 10.0f then 1.0 else 0.0
        )

// This line still fails due to the issue with the embedding shapes in #916
model.fit(np.array input, np.array output, epochs=10, batch_size=32)

> model.fit(np.array input,np.array output,epochs=10,batch_size=32);;
Epoch: 001/010
Tensorflow.InvalidArgumentError: Incompatible shapes: [21,8] vs. [480,8]
   at Tensorflow.Eager.EagerRunner.TFE_FastPathExecute(FastPathOpExecInfo op_exec_info)
   at Tensorflow.Contexts.Context.ExecEagerAction(String OpType, String Name, ExecuteOpArgs args)
   at Tensorflow.Contexts.Context.ExecuteOp(String opType, String name, ExecuteOpArgs args)
   at Tensorflow.math_ops.add_v2(Tensor x, Tensor y, String name)
   at Tensorflow.Tensor.<>c__DisplayClass380_0`2.<BinaryOpWrapper>b__0(NameScope scope)
   at Tensorflow.Tensor.BinaryOpWrapper[Tx,Ty](String name, Tx x, Ty y)
   at Tensorflow.Tensor.op_Addition(Tensor lhs, Tensor rhs)
   at Tensorflow.Keras.Optimizers.RMSprop._resource_apply_dense(IVariableV1 var, Tensor grad, Dictionary`2 _apply_state)
   at Tensorflow.Keras.Optimizers.OptimizerV2.apply_grad_to_update_var(IVariableV1 var, Tensor grad, Dictionary`2 apply_state)
   at Tensorflow.Keras.Optimizers.OptimizerV2.<>c__DisplayClass25_1.<_distributed_apply>b__1(NameScope <p0>)
   at Tensorflow.Binding.tf_with[T](T py, Action`1 action)
   at Tensorflow.Keras.Optimizers.OptimizerV2.<>c__DisplayClass25_0.<_distributed_apply>b__0(NameScope <p0>)
   at Tensorflow.Binding.tf_with[T](T py, Action`1 action)
   at Tensorflow.Keras.Optimizers.OptimizerV2._distributed_apply(IEnumerable`1 grads_and_vars, String name, Dictionary`2 _apply_state)
   at Tensorflow.Keras.Optimizers.OptimizerV2.<>c__DisplayClass20_0.<apply_gradients>b__1(NameScope <p0>)
   at Tensorflow.Binding.tf_with[TIn,TOut](TIn py, Func`2 action)
   at Tensorflow.Keras.Optimizers.OptimizerV2.apply_gradients(IEnumerable`1 grads_and_vars, String name, Boolean experimental_aggregate_gradients)
   at Tensorflow.Keras.Engine.Model._minimize(GradientTape tape, IOptimizer optimizer, Tensor loss, List`1 trainable_variables)
   at Tensorflow.Keras.Engine.Model.train_step(DataHandler data_handler, Tensors x, Tensors y)
   at Tensorflow.Keras.Engine.Model.train_step_function(DataHandler data_handler, OwnedIterator iterator)
   at Tensorflow.Keras.Engine.Model.FitInternal(DataHandler data_handler, Int32 epochs, Int32 verbose, List`1 callbackList, Nullable`1 validation_data, Func`3 train_step_func)
   at Tensorflow.Keras.Engine.Model.fit(NDArray x, NDArray y, Int32 batch_size, Int32 epochs, Int32 verbose, List`1 callbacks, Single validation_split, Nullable`1 validation_data, Boolean shuffle, Int32 initial_epoch, Int32 max_queue_size, Int32 workers, Boolean use_multiprocessing)
   at <StartupCode$FSI_0046>.$FSI_0046.main@() in c:\XXX\demo1\demo1.fsx:line 86
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Stopped due to error
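
The numbers in the error message are suggestive: 480 equals batch_size x messageLength (32 x 15), and [21,8] is the shape of the embedding table. One reading, which is an inference from those numbers and not something stated in the thread, is that the gradient for the embedding arrives as one row per looked-up token and is added to a table-shaped tensor without first being accumulated per token id. The sketch below shows that accumulation (a scatter-add) with plain F# arrays; the gradient values are made up for illustration.

```fsharp
let vocab = 21
let messageLength = 15
let outputDim = 8
let batchSize = 32

// Token ids for one batch, flattened to batchSize * messageLength lookups.
let ids = Array.init (batchSize * messageLength) (fun i -> i % vocab)

// Hypothetical per-lookup gradient rows, shape [480; 8].
let gradRows = Array2D.create ids.Length outputDim 1.0f

// Dense gradient with the same shape as the embedding table, [21; 8].
let denseGrad = Array2D.create vocab outputDim 0.0f

// Scatter-add: accumulate each lookup's row into the row of its token id.
for row in 0 .. ids.Length - 1 do
    for d in 0 .. outputDim - 1 do
        denseGrad.[ids.[row], d] <- denseGrad.[ids.[row], d] + gradRows.[row, d]

printfn "gradRows:  [%d; %d]" (Array2D.length1 gradRows) (Array2D.length2 gradRows)
printfn "denseGrad: [%d; %d]" (Array2D.length1 denseGrad) (Array2D.length2 denseGrad)
```

Only after this step do the gradient and the table have matching shapes, which is what an elementwise add in the optimizer requires.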

daz10000 commented 1 year ago

Thanks for the fix to #916. I was able to verify that the LSTM model isn't blowing up anymore in my test case. Closing this, and thanks again for the hard work.