SciSharp / TensorFlow.NET

.NET Standard bindings for Google's TensorFlow for developing, training and deploying Machine Learning models in C# and F#.
https://scisharp.github.io/tensorflow-net-docs
Apache License 2.0

[BUG Report]: System.NullReferenceException during fit() #1206

Open avipreshel opened 8 months ago

avipreshel commented 8 months ago

Description

I have fairly plain code with a custom loss function. It throws a NullReferenceException right at the start of fit().

    using Tensorflow;
    using Tensorflow.Keras.Losses;
    using Tensorflow.Keras.Metrics;
    using Tensorflow.Keras.Optimizers;
    using Tensorflow.NumPy;
    using Tensorflow.Operations.Initializers;
    using static Tensorflow.Binding;
    using static Tensorflow.KerasApi;

    namespace KerasDotNet
    {
        internal class WeightedF1Loss : ILossFunc
        {
            public string Reduction => throw new NotImplementedException();

            public string Name => nameof(WeightedF1Loss);

            Tensor _beta;
            Tensor _epsilon;
            public WeightedF1Loss(float beta)
            {
                _beta = tf.constant(beta, dtype: TF_DataType.TF_FLOAT);
                _epsilon = tf.constant(float.Epsilon, dtype: TF_DataType.TF_FLOAT);
            }

            public Tensor Call(Tensor y_true, Tensor y_pred, Tensor sample_weight)
            {
                y_pred = tf.cast(y_pred >= 0.7f, TF_DataType.TF_FLOAT);

                var tp = tf.reduce_sum(y_true * y_pred);
                var fp = tf.reduce_sum((1 - y_true) * y_pred);
                var fn = tf.reduce_sum(y_true * (1 - y_pred));
                var precision = tp / (tp + fp + _epsilon);
                var recall = tp / (tp + fn + _epsilon);
                var f1_score = (tf.square(_beta) + 1) * (precision * recall) / (tf.square(_beta) * precision + recall + _epsilon);
                var res = 1 - f1_score;
                return res;
            }
        }

        internal class Program
        {
            static void Main(string[] args)
            {
                //tf.enable_eager_execution(); ==> Runs the same with or without this line
                var x = np.array(new float[,] { { 1, 1, 1 }, { 2, 2, 2 }, { 1, 0, 3 }, { 1, 1, 3 } });
                var y = np.array(new float[,] { { 0 }, { 1 }, { 1 }, { 0 } });

                var loss = new WeightedF1Loss(0.2f);
                var inputs = keras.Input(shape: 3);
                var l1 = keras.layers.Dense(2, activation: "relu").Apply(inputs);
                var outputs = keras.layers.Dense(1, activation: "sigmoid").Apply(l1);

                var model = keras.Model(inputs, outputs);
                model.compile(optimizer: new Adam(), loss: loss, new[] { "precision", "recall" });

                model.summary();

                model.fit(x, y, batch_size: 1, epochs: 10);
            }
        }
    }

The exception is thrown from GetDataType() since "data" is null.
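A possible explanation, not confirmed by the maintainers in this thread: the `y_pred >= 0.7f` comparison followed by a cast is a step function, so a small change in the weights leaves the loss unchanged and no gradient can reach the optimizer. The plain C# sketch below does not use TensorFlow.NET. It recomputes the formulas from Call() on ordinary floats, with made-up predictions and an assumed `eps` of 1e-7, to show that the thresholded loss does not react to a small nudge while the same formulas on raw probabilities do.

```csharp
using System;
using System.Linq;

// Labels from the report; the predictions are made-up sigmoid outputs for illustration.
float[] yTrue = { 0f, 1f, 1f, 0f };
float[] yPred = { 0.2f, 0.9f, 0.6f, 0.4f };
const float beta = 0.2f;
const float eps = 1e-7f;   // assumed small constant; the report uses float.Epsilon

// F-beta loss following the formulas in Call(); 'harden' applies the >= 0.7 threshold first.
float Loss(float[] pred, bool harden)
{
    float[] p = harden ? pred.Select(v => v >= 0.7f ? 1f : 0f).ToArray() : pred;
    float tp = yTrue.Zip(p, (t, q) => t * q).Sum();
    float fp = yTrue.Zip(p, (t, q) => (1 - t) * q).Sum();
    float fn = yTrue.Zip(p, (t, q) => t * (1 - q)).Sum();
    float precision = tp / (tp + fp + eps);
    float recall = tp / (tp + fn + eps);
    float b2 = beta * beta;
    return 1 - (b2 + 1) * precision * recall / (b2 * precision + recall + eps);
}

// Nudge every prediction slightly, as a tiny weight update would.
float[] nudged = yPred.Select(v => v + 0.01f).ToArray();

float hardDelta = Loss(nudged, harden: true) - Loss(yPred, harden: true);
float softDelta = Loss(nudged, harden: false) - Loss(yPred, harden: false);

Console.WriteLine($"hard-threshold loss change: {hardDelta}");  // 0: no prediction crosses 0.7
Console.WriteLine($"soft loss change: {softDelta}");            // non-zero
```

If this is the cause, dropping the threshold line and using y_pred directly in Call() might be worth trying. Separately, float.Epsilon is the smallest positive subnormal float (about 1.4e-45), so it may offer little protection against division by zero.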

Stack trace dump:

Tensorflow.Binding.dll!Tensorflow.Binding.GetDataType(object data) Line 615 C#
Tensorflow.Binding.dll!Tensorflow.ops.convert_to_tensor(object value, Tensorflow.TF_DataType dtype, string name, bool as_ref, Tensorflow.TF_DataType preferred_dtype, Tensorflow.Contexts.Context ctx) Line 353 C#
Tensorflow.Binding.dll!Tensorflow.tensorflow.convert_to_tensor(object value, Tensorflow.TF_DataType dtype, string name, Tensorflow.TF_DataType preferred_dtype) Line 2038 C#
Tensorflow.Binding.dll!Tensorflow.Eager.EagerRunner.AddInputToOp(object inputs, bool add_type_attr, Tensorflow.OpDef.Types.ArgDef input_arg, System.Collections.Generic.List flattened_attrs, System.Collections.Generic.List flattened_inputs, Tensorflow.Eager.SafeEagerOpHandle op, Tensorflow.Status status) Line 450 C#
Tensorflow.Binding.dll!Tensorflow.Eager.EagerRunner.TFE_FastPathExecute(Tensorflow.FastPathOpExecInfo op_exec_info) Line 319 C#
Tensorflow.Binding.dll!Tensorflow.Contexts.Context.ExecEagerAction(string OpType, string Name, Tensorflow.ExecuteOpArgs args) Line 494 C#
Tensorflow.Binding.dll!Tensorflow.Contexts.Context.ExecuteOp(string opType, string name, Tensorflow.ExecuteOpArgs args) Line 534 C#
Tensorflow.Binding.dll!Tensorflow.gen_training_ops.resource_apply_adam(Tensorflow.Tensor var, Tensorflow.Tensor m, Tensorflow.Tensor v, Tensorflow.Tensor beta1_power, Tensorflow.Tensor beta2_power, Tensorflow.Tensor lr, Tensorflow.Tensor beta1, Tensorflow.Tensor beta2, Tensorflow.Tensor epsilon, Tensorflow.Tensor grad, bool use_locking, bool use_nesterov, string name) Line 7 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.Adam._resource_apply_dense(Tensorflow.IVariableV1 var, Tensorflow.Tensor grad, System.Collections.Generic.Dictionary<Tensorflow.Keras.Optimizers.DeviceDType, System.Collections.Generic.Dictionary<string, Tensorflow.Tensor>> apply_state) Line 79 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2.apply_grad_to_update_var(Tensorflow.IVariableV1 var, Tensorflow.Tensor grad, System.Collections.Generic.Dictionary<Tensorflow.Keras.Optimizers.DeviceDType, System.Collections.Generic.Dictionary<string, Tensorflow.Tensor>> apply_state) Line 108 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2._distributed_apply.AnonymousMethod__1(Tensorflow.ops.NameScope ) Line 130 C#
Tensorflow.Binding.dll!Tensorflow.Binding.tf_with(Tensorflow.ops.NameScope py, System.Action action) Line 256 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2._distributed_apply.AnonymousMethod__0(Tensorflow.ops.NameScope ) Line 133 C#
Tensorflow.Binding.dll!Tensorflow.Binding.tf_with(Tensorflow.ops.NameScope py, System.Action action) Line 256 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2._distributed_apply(System.Collections.Generic.IEnumerable<(Tensorflow.Tensor, Tensorflow.IVariableV1)> grads_and_vars, string name, System.Collections.Generic.Dictionary<Tensorflow.Keras.Optimizers.DeviceDType, System.Collections.Generic.Dictionary<string, Tensorflow.Tensor>> _apply_state) Line 134 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2.apply_gradients.AnonymousMethod__1(Tensorflow.ops.NameScope ) Line 73 C#
Tensorflow.Binding.dll!Tensorflow.Binding.tf_with<Tensorflow.ops.NameScope, Tensorflow.Operation>(Tensorflow.ops.NameScope py, System.Func<Tensorflow.ops.NameScope, Tensorflow.Operation> action) Line 263 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2.apply_gradients(System.Collections.Generic.IEnumerable<(Tensorflow.Tensor, Tensorflow.IVariableV1)> grads_and_vars, string name, bool experimental_aggregate_gradients) Line 75 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model._minimize(Tensorflow.Gradients.GradientTape tape, Tensorflow.Keras.Engine.IOptimizer optimizer, Tensorflow.Tensor loss, System.Collections.Generic.List trainable_variables) Line 897 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model.train_step(Tensorflow.Keras.Engine.DataAdapters.DataHandler data_handler, Tensorflow.Tensors x, Tensorflow.Tensors y) Line 877 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model.train_step_function(Tensorflow.Keras.Engine.DataAdapters.DataHandler data_handler, Tensorflow.OwnedIterator iterator) Line 856 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model.FitInternal(Tensorflow.Keras.Engine.DataAdapters.DataHandler data_handler, int epochs, int verbose, System.Collections.Generic.List callbackList, (Tensorflow.NumPy.NDArray, Tensorflow.NumPy.NDArray)? validation_data, System.Func<Tensorflow.Keras.Engine.DataAdapters.DataHandler, Tensorflow.OwnedIterator, System.Collections.Generic.Dictionary<string, float>> train_step_func) Line 660 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model.fit(Tensorflow.NumPy.NDArray x, Tensorflow.NumPy.NDArray y, int batch_size, int epochs, int verbose, System.Collections.Generic.List callbacks, float validation_split, (Tensorflow.NumPy.NDArray val_x, Tensorflow.NumPy.NDArray val_y)? validation_data, bool shuffle, int initial_epoch, int max_queue_size, int workers, bool use_multiprocessing) Line 474 C#
KerasDotNet.dll!KerasDotNet.Program.Main(string[] args) Line 62 C#

Reproduction Steps

Run the code in the snippet above. It is self-contained (it does not read any files or configuration).

  • TensorFlow.NET v0.110.4
  • TensorFlow.Keras v0.11.4
  • SciSharp.TensorFlow.Redist-Windows-GPU v2.10.3
  • Dotnet 7
  • Visual Studio 2022

System specs: Windows 10 x64 RTX 2080 Super

Known Workarounds

None

Configuration and Other Information

No response

PavelBakurov commented 6 months ago

Same here

ThirdStreetDev commented 5 months ago

I'm having the same problem.

PederHP commented 4 months ago

I have the same problem in a slightly different situation: it occurs for batch sizes greater than 1, whereas epoch counts greater than 1 do not trigger it.

le-tan-phuc commented 1 month ago

Hi all, I hit the same problem when implementing a custom loss function. Confusingly, of the two loss functions below, one works and the other throws the null exception in GetDataType().

  1. Custom loss based on MSE (this works):

    public class CustomLoss : ILossFunc
    {
        public string Reduction => "auto";
        public string Name => "custom_loss";
        public Tensor Call(Tensor y_true, Tensor y_pred, Tensor sample_weight = null)
        {
            var mse_loss = tf.reduce_mean(tf.square(y_pred - y_true), axis: -1);
            return mse_loss;
        }
    }

  2. My custom loss function, which converts y_true and y_pred to float arrays, computes the loss from them, and converts the result back to a Tensor (this is where the error arises):

    public class CustomLoss : ILossFunc
    {
        public string Reduction => "auto";
        public string Name => "custom_loss";
        public Tensor Call(Tensor y_true, Tensor y_pred, Tensor sample_weight = null)
        {
            int batch_size = y_true.shape.as_int_list()[0]; //extract the first element of the shape of the tensor

            //convert Tensor to 1D array
            var array_true = y_true.ToArray<float>();
            var array_pred = y_pred.ToArray<float>();

            float[] loss = new float[batch_size];
            //perform some calculations here to compute the loss based on array_true and array_pred
            //.........

            var loss_tf = tf.convert_to_tensor(loss, dtype: TF_DataType.TF_FLOAT, shape: new Shape(batch_size));

            return loss_tf;
        }
    }

The returned tensors mse_loss and loss_tf appear identical in type, dimensions, and so on. Yet the latter throws a null exception in GetDataType().
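One possible reason the two tensors behave differently even though their type and shape match: a tensor produced by tensor operations carries a record of how it was computed from y_pred, while a tensor rebuilt from a plain float array is a fresh constant with no such history. The toy below is a hypothetical, plain C# model of that idea. The `Node` class and its members are invented for illustration and are not TensorFlow.NET's implementation.

```csharp
#nullable enable
using System;

var yPred = new Node(3f);

// Loss computed with a tracked operation keeps its link back to yPred.
Node tracked = yPred.Square();

// Loss computed on a raw float and wrapped again has the same value but no history.
float raw = yPred.Value;
var rebuilt = new Node(raw * raw);

Console.WriteLine(tracked.Value == rebuilt.Value);                 // True: identical values
Console.WriteLine(tracked.GradWrt(yPred)?.ToString() ?? "null");   // 6: d(x^2)/dx at x = 3
Console.WriteLine(rebuilt.GradWrt(yPred)?.ToString() ?? "null");   // null: no path back to yPred

// Toy scalar value that remembers the single input it was computed from.
sealed class Node
{
    public float Value { get; }
    private readonly Node? parent;      // null means "constant, no history"
    private readonly float localGrad;   // derivative of this node with respect to its parent

    public Node(float value, Node? parent = null, float localGrad = 0f)
    {
        Value = value;
        this.parent = parent;
        this.localGrad = localGrad;
    }

    public Node Square() => new Node(Value * Value, this, 2f * Value);

    // Returns d(this)/d(target), or null when no recorded path leads back to target.
    public float? GradWrt(Node target)
    {
        if (ReferenceEquals(this, target)) return 1f;
        if (parent is null) return null;
        float? upstream = parent.GradWrt(target);
        if (upstream is null) return null;
        return localGrad * upstream.Value;
    }
}
```

If the same mechanism applies here, expressing the calculation with tf operations on y_true and y_pred, as the MSE version does, would keep the link that the optimizer needs.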

I've spent hours on this with no luck finding a solution. Any help would be appreciated. Thank you.

AsakusaRinne commented 1 month ago

It seems to be a problem introduced in the latest version, but I'm sorry, I don't have enough time to dig into it deeply right now. GetDataType is related to the native APIs. If you want to debug it, please clone the repo and run in debug mode with the repo as a dependency instead of the NuGet package.

le-tan-phuc commented 1 month ago

Hi @AsakusaRinne, I followed your instructions and got the Call Stack below.

Tensorflow.Binding.dll!Tensorflow.Binding.GetDataType(object data) Line 513 C#
Tensorflow.Binding.dll!Tensorflow.ops.convert_to_tensor(object value, Tensorflow.TF_DataType dtype, string name, bool as_ref, Tensorflow.TF_DataType preferred_dtype, Tensorflow.Contexts.Context ctx) Line 128 C#
Tensorflow.Binding.dll!Tensorflow.tensorflow.convert_to_tensor(object value, Tensorflow.TF_DataType dtype, string name, Tensorflow.TF_DataType preferred_dtype) Line 24 C#
Tensorflow.Binding.dll!Tensorflow.Eager.EagerRunner.AddInputToOp(object inputs, bool add_type_attr, Tensorflow.OpDef.Types.ArgDef input_arg, System.Collections.Generic.List<object> flattened_attrs, System.Collections.Generic.List<Tensorflow.Tensor> flattened_inputs, Tensorflow.Eager.SafeEagerOpHandle op, Tensorflow.Status status) Line 211    C#
Tensorflow.Binding.dll!Tensorflow.Eager.EagerRunner.TFE_FastPathExecute(Tensorflow.FastPathOpExecInfo op_exec_info) Line 126    C#
Tensorflow.Binding.dll!Tensorflow.Contexts.Context.ExecEagerAction(string OpType, string Name, Tensorflow.ExecuteOpArgs args) Line 56   C#
Tensorflow.Binding.dll!Tensorflow.Contexts.Context.ExecuteOp(string opType, string name, Tensorflow.ExecuteOpArgs args) Line 102    C#
Tensorflow.Binding.dll!Tensorflow.gen_training_ops.resource_apply_adam(Tensorflow.Tensor var, Tensorflow.Tensor m, Tensorflow.Tensor v, Tensorflow.Tensor beta1_power, Tensorflow.Tensor beta2_power, Tensorflow.Tensor lr, Tensorflow.Tensor beta1, Tensorflow.Tensor beta2, Tensorflow.Tensor epsilon, Tensorflow.Tensor grad, bool use_locking, bool use_nesterov, string name) Line 27  C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.Adam._resource_apply_dense(Tensorflow.IVariableV1 var, Tensorflow.Tensor grad, System.Collections.Generic.Dictionary<Tensorflow.Keras.Optimizers.DeviceDType, System.Collections.Generic.Dictionary<string, Tensorflow.Tensor>> apply_state) Line 75   C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2.apply_grad_to_update_var(Tensorflow.IVariableV1 var, Tensorflow.Tensor grad, System.Collections.Generic.Dictionary<Tensorflow.Keras.Optimizers.DeviceDType, System.Collections.Generic.Dictionary<string, Tensorflow.Tensor>> apply_state) Line 119    C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2._distributed_apply.AnonymousMethod__1(Tensorflow.ops.NameScope <p0>) Line 142  C#
Tensorflow.Binding.dll!Tensorflow.Binding.tf_with<Tensorflow.ops.NameScope>(Tensorflow.ops.NameScope py, System.Action<Tensorflow.ops.NameScope> action) Line 199   C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2._distributed_apply.AnonymousMethod__0(Tensorflow.ops.NameScope <p0>) Line 140  C#
Tensorflow.Binding.dll!Tensorflow.Binding.tf_with<Tensorflow.ops.NameScope>(Tensorflow.ops.NameScope py, System.Action<Tensorflow.ops.NameScope> action) Line 199   C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2._distributed_apply(System.Collections.Generic.IEnumerable<(Tensorflow.Tensor, Tensorflow.IVariableV1)> grads_and_vars, string name, System.Collections.Generic.Dictionary<Tensorflow.Keras.Optimizers.DeviceDType, System.Collections.Generic.Dictionary<string, Tensorflow.Tensor>> _apply_state) Line 136    C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2.apply_gradients.AnonymousMethod__1(Tensorflow.ops.NameScope <p0>) Line 74  C#
Tensorflow.Binding.dll!Tensorflow.Binding.tf_with<Tensorflow.ops.NameScope, Tensorflow.Operation>(Tensorflow.ops.NameScope py, System.Func<Tensorflow.ops.NameScope, Tensorflow.Operation> action) Line 207 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Optimizers.OptimizerV2.apply_gradients(System.Collections.Generic.IEnumerable<(Tensorflow.Tensor, Tensorflow.IVariableV1)> grads_and_vars, string name, bool experimental_aggregate_gradients) Line 63    C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model._minimize(Tensorflow.Gradients.GradientTape tape, Tensorflow.Keras.Engine.IOptimizer optimizer, Tensorflow.Tensor loss, System.Collections.Generic.List<Tensorflow.IVariableV1> trainable_variables) Line 107    C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model.train_step(Tensorflow.Keras.Engine.DataAdapters.DataHandler data_handler, Tensorflow.Tensors x, Tensorflow.Tensors y) Line 57    C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model.train_step_function(Tensorflow.Keras.Engine.DataAdapters.DataHandler data_handler, Tensorflow.OwnedIterator iterator) Line 16    C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model.FitInternal(Tensorflow.Keras.Engine.DataAdapters.DataHandler data_handler, int epochs, int verbose, System.Collections.Generic.List<Tensorflow.Keras.Engine.ICallback> callbackList, Tensorflow.Util.ValidationDataPack validation_data, System.Func<Tensorflow.Keras.Engine.DataAdapters.DataHandler, Tensorflow.OwnedIterator, System.Collections.Generic.Dictionary<string, float>> train_step_func) Line 282 C#
Tensorflow.Keras.dll!Tensorflow.Keras.Engine.Model.fit(Tensorflow.NumPy.NDArray x, Tensorflow.NumPy.NDArray y, int batch_size, int epochs, int verbose, System.Collections.Generic.List<Tensorflow.Keras.Engine.ICallback> callbacks, float validation_split, Tensorflow.Util.ValidationDataPack validation_data, int validation_step, bool shuffle, System.Collections.Generic.Dictionary<int, float> class_weight, Tensorflow.NumPy.NDArray sample_weight, int initial_epoch, int max_queue_size, int workers, bool use_multiprocessing) Line 85  C#

The null value occurs in EagerRunner.TFE_FastPathExecute.cs, in this function:

public Tensor[] TFE_FastPathExecute(FastPathOpExecInfo op_exec_info)

While debugging, op_exec_info.arg[i] turned out to be null at i = 9, when op_name = "ResourceApplyAdam".

[Image: Capture4]

For reference, this is what it looks like when it runs without error: [image]
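To read the i = 9 finding against the call stack: assuming the op arguments are passed in the same order as the tensor parameters of resource_apply_adam shown in the stack trace, the zero-based index 9 lines up with `grad`. The small sketch below only looks up that position. The array is copied by hand from the signature in the trace, and `nullIndex` is a name chosen for illustration.

```csharp
using System;

// Tensor parameters of resource_apply_adam, in the order shown in the call stack.
string[] adamInputs =
{
    "var", "m", "v", "beta1_power", "beta2_power",
    "lr", "beta1", "beta2", "epsilon", "grad"
};

int nullIndex = 9; // zero-based index reported from the debugger

Console.WriteLine($"Input {nullIndex} of ResourceApplyAdam: {adamInputs[nullIndex]}");
// prints "Input 9 of ResourceApplyAdam: grad"
```

Under that assumption, the null value is the gradient handed to the optimizer, which would point at the loss not producing a gradient for the trainable variables.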

As I'm quite new to TensorFlow, I could only trace the problem back this far. Hopefully it gives you some idea of where the issue is.