SciSharp / TensorFlow.NET

.NET Standard bindings for Google's TensorFlow for developing, training and deploying Machine Learning models in C# and F#.
https://scisharp.github.io/tensorflow-net-docs
Apache License 2.0
3.17k stars, 506 forks

Error when saving model using model.save #1246

Open le-tan-phuc opened 1 month ago

le-tan-phuc commented 1 month ago

Description

Hi, I'm new to TensorFlow.NET. I was trying out the "Toy version of ResNet in Keras" example from the main page and got an error at the line model.save("./toy_resnet_model");

C# threw an exception: System.InvalidOperationException: 'Collection was modified; enumeration operation may not execute.'

I tried to debug and trace the problem and it seems like the exception was thrown at some point within this function:

(MetaGraphDef, Graph, TrackableSaver, AssetInfo, IList<Trackable>, IDictionary<Trackable, IEnumerable<TrackableReference>>) tuple = _build_meta_graph(obj, signatures, options, metaGraphDef);

which is part of

(saved_nodes, node_paths) = SavedModelUtils.save_and_return_nodes(model, filepath, signatures, options);

which is part of

KerasSavedModelUtils.save_model(this, filepath, overwrite, include_optimizer, signatures, options, save_traces);
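
For context, this is the generic .NET error raised when a collection is changed while it is being enumerated. The sketch below is a minimal standalone illustration using a plain List<int>. It is not TensorFlow.NET's code, and the names EnumerationDemo, RemoveEvensUnsafe and RemoveEvensSafe are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class EnumerationDemo
{
    // Removes items from the list while a foreach loop is enumerating it.
    // List<T> detects the change on the next iteration step and throws
    // InvalidOperationException ("Collection was modified; ...").
    public static void RemoveEvensUnsafe(List<int> items)
    {
        foreach (var item in items)
        {
            if (item % 2 == 0)
            {
                items.Remove(item);
            }
        }
    }

    // Enumerates a snapshot copy instead, so the original list can be
    // changed inside the loop without invalidating the enumerator.
    public static void RemoveEvensSafe(List<int> items)
    {
        foreach (var item in new List<int>(items))
        {
            if (item % 2 == 0)
            {
                items.Remove(item);
            }
        }
    }
}
```

Calling EnumerationDemo.RemoveEvensUnsafe on a list that contains an even number throws, while RemoveEvensSafe completes. Separately, the Dictionary<TKey,TValue>.Remove documentation notes that from .NET Core 3.0 onward Remove and Clear no longer invalidate active enumerators, whereas on .NET Framework they do. Whether that runtime difference plays any role in this issue is not established in the thread.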

Packages installed: [screenshot not preserved]

Any help would be appreciated! Thank you.

AdrienDeverin commented 1 month ago

I encountered the same problem months ago (#1017). My conclusion was that some layers (notably the Cropping layer in my case) weren't handled properly. Could you provide the model that caused the problem?

le-tan-phuc commented 1 month ago

Hi @AdrienDeverin, I used the exact example from the TensorFlow.NET GitHub page and ran into this problem. I've attached the code again here for easy reference; the error occurs at the last line, the model.save call. I have also tried changing save_format from "tf" to "h5". That runs without error, but nothing is saved:

using static Tensorflow.Binding;
using static Tensorflow.KerasApi;
using Tensorflow;
using Tensorflow.NumPy;

var layers = keras.layers;
// input layer
var inputs = keras.Input(shape: (32, 32, 3), name: "img");
// convolutional layer
var x = layers.Conv2D(32, 3, activation: "relu").Apply(inputs);
x = layers.Conv2D(64, 3, activation: "relu").Apply(x);
var block_1_output = layers.MaxPooling2D(3).Apply(x);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_1_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
var block_2_output = layers.Add().Apply(new Tensors(x, block_1_output));
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_2_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
var block_3_output = layers.Add().Apply(new Tensors(x, block_2_output));
x = layers.Conv2D(64, 3, activation: "relu").Apply(block_3_output);
x = layers.GlobalAveragePooling2D().Apply(x);
x = layers.Dense(256, activation: "relu").Apply(x);
x = layers.Dropout(0.5f).Apply(x);
// output layer
var outputs = layers.Dense(10).Apply(x);
// build keras model
var model = keras.Model(inputs, outputs, name: "toy_resnet");
model.summary();
// compile keras model in tensorflow static graph
model.compile(optimizer: keras.optimizers.RMSprop(1e-3f),
    loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true),
    metrics: new[] { "acc" });
// prepare dataset
var ((x_train, y_train), (x_test, y_test)) = keras.datasets.cifar10.load_data();
// normalize the input
x_train = x_train / 255.0f;
// training
model.fit(x_train[new Slice(0, 2000)], y_train[new Slice(0, 2000)],
            batch_size: 64,
            epochs: 10,
            validation_split: 0.2f);
// save the model
model.save("./toy_resnet_model");

AdrienDeverin commented 1 month ago

Have you tried an absolute folder path (the folder needs to be created beforehand)? Example: @"C:\GitHub\TestTF\ResnetModel"

Normally you should find a .pb file in it afterwards.

le-tan-phuc commented 1 month ago

Thanks for your prompt response. I've tried a full folder path, but the problem remains. This is what is shown in the output log:

Exception thrown: 'System.InvalidOperationException' in mscorlib.dll
An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
Collection was modified; enumeration operation may not execute.

AdrienDeverin commented 1 month ago

It's really strange. I tried it myself and no problem appeared. Everything went well.

Try with this config (normally it shouldn't matter, yours seems fine):

My imports:

using Tensorflow;               
using Tensorflow.NumPy;   
using Tensorflow.Keras;
using Tensorflow.Keras.Layers;
using Tensorflow.Keras.Saving;
using Tensorflow.Keras.Engine; 
using Tensorflow.Keras.Losses;
using Tensorflow.Keras.Utils;
using Tensorflow.Keras.ArgsDefinition;
using Tensorflow.Keras.ArgsDefinition.Reshaping;
using Tensorflow.Operations.Activation;
using Tensorflow.Operations.Initializers;
using Tensorflow.Common.Types;
using static Tensorflow.KerasApi;
using static Tensorflow.Binding; 
using static Tensorflow.ops;
using static Tensorflow.ApiDef.Types;

AsakusaRinne commented 1 month ago

Hi, I have tried your code but I failed to reproduce it. I ran it and everything seemed to go well.

The only difference between our code is that I changed epochs to 1 and batch_size to 4 so that training completes faster. I guess that doesn't matter.

P.S. I was using the CPU redist package.

le-tan-phuc commented 1 month ago

Hi @AdrienDeverin and @AsakusaRinne, thank you both for your help. Let me explain exactly what happened. I have two computers, A and B. I first tried the example code on computer A, in a new project targeting .NET Framework 4.8, and hit the problem described at the beginning. I tried the things suggested by @AdrienDeverin, but they didn't solve the problem. Later, I created a new project targeting .NET 8.0 on computer A, and it worked like a charm. I then tried to replicate that setup on computer B, also with .NET 8 and all the same NuGet packages installed. It now gives me a different error at model.save:

System.NotImplementedException: ''

Have you experienced this before? This is pretty confusing for me.

P.S. These are the project properties for your reference [screenshot not preserved]. I'm using VS2022 v17.9.6.

AsakusaRinne commented 1 month ago

I didn't manage to reproduce it on my PC. Could you please clone the repo and add a project reference to it, so that a detailed traceback is shown?
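
One way to collect such a traceback in text form is to wrap the failing call and print the whole exception. The helper below is only a sketch: CrashReport and Capture are hypothetical names, not part of TensorFlow.NET, and it relies only on standard .NET APIs (Exception.ToString, RuntimeInformation.FrameworkDescription, Environment.Is64BitProcess).

```csharp
using System;
using System.Runtime.InteropServices;

public static class CrashReport
{
    // Runs the action. If it throws, returns a text report containing the
    // runtime description, the process bitness, and the full exception text.
    // Returns an empty string when the action succeeds.
    public static string Capture(Action action)
    {
        try
        {
            action();
            return "";
        }
        catch (Exception ex)
        {
            // Concatenating ex calls Exception.ToString(), which includes the
            // exception type, message, stack trace, and any inner exceptions.
            return "Runtime: " + RuntimeInformation.FrameworkDescription + Environment.NewLine
                 + "64-bit process: " + Environment.Is64BitProcess + Environment.NewLine
                 + ex;
        }
    }
}
```

With the example from this thread, the call would look like Console.WriteLine(CrashReport.Capture(() => model.save("./toy_resnet_model"))); the stack frames are most useful when the project references the repo sources, as suggested above.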

AdrienDeverin commented 1 month ago

Same here: I tried to reproduce your bug, but it's working correctly on my computer... :/ (For the record, I was testing on .NET 6.0.)

Another idea for finding where the problem lies: since you have done what I said earlier, you could run in debug mode and step through the code to see where it fails (and compare with computer A)...

le-tan-phuc commented 1 month ago

Hi all, thank you for your support. I've tried again with a fresh .NET 8.0 project in a local folder on computer B and it works perfectly now. I guess the earlier System.NotImplementedException may have occurred because the project folder was placed in OneDrive on computer A and got synced to computer B, so some files were missing when it ran on computer B. I also tried .NET 6.0, as @AdrienDeverin did, and it works as well now. Nevertheless, a new project based on .NET Framework 4.5 still has the original problem, maybe it's not supported.

AsakusaRinne commented 1 month ago

Nevertheless, a new project based on .NET Framework 4.5 still has the original problem, maybe it's not supported.

.NET Framework 4.5 is expected to be supported. If you'd like to dig into it, could you please run against the TF.NET repo and post the detailed traceback here?

le-tan-phuc commented 1 month ago

I found a way to make a .NET Framework-based app work, but I'm not sure whether this is a bug. Let me detail the process so that anyone facing the same issue knows how to get through it. The original problem was:

  1. I created a new C# WinForms app targeting .NET Framework 4.8 and set it to build for x64 only.
  2. I installed the NuGet packages TensorFlow.NET, SciSharp.TensorFlow.Redist, and TensorFlow.Keras. To install TensorFlow.Keras successfully, I had to install PureHDF separately first (with "Include prerelease" checked).
  3. I copied the example from the main SciSharp GitHub page.
  4. I copied tensorflow.dll from the SciSharp.TensorFlow.Redist.2.16.0 package into the debug folder (to clear the "backend not found" exception).
  5. I got System.InvalidOperationException: 'Collection was modified; enumeration operation may not execute.' at model.save.
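
The two preconditions from steps 1 and 4 (a 64-bit process and tensorflow.dll next to the executable) can be checked at startup before any TensorFlow.NET call. This is a sketch under assumptions: BackendCheck and FindProblems are hypothetical names, and the native file name is taken from step 4.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BackendCheck
{
    // Returns a list of problems found, or an empty array when both checks pass.
    public static string[] FindProblems(string outputDirectory, string nativeFileName = "tensorflow.dll")
    {
        var problems = new List<string>();

        // Step 1 of the list above: the app was set to x64 only.
        if (!Environment.Is64BitProcess)
        {
            problems.Add("Process is not 64-bit.");
        }

        // Step 4 of the list above: the native library was copied to the output folder.
        string nativePath = Path.Combine(outputDirectory, nativeFileName);
        if (!File.Exists(nativePath))
        {
            problems.Add("Missing native library: " + nativePath);
        }

        return problems.ToArray();
    }
}
```

A typical call would be BackendCheck.FindProblems(AppContext.BaseDirectory), printing each returned line. This check does not address the InvalidOperationException itself; it only rules out the setup problems mentioned in the steps.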

How I got it to work:

  1. Download the entire TensorFlow.NET repo and create a new .NET Framework 4.8 project within the TensorFlow.NET solution.
  2. Install the NuGet packages TensorFlow.NET, SciSharp.TensorFlow.Redist, and TensorFlow.Keras so that all the dependencies get installed. After that, uninstall TensorFlow.NET and TensorFlow.Keras.
  3. Copy tensorflow.dll into the debug folder.
  4. In the new project, add references to the Tensorflow.Binding and Tensorflow.Keras projects from the repo.
  5. The application now runs smoothly without error.

When I checked the output debug folder, I noticed that Tensorflow.Binding.dll and Tensorflow.Keras.dll differ in size between the original project and the working one. Copying these two files from the working project's folder into the original project's folder fixes the error too. So I guess there are some differences in Tensorflow.Binding.dll and Tensorflow.Keras.dll between the released NuGet packages and the repo. Do you have any idea about this, @AsakusaRinne?
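
To make that comparison more precise than file size alone, the version metadata of the two builds can be printed side by side. The helper below is a sketch that uses only standard .NET APIs (FileInfo, AssemblyName.GetAssemblyName, FileVersionInfo); AssemblyInfoDump and Describe are hypothetical names, and any paths passed in are the reader's own.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Reflection;

public static class AssemblyInfoDump
{
    // Returns one line with the file size, assembly version, and file version
    // of a managed DLL. The assembly is inspected without being loaded.
    public static string Describe(string dllPath)
    {
        long sizeBytes = new FileInfo(dllPath).Length;
        Version assemblyVersion = AssemblyName.GetAssemblyName(dllPath).Version;
        string fileVersion = FileVersionInfo.GetVersionInfo(dllPath).FileVersion;

        return Path.GetFileName(dllPath) + ": " + sizeBytes + " bytes, assembly "
             + assemblyVersion + ", file " + fileVersion;
    }
}
```

Calling Describe once on the NuGet copy and once on the repo build of Tensorflow.Binding.dll (and likewise for Tensorflow.Keras.dll) shows whether the two are different versions or merely different builds of the same version. It only works for managed assemblies, so the native tensorflow.dll is out of scope.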