[BUG Report]: 'Invalid tape state' when training model

PhasonMatrix commented 6 months ago

Description

I have built a U-Net convolutional network. When calling model.fit() I get an exception:

Message: 
  Tensorflow.RuntimeError : Invalid tape state.

Stack Trace: 
  Tape.ComputeGradient(Int64[] target_tensor_ids, Int64[] source_tensor_ids, UnorderedMap`2 sources_that_are_targets, List`1 output_gradients, Boolean build_default_zeros_grads)
  EagerRunner.TFE_TapeGradient(ITape tape, Tensor[] target, Tensor[] sources, List`1 output_gradients, Tensor[] sources_raw, String unconnected_gradients)
  GradientTape.gradient(Tensor target, IEnumerable`1 sources, List`1 output_gradients, String unconnected_gradients)
  Model._minimize(GradientTape tape, IOptimizer optimizer, Tensor loss, List`1 trainable_variables)
  Model.train_step(DataHandler data_handler, Tensors x, Tensors y)
  Model.train_step_function(DataHandler data_handler, OwnedIterator iterator)
  Model.FitInternal(DataHandler data_handler, Int32 epochs, Int32 verbose, List`1 callbackList, ValidationDataPack validation_data, Func`3 train_step_func)
  Model.fit(NDArray x, NDArray y, Int32 batch_size, Int32 epochs, Int32 verbose, List`1 callbacks, Single validation_split, ValidationDataPack validation_data, Int32 validation_step, Boolean shuffle, Dictionary`2 class_weight, NDArray sample_weight, Int32 initial_epoch, Int32 max_queue_size, Int32 workers, Boolean use_multiprocessing)
  UNet.Train(NDArray x, NDArray y) line 68

Reproduction Steps

Summary of the U-Net model:

Model: U-Net 
__________________________________________________________________________________________________ 
Layer (type)                     Output Shape          Param #     Connected to                    
================================================================================================== 
image(InputLayer)                (None, 256, 256, 1)   0                                           
__________________________________________________________________________________________________ 
conv2d(Conv2D)                   (None, 256, 256, 16)  160         image[0][0]                     
__________________________________________________________________________________________________ 
dropout(Dropout)                 (None, 256, 256, 16)  0           conv2d[0][0]                    
__________________________________________________________________________________________________ 
conv2d_1(Conv2D)                 (None, 256, 256, 16)  2320        dropout[0][0]                   
__________________________________________________________________________________________________ 
max_pooling2d(MaxPooling2D)      (None, 128, 128, 16)  0           conv2d_1[0][0]                  
__________________________________________________________________________________________________ 
conv2d_2(Conv2D)                 (None, 128, 128, 32)  4640        max_pooling2d[0][0]             
__________________________________________________________________________________________________ 
dropout_1(Dropout)               (None, 128, 128, 32)  0           conv2d_2[0][0]                  
__________________________________________________________________________________________________ 
conv2d_3(Conv2D)                 (None, 128, 128, 32)  9248        dropout_1[0][0]                 
__________________________________________________________________________________________________ 
max_pooling2d_1(MaxPooling2D)    (None, 64, 64, 32)    0           conv2d_3[0][0]                  
__________________________________________________________________________________________________ 
conv2d_4(Conv2D)                 (None, 64, 64, 64)    18496       max_pooling2d_1[0][0]           
__________________________________________________________________________________________________ 
dropout_2(Dropout)               (None, 64, 64, 64)    0           conv2d_4[0][0]                  
__________________________________________________________________________________________________ 
conv2d_5(Conv2D)                 (None, 64, 64, 64)    36928       dropout_2[0][0]                 
__________________________________________________________________________________________________ 
max_pooling2d_2(MaxPooling2D)    (None, 32, 32, 64)    0           conv2d_5[0][0]                  
__________________________________________________________________________________________________ 
conv2d_6(Conv2D)                 (None, 32, 32, 128)   73856       max_pooling2d_2[0][0]           
__________________________________________________________________________________________________ 
dropout_3(Dropout)               (None, 32, 32, 128)   0           conv2d_6[0][0]                  
__________________________________________________________________________________________________ 
conv2d_7(Conv2D)                 (None, 32, 32, 128)   147584      dropout_3[0][0]                 
__________________________________________________________________________________________________ 
max_pooling2d_3(MaxPooling2D)    (None, 16, 16, 128)   0           conv2d_7[0][0]                  
__________________________________________________________________________________________________ 
conv2d_8(Conv2D)                 (None, 16, 16, 256)   295168      max_pooling2d_3[0][0]           
__________________________________________________________________________________________________ 
dropout_4(Dropout)               (None, 16, 16, 256)   0           conv2d_8[0][0]                  
__________________________________________________________________________________________________ 
conv2d_9(Conv2D)                 (None, 16, 16, 256)   590080      dropout_4[0][0]                 
__________________________________________________________________________________________________ 
conv2d_transpose(Conv2DTranspose (None, 32, 32, 128)   131200      conv2d_9[0][0]                  
__________________________________________________________________________________________________ 
concatenate(Concatenate)         (None, 32, 32, 256)   0           conv2d_transpose[0][0]          
                                                                   conv2d_7[0][0]                  
__________________________________________________________________________________________________ 
conv2d_10(Conv2D)                (None, 32, 32, 128)   295040      concatenate[0][0]               
__________________________________________________________________________________________________ 
dropout_5(Dropout)               (None, 32, 32, 128)   0           conv2d_10[0][0]                 
__________________________________________________________________________________________________ 
conv2d_11(Conv2D)                (None, 32, 32, 128)   147584      dropout_5[0][0]                 
__________________________________________________________________________________________________ 
conv2d_transpose_1(Conv2DTranspo (None, 64, 64, 64)    32832       conv2d_11[0][0]                 
__________________________________________________________________________________________________ 
concatenate_1(Concatenate)       (None, 64, 64, 128)   0           conv2d_transpose_1[0][0]        
                                                                   conv2d_5[0][0]                  
__________________________________________________________________________________________________ 
conv2d_12(Conv2D)                (None, 64, 64, 64)    73792       concatenate_1[0][0]             
__________________________________________________________________________________________________ 
dropout_6(Dropout)               (None, 64, 64, 64)    0           conv2d_12[0][0]                 
__________________________________________________________________________________________________ 
conv2d_13(Conv2D)                (None, 64, 64, 64)    36928       dropout_6[0][0]                 
__________________________________________________________________________________________________ 
conv2d_transpose_2(Conv2DTranspo (None, 128, 128, 32)  8224        conv2d_13[0][0]                 
__________________________________________________________________________________________________ 
concatenate_2(Concatenate)       (None, 128, 128, 64)  0           conv2d_transpose_2[0][0]        
                                                                   conv2d_3[0][0]                  
__________________________________________________________________________________________________ 
conv2d_14(Conv2D)                (None, 128, 128, 32)  18464       concatenate_2[0][0]             
__________________________________________________________________________________________________ 
dropout_7(Dropout)               (None, 128, 128, 32)  0           conv2d_14[0][0]                 
__________________________________________________________________________________________________ 
conv2d_15(Conv2D)                (None, 128, 128, 32)  9248        dropout_7[0][0]                 
__________________________________________________________________________________________________ 
conv2d_transpose_3(Conv2DTranspo (None, 256, 256, 16)  2064        conv2d_15[0][0]                 
__________________________________________________________________________________________________ 
concatenate_3(Concatenate)       (None, 256, 256, 32)  0           conv2d_transpose_3[0][0]        
                                                                   conv2d_1[0][0]                  
__________________________________________________________________________________________________ 
conv2d_16(Conv2D)                (None, 256, 256, 16)  4624        concatenate_3[0][0]             
__________________________________________________________________________________________________ 
dropout_8(Dropout)               (None, 256, 256, 16)  0           conv2d_16[0][0]                 
__________________________________________________________________________________________________ 
conv2d_17(Conv2D)                (None, 256, 256, 16)  2320        dropout_8[0][0]                 
__________________________________________________________________________________________________ 
conv2d_18(Conv2D)                (None, 256, 256, 1)   17          conv2d_17[0][0]                 
================================================================================================== 
Total params: 1940817 
Trainable params: 1940817 
Non-trainable params: 0
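
As a sanity check on the summary above, the parameter counts can be recomputed from the channel counts alone. The sketch below assumes 3x3 kernels for the Conv2D blocks, 2x2 kernels for the Conv2DTranspose layers, a 1x1 kernel for the final Conv2D, and biases everywhere. These kernel sizes are inferred from the numbers in the table, not stated in this issue.

```python
# Recompute the per-layer parameter counts listed in the model summary.
# ASSUMPTION: kernel sizes (3x3 conv, 2x2 transpose conv, 1x1 output conv)
# are inferred from the reported counts; the issue does not state them.

def conv_params(kernel, in_ch, out_ch):
    """kernel*kernel*in_ch*out_ch weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch

# (kernel, in_channels, out_channels), in the order the summary lists them.
encoder = [
    (3, 1, 16), (3, 16, 16),
    (3, 16, 32), (3, 32, 32),
    (3, 32, 64), (3, 64, 64),
    (3, 64, 128), (3, 128, 128),
    (3, 128, 256), (3, 256, 256),
]
decoder = [
    # transpose conv, then two convs; the first conv sees the concatenated channels
    (2, 256, 128), (3, 256, 128), (3, 128, 128),
    (2, 128, 64), (3, 128, 64), (3, 64, 64),
    (2, 64, 32), (3, 64, 32), (3, 32, 32),
    (2, 32, 16), (3, 32, 16), (3, 16, 16),
]
head = [(1, 16, 1)]

counts = [conv_params(*layer) for layer in encoder + decoder + head]
total = sum(counts)
print(counts[0], counts[10], counts[-1], total)  # 160 131200 17 1940817
```

Under those assumptions every layer matches the table and the total matches the reported 1940817, which suggests the graph was built as intended and the failure lies in gradient computation rather than in model construction.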

The code that builds the model is in my previous bug report: https://github.com/SciSharp/TensorFlow.NET/issues/1219

I call the fit method with:

_model.fit(x, y, batch_size:16, verbose:1, epochs:1, shuffle:false);

The input and labels passed in ('x' and 'y') are both NDArrays with shape (1872, 256, 256, 1), that is, 1872 grey-scale images of 256x256 px.

I have searched online and found only one StackOverflow answer, which says that the labels (y) must be passed in; I am already doing that.

I have the same model in Python and can train it with X and Y numpy arrays of similar shape (8100, 256, 256, 1); the only difference is the number of images.

Known Workarounds

No response

Configuration and Other Information

OS: Windows 11
.NET: 6.0
SciSharp.TensorFlow.Redist: 2.16.0
TensorFlow.Keras: 0.15.0
TensorFlow.Net: 0.150.0

SIARIAymane commented 3 months ago

Hello,

I wish to report that I am also experiencing the issue described here, namely Tensorflow.RuntimeError: Invalid tape state.

Here are some details about my environment:

TensorFlow.NET Version: 0.150.0
Operating System: Windows 10
IDE: Visual Studio 2022
Usage Scenario: Training a U-Net model for image segmentation.

I am interested in any suggestions or solutions that may have been found since this issue was created. Moreover, if additional information from my side could help resolve this issue, I would be happy to provide it.

Thank you very much for your attention and for any effort aimed at resolving this issue. It is very important to me and my project.

Kind regards, Aymane.

AsakusaRinne commented 3 months ago

Hi, I'm one of the maintainers of TensorFlow.NET. Unfortunately, none of the main maintainers of this repo is available at the moment. We won't reject PRs, but we don't currently have time to fix bugs or add features. I'm sorry about that.

I ran into the same problem once during development. In general, this error is caused by incorrect traced graph structure information or by invalid backward ops in your model.

The tape works as follows: it records the nodes and edges of the graph, which is traced while the model runs. When gradients are requested, it pops the nodes in topological order, beginning from the output node(s).
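
To make that description concrete, here is a toy tape in plain Python. It is only an illustration of the record-then-walk-backward idea and is not TensorFlow.NET's implementation; the class `ToyTape`, its methods, and the string tensor ids are all hypothetical.

```python
from collections import namedtuple

# One recorded operation: the tensor ids it consumed and produced.
# Everything here is a hypothetical toy, not the TensorFlow.NET API.
OpRecord = namedtuple("OpRecord", ["op_id", "name", "inputs", "outputs"])


class ToyTape:
    def __init__(self):
        self.ops = []        # records, in execution order
        self.producer = {}   # tensor id -> the op that produced it

    def record(self, name, inputs, outputs):
        """Called once per executed op while the model runs forward."""
        op = OpRecord(len(self.ops), name, tuple(inputs), tuple(outputs))
        self.ops.append(op)
        for tensor_id in outputs:
            self.producer[tensor_id] = op
        return op

    def backward_order(self, target):
        """Names of the ops a gradient pass would visit, output node first."""
        needed = {}
        stack = [target]
        while stack:
            tensor_id = stack.pop()
            op = self.producer.get(tensor_id)
            if op is None or op.op_id in needed:
                continue  # a leaf (variable or input data), or already queued
            needed[op.op_id] = op
            stack.extend(op.inputs)
        # Execution order is already a topological order of the graph,
        # so visiting the needed ops in reverse gives a valid backward order.
        return [needed[i].name for i in sorted(needed, reverse=True)]


tape = ToyTape()
tape.record("conv", ["x", "w"], ["a"])
tape.record("relu", ["a"], ["b"])
tape.record("loss", ["b", "y"], ["l"])
order = tape.backward_order("l")
print(order)  # ['loss', 'relu', 'conv']
```

In this toy, an op that was executed but never passed to `record` would silently cut the walk short; a real tape is stricter about such inconsistencies, which is one plausible route to an error like the one reported.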

If you want to debug it, first narrow the scope by finding the smallest model structure that reproduces the problem. Then run it against the source code and inspect the records in the tape. You should eventually find which operation is missing from the tape information, and after that you can try to fix it. Good luck!
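
One way to approach the "find the missing operation" step is to dump the tape records and look for tensors that are consumed but never produced. The sketch below assumes the records have been exported by hand as `(op_name, input_ids, output_ids)` tuples; that format, the function name `find_untracked_tensors`, and the tiny skip-connection graph are all hypothetical and do not correspond to any TensorFlow.NET API.

```python
def find_untracked_tensors(records, target, leaves):
    """Report tensors needed for the gradient that no recorded op produces.

    records: list of (op_name, input_ids, output_ids), in execution order.
    target:  id of the loss tensor.
    leaves:  ids where the walk may legitimately stop (variables, input data).
    Returns a sorted list of (tensor_id, consuming_op_name) pairs.
    """
    producer = {}
    for name, inputs, outputs in records:
        for tensor_id in outputs:
            producer[tensor_id] = (name, inputs)

    gaps, seen = set(), set()
    stack = [(target, "<target>")]
    while stack:
        tensor_id, consumer = stack.pop()
        if tensor_id in leaves or tensor_id in seen:
            continue
        seen.add(tensor_id)
        if tensor_id not in producer:
            gaps.add((tensor_id, consumer))  # consumed, but nothing produced it
            continue
        name, inputs = producer[tensor_id]
        stack.extend((i, name) for i in inputs)
    return sorted(gaps)


# A minimal graph with one skip connection, loosely shaped like a U-Net block.
# The record for the upsampling op that should produce "u" is deliberately absent.
records = [
    ("conv", ["x", "w1"], ["c"]),
    ("pool", ["c"], ["p"]),
    ("concat", ["u", "c"], ["k"]),
    ("loss", ["k", "y"], ["l"]),
]
gaps = find_untracked_tensors(records, "l", leaves={"x", "y", "w1", "w2"})
print(gaps)  # [('u', 'concat')]
```

Here the report points at `concat` consuming `u`, so the op that should have produced `u` is the one to inspect. Applied to a reduced model, the same kind of check could narrow the search to a single layer type.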

SIARIAymane commented 3 months ago

Thank you for your advice
