benoitkoenig / tfjs-replicate-layers-model

Utility function to help update the input/output layers of a LayersModel with tensorflow.js

Name mismatch between the variable names of my LayersModel when working on multiple threads #1

Open benoitkoenig opened 1 week ago

benoitkoenig commented 1 week ago

I have an issue when using tfjs-replicate-layers-model and working on multiple threads. The process follows this logic:

In the main process, I load an existing model and replicate it using tfjs-replicate-layers-model. I dispose the original model and save the replicated model to file.

In a worker thread, I load the model from file, generate some data, and compute gradients. I then serialize those gradients and send them back to the main thread.

Back in the main thread, I receive the gradients, de-serialize them, try to apply them and get the following error: "Error: Argument 'x' passed to 'zerosLike' must be a Tensor or TensorLike, but got 'null'"
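For context, here is a minimal sketch of that flow on the main-thread side (the `replicateLayersModel` import name, the file paths, the worker messaging format, and the choice of optimizer are all assumptions for illustration, not the actual code):

```ts
import * as tf from "@tensorflow/tfjs-node";
import { Worker } from "worker_threads";
// Hypothetical import name; the actual export of tfjs-replicate-layers-model may differ.
import { replicateLayersModel } from "tfjs-replicate-layers-model";

async function main() {
  // 1. Load the existing model, replicate it, dispose the original, save the replica.
  const original = await tf.loadLayersModel("file://./model/model.json");
  const replica = replicateLayersModel(original /* , new input/output layers */);
  original.dispose();
  await replica.save("file://./replica");

  // 2. worker.js loads ./replica, generates data, computes gradients (e.g. with
  //    tf.variableGrads), serializes them as plain arrays, and posts them back.
  const worker = new Worker("./worker.js");

  worker.on("message", (serialized: Record<string, { data: number[]; shape: number[] }>) => {
    // 3. De-serialize the gradients and apply them. The optimizer looks each gradient
    //    up by name in tf.engine().registeredVariables, which is where the
    //    "zerosLike ... got 'null'" error surfaces when the names do not match.
    const grads: Record<string, tf.Tensor> = {};
    for (const [name, { data, shape }] of Object.entries(serialized)) {
      grads[name] = tf.tensor(data, shape);
    }
    tf.train.adam(0.001).applyGradients(grads);
  });
}

main();
```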

It turns out that there is a name mismatch between the variables in the main and worker threads. The worker thread sent gradients named "conv2d_Conv2D1/bias" and "conv2d_Conv2D1/kernel", but on the main thread, tf.engine().registeredVariables contains "conv2d_Conv2D1/bias_1" and "conv2d_Conv2D1/kernel_1".
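A quick way to see the mismatch is to log the registered variable names on each thread right before the gradients are produced/applied:

```ts
import * as tf from "@tensorflow/tfjs-node";

// Run this on both threads at the point where the gradients are serialized/applied.
console.log(Object.keys(tf.engine().registeredVariables));
// Worker thread: [ "conv2d_Conv2D1/kernel",   "conv2d_Conv2D1/bias",   ... ]
// Main thread:   [ "conv2d_Conv2D1/kernel_1", "conv2d_Conv2D1/bias_1", ... ]
```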

benoitkoenig commented 1 week ago

This name mismatch comes from a series of behaviors that, combined, hit an edge case:

  1. The loaded model's layers and the replicated model's layers share the same name

This only happens when loading a model, not when creating one with tf.model. It is due to the way layer names are incremented in @tensorflow/tfjs-layers: when a new layer is created, an internal counter is incremented to ensure that its name is unique, but that counter does not take loaded models into account.

  2. The name of the layer's variable does not appear to be serialized

When loading the existing model, its variables are likely named "conv2d_Conv2D1/bias" and "conv2d_Conv2D1/kernel", so when the model is replicated, the replica's variables are named "conv2d_Conv2D1/bias_1" and "conv2d_Conv2D1/kernel_1". It appears that tfjs does not serialize those variable names, so when the worker thread loads the saved weights, it names them "conv2d_Conv2D1/bias" and "conv2d_Conv2D1/kernel", which, on the main thread, are the names of the variables of the disposed model (see the sketch below).
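A minimal sketch of how the two behaviors combine (the layer and variable names are the ones from the error above; the exact layer index and suffix depend on what else lives in the registry, and the `replicateLayersModel` import name is an assumption):

```ts
import * as tf from "@tensorflow/tfjs-node";
// Hypothetical import name; the actual export of tfjs-replicate-layers-model may differ.
import { replicateLayersModel } from "tfjs-replicate-layers-model";

async function demo() {
  // Loading restores the serialized layer names as-is; the internal layer-name
  // counter in @tensorflow/tfjs-layers is not advanced for them.
  const original = await tf.loadLayersModel("file://./model/model.json");
  console.log(original.layers.map((l) => l.name)); // e.g. [ ..., "conv2d_Conv2D1", ... ]

  // The replica reuses the same layer names, but variable names must be unique
  // per thread, so tfjs appends a suffix when the replica's weights are created.
  const replica = replicateLayersModel(original);
  console.log(Object.keys(tf.engine().registeredVariables));
  // [ ..., "conv2d_Conv2D1/kernel",   "conv2d_Conv2D1/bias",
  //        "conv2d_Conv2D1/kernel_1", "conv2d_Conv2D1/bias_1" ]

  // The "_1" suffix is not written out by replica.save(), so a fresh thread that
  // loads the saved replica creates its variables under the un-suffixed names
  // "conv2d_Conv2D1/kernel" and "conv2d_Conv2D1/bias".
  await replica.save("file://./replica");
}

demo();
```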

How to fix this issue

This is really tricky and I'm not sure how to proceed. Currently, tfjs seems to guarantee that variable names are unique (per thread). Since layer names are saved into the model files, they cannot be guaranteed to be unique (we could load two models that have a layer with the same name). I'll try to write a minimal reproduction repository and maybe ask tensorflow/tfjs whether they have a recommendation here. The best workaround I can offer for now is to run tfjs-replicate-layers-model in a separate script that writes the resulting model to a file, and to use that updated model for the multi-threaded work.
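A minimal sketch of that workaround (the script name, paths, and the `replicateLayersModel` import are assumptions): the replication and its suffixed variable names stay confined to their own process, and only the saved file is shared with the multi-threaded program.

```ts
// prepare-model.ts — run once, on its own, before any multi-threaded training.
import * as tf from "@tensorflow/tfjs-node";
// Hypothetical import name; the actual export of tfjs-replicate-layers-model may differ.
import { replicateLayersModel } from "tfjs-replicate-layers-model";

async function prepare() {
  const original = await tf.loadLayersModel("file://./model/model.json");
  const replica = replicateLayersModel(original /* , new input/output layers */);
  await replica.save("file://./replica");
}

prepare();
```

The main thread and every worker then call `tf.loadLayersModel("file://./replica/model.json")` into a fresh registry, so they should end up with identically named variables and the serialized gradients should apply cleanly.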