benoitkoenig opened this issue 2 days ago
This only happens when loading a model, not when creating it with `tf.model()`. It is due to the way layer names are incremented in @tensorflow/tfjs-layers: when creating a new layer, an internal counter is incremented to ensure that the layer name is unique, but that counter does not take loaded models into account.

When loading the existing model, its variables are likely named "conv2d_Conv2D1/bias" and "conv2d_Conv2D1/kernel", so when the model is replicated, the new variables are named "conv2d_Conv2D1/bias_1" and "conv2d_Conv2D1/kernel_1". It appears that tfjs does not serialize those variable names, so when the worker thread loads the saved weights, it names them "conv2d_Conv2D1/bias" and "conv2d_Conv2D1/kernel", which on the main thread are the names of the variables of the disposed model.
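To make that naming behaviour concrete, here is a minimal sketch. The model path and the import/call shape of `tfjs-replicate-layers-model` are assumptions for illustration, and the logged names mirror the example above rather than an actual run:

```js
const tf = require('@tensorflow/tfjs-node');
// Assumed import shape for tfjs-replicate-layers-model.
const { replicateLayersModel } = require('tfjs-replicate-layers-model');

(async () => {
  // Loading a saved model restores its original layer names (e.g. "conv2d_Conv2D1"),
  // but does not bump the internal name counters in @tensorflow/tfjs-layers.
  const original = await tf.loadLayersModel('file://./model/model.json');
  console.log(original.weights.map((w) => w.name));
  // -> e.g. [ 'conv2d_Conv2D1/kernel', 'conv2d_Conv2D1/bias' ]

  // Replicating creates fresh variables; to keep names unique within this thread,
  // tfjs appends a suffix, so the replicated model no longer matches the saved names.
  const replicated = replicateLayersModel(original);
  original.dispose();
  console.log(replicated.weights.map((w) => w.name));
  // -> e.g. [ 'conv2d_Conv2D1/kernel_1', 'conv2d_Conv2D1/bias_1' ]
})();
```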
This is really tricky and I'm not sure how to proceed.
Currently, it seems that tfjs guarantees that variable names are unique (per thread). Since layer names are saved into the model files, they cannot be guaranteed to be unique (we could load two models with layers that share the same name). I'll try to write a minimal reproduction repository and maybe check with tensorflow/tfjs whether they have a recommendation here. The best solution I can offer currently is to use `tfjs-replicate-layers-model` in a separate script that writes the resulting model to file, and then use that updated model with multi-threading; see the sketch below.
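A sketch of that workaround, with the same caveat that the import shape of the library and the file paths are assumptions:

```js
// replicate-once.js: run this as a standalone, single-threaded step before training.
const tf = require('@tensorflow/tfjs-node');
const { replicateLayersModel } = require('tfjs-replicate-layers-model'); // assumed import shape

(async () => {
  const original = await tf.loadLayersModel('file://./model/model.json');
  const replicated = replicateLayersModel(original);
  original.dispose();

  // Persist the replicated model. The multi-threaded training script then loads this
  // file directly on every thread, so main and worker threads see the same variable names.
  await replicated.save('file://./replicated-model');
})();
```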
I have an issue when using `tfjs-replicate-layers-model` and working with multiple threads. The process follows this logic (sketched in code after the list):

1. In the main process, I load an existing model and replicate it using `tfjs-replicate-layers-model`. I dispose the original model and save the replicated model to file.
2. In a worker thread, I load the model from file, generate some data, and compute gradients. I then serialize those gradients and send them back to the main thread.
3. Back in the main thread, I receive the gradients, de-serialize them, and try to apply them, which fails with: `Error: Argument 'x' passed to 'zerosLike' must be a Tensor or TensorLike, but got 'null'`.
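A condensed sketch of that flow. The file paths, the loss function, and the optimizer are placeholders, not taken from the actual project:

```js
// worker.js: load the saved model, compute gradients, and send them back keyed by variable name.
const tf = require('@tensorflow/tfjs-node');
const { parentPort } = require('worker_threads');

(async () => {
  const model = await tf.loadLayersModel('file://./replicated-model/model.json');

  // computeLoss is a placeholder for generating data and evaluating the model on it.
  const { value, grads } = tf.variableGrads(() => computeLoss(model));

  // Gradients come back keyed by variable name, e.g. "conv2d_Conv2D1/kernel".
  const serialized = {};
  for (const [name, grad] of Object.entries(grads)) {
    serialized[name] = { values: grad.dataSync(), shape: grad.shape };
  }
  parentPort.postMessage({ loss: value.dataSync()[0], grads: serialized });
})();
```

```js
// main.js: rebuild the tensors and apply them. Assumes the replicated model from
// step 1 is still held in memory on this thread, so its variables are registered here.
const tf = require('@tensorflow/tfjs-node');
const { Worker } = require('worker_threads');

const optimizer = tf.train.sgd(0.01); // placeholder optimizer
const worker = new Worker('./worker.js');

worker.on('message', ({ grads }) => {
  const namedGrads = {};
  for (const [name, { values, shape }] of Object.entries(grads)) {
    namedGrads[name] = tf.tensor(values, shape);
  }
  // Throws "Argument 'x' passed to 'zerosLike' must be a Tensor or TensorLike, but got
  // 'null'" when a gradient name has no matching registered variable on this thread.
  optimizer.applyGradients(namedGrads);
});
```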
It turns out that there is a name mismatch between the variables in the main and worker threads. The worker thread sent gradients named "conv2d_Conv2D1/bias" and "conv2d_Conv2D1/kernel", but on the main thread, `tf.engine().registeredVariables` contains "conv2d_Conv2D1/bias_1" and "conv2d_Conv2D1/kernel_1".