Open innat opened 8 months ago
Hi @djherbis, could you please provide any information regarding this issue? Are there any blockers to using tpu-vm at the moment?
@innat Could you share a public notebook with the complete code? That makes it a bit easier to debug, thanks!
Hey, have you confirmed that Keras is using Tensorflow under the hood? I took a quick try at this: I switched to tf-cpu, removed the TPU VM + tensorflow related code, and switched the Keras backend to JAX, and then I think it works?
I don't fully get your point. However, I was able to run keras with all backends (tf, torch, jax) on cpu and gpu. But as shown in the above gist, it didn't run on tpu-vm.
I have run the above gist again with both the keras+tensorflow and keras+jax setups for tpu, and both fail to run the program.
I meant that when I ran it with JAX, without tensorflow, on tpu-vm, it worked: https://www.kaggle.com/code/herbison/keras-jax-tpu-vm-model-build-test
It's not too uncommon for something to work on CPU/GPU and not on TPU, since the actual underlying systems are different.
If possible, using the JAX example might be a path forward.
Ah, I see.
I also tried the following without installing tf-cpu; it didn't work though.
tf.config.set_visible_devices([], "TPU")
import keras, jax
devices = jax.devices("tpu")
data_parallel = keras.distribution.DataParallel(devices=devices)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[5], line 4
1 import keras, jax
----> 4 data_parallel = keras.distribution.DataParallel(devices=devices)
5 keras.distribution.set_distribution(data_parallel)
File /usr/local/lib/python3.10/site-packages/keras/src/distribution/distribution_lib.py:400, in DataParallel.__init__(self, device_mesh, devices)
398 self._batch_dim_name = self.device_mesh.axis_names[0]
399 # Those following attributes might get convert to public methods.
--> 400 self._num_process = distribution_lib.num_processes()
401 self._process_id = distribution_lib.process_id()
402 self._is_multi_process = self._num_process > 1
AttributeError: module 'keras.src.backend.tensorflow.distribution_lib' has no attribute 'num_processes'
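The module path in the error (keras.src.backend.tensorflow.distribution_lib) suggests that Keras was running on the tensorflow backend when DataParallel was constructed. A minimal sketch of an early check, assuming the distribution API in this Keras version is only usable with the jax backend; `require_jax_backend` is a hypothetical helper, not part of Keras:

```python
def require_jax_backend(backend_name: str) -> None:
    """Fail early with a readable message if Keras is not running on JAX.

    Hypothetical helper for illustration; pass it the active backend name.
    """
    if backend_name != "jax":
        raise RuntimeError(
            f"Keras backend is {backend_name!r}; in this thread "
            "keras.distribution.DataParallel only worked with 'jax'."
        )


# Intended use, before building the distribution:
#   import keras
#   require_jax_backend(keras.backend.backend())
```

Calling it right after importing keras would surface a backend mismatch before the less obvious AttributeError appears.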
Yeah, it's not possible to use the tensorflow (TPU) install together with JAX or PyTorch. Since Keras is calling tensorflow here, the TPU gets loaded twice (once for JAX, once for tensorflow), which breaks things.
Installing tensorflow-cpu and then using JAX (TPU) works, though.
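For reference, the workaround described above could look roughly like the sketch below. It assumes tensorflow-cpu is already installed in place of the TPU build, and that Keras reads the KERAS_BACKEND environment variable at import time. The function names are hypothetical and the imports are deferred so the variable is set first.

```python
import os
import sys


def select_jax_backend() -> bool:
    """Request the JAX backend for Keras.

    Returns False if keras was already imported in this process, in which
    case the setting comes too late and the kernel should be restarted.
    """
    already_imported = "keras" in sys.modules
    os.environ["KERAS_BACKEND"] = "jax"
    return not already_imported


def build_data_parallel():
    """Create and activate a data-parallel distribution over JAX devices."""
    # Deferred imports: KERAS_BACKEND must be set before keras is loaded.
    import jax
    import keras

    devices = jax.devices()  # on a TPU VM this is expected to list the TPU cores
    distribution = keras.distribution.DataParallel(devices=devices)
    keras.distribution.set_distribution(distribution)
    return distribution
```

In a notebook, `select_jax_backend()` would go in the first cell and `build_data_parallel()` after it, before any model is built.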
The following code didn't work when I tried to run it on tpu-vm.