-
Hi, I found that the error `MPMD detected but reload is not supported yet` occurs if I enable `Eager Debug Mode` for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you he…
-
Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
How can I solve this problem?
-
Hi,
I encountered an issue when compiling a scan function with a given length. A minimal example is:
```
import jax
def do_stuff(length):
    def f(carry, x):
        return carry, [None]
j…
-
Using the SimBERT Base pretrained model from https://github.com/ZhuiyiTechnology/pretrained-models, the generated similar sentences are garbled.
`Using TensorFlow backend.
>>> gen_synonyms(u'微信和支付宝哪个好?')
2023-01-06 15:53:34.733136: I tensorflow/core/plat…
-
I'm an author of an ML Framework using XLA.
Per issue #11596 in a recent refresh of my build, XLA build fails if I don't include NCCL. The easy fix would be to include NCCL in my build -- also goo…
-
I'm getting the following error on a Google Colab TPU when trying to use a custom CRF loss function.
I checked https://cloud.google.com/tpu/docs/tensorflow-ops for the FakeParam operation and it looks like o…
-
## 🐛 Bug
PyTorch ResNet18 GPU Training has failed on Colab.
## To Reproduce
I am using [this official notebook](https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab…
-
### Issue Type
Bug
### Have you reproduced the bug with TF nightly?
Yes
### Source
source
### Tensorflow Version
master
### Custom Code
Yes
### OS Platform and Distribut…
-
SPMD trains at normal speed with eight GPUs on a single machine, but the communication overhead increases rapidly across multiple machines.
device is:
gpu:A100 * 8 * 2
spmd strategy …
-
In https://github.com/google/jax/issues/13081 we found that XLA doesn't support SPMD sharding of fast Fourier transform (FFT) ops. It should!