-
Hi Accelerate Team,
I'm looking to use `run_mlm_no_trainer.py` on a TPU v3-128 pod. I have a few questions before I get started.
1. Can I stream the data directly from a GCP bucket…
-
## ❓ Questions and Help
At the end of my training, I'm seeing the following messages printed:
```
2022-10-06 01:31:52.303389: E tensorflow/core/grappler/clusters/utils.cc:87] Failed to get device properties, error code: 4
…
-
Hi there,
Your work is amazing! After success with nvdiffrec, I tried to get nvdiffrecmc running on an A10G 24GB (AWS g5.2xlarge instance). As with nvdiffrec, I've been running nvdiffrecmc in the D…
-
## 🐛 Bug
In the r1.13 release 2vm image, the dlrm test crashes. The error I got is:
```
vm:~$ pip install onnx
Collecting onnx
Downloading onnx-1.12.0-cp37-cp37m-manylinux_2_17_x86_64.manyl…
-
**Question**
I sometimes see "gaps" in some measurements in HA. When I check the EMS-ESP status, I see that it was disconnected around those times. However, when checking the logs I do no…
-
Meshcentral
![Meshcentral](https://user-images.githubusercontent.com/42680639/196946309-3fa9373f-6dd0-4e85-97d1-c54740dfa282.jpg)
My Server
![My Server](https://user-images.githubusercontent.…
-
Here is the outline for the survey in markdown format. I coded up the export pretty quickly, so it's a bit rough; feel free to disregard any weirdness or things that don't make sense. I will also deplo…
-
## ❓ Questions and Help
Hi, I've got an existing PyTorch model training that I'm trying to migrate from my laptop to Colab. I tried to pattern my notebook off the excellent multi-core-alexnet-fashion-mnis…
-
**Please describe the bug**
I'm trying to train a GPT-3 6.7B-parameter model using the code at https://github.com/alpa-projects/alpa/blob/main/examples/gpt2/run_clm_flax.py on 2 nodes, each with 8 V100 GPU…
-
In our current codebase, `DeviceCluster`, `VirtualPhysicalMesh`, `PhysicalDeviceMesh`, and `MeshHostWorker` are related as shown in the following diagram:
![Mesh_Worker Refactor](https://user-i…