-
**Issue**:
I am creating an AWS ECS Fargate Cluster using the Dask Cloudprovider library, following . Although the cluster is successfully created (status is active) and the workers are trigger…
-
After a fresh installation of Julia on CentOS 7.2, I added TensorFlow and ran the "basic usage" test from README.md, which passed. Then, after also installing the Distributions and Printf packages, I tried…
-
Description
===========
We are planning to stream MySQL data to EventHub via CDC using KafkaConnect. I have completed all the required configuration, but the connector gives the following error:
`{"nam…
-
I was able to train a Llama3-8b model with Thunder for a few steps and then save it. However, when I later try to use `litgpt generate` or `litgpt chat` with the saved checkpoint, I get an error about si…
-
**Is your feature request related to a problem? Please describe.**
The asymmetric signing configuration parameters only support a single key. The use of a single key means that rotation will cause …
-
Observed a distributed deadlock while testing recent work on allowing truncate on MX nodes.
Verified that the deadlock does not occur in a single-node (non-distributed) configuration.
- create 2 …
-
## 🐛 Bug
I was trying to launch a distributed job on 2 nodes, each with 4 GPUs, using fairseq-hydra-train. Single-node multi-GPU using fairseq-hydra-train without `torch.distributed.run` can run success…
hannw updated
3 years ago
-
## What is it?
Use OpenTelemetry to add tracing events and top-level counters for exporting to monitors and the health endpoint.
### Value prop
Besides aligning with industry trends, **Correl…
-
Does the project support multi-GPU training?
If yes, how? By default, it only uses one GPU. I am unable to find any parameter that can be used for this purpose.
Snimm updated
1 month ago
-
### 🚀 The feature, motivation and pitch
Today `C10D_NCCL_CHECK_TIMEOUT` implements a while loop that calls `ncclCommGetAsyncError` in a busy looping manner.
At the very least, we should add `sch…