-
### 🐛 Describe the bug
```python
# test.py
import torch
import os
rank = int(os.getenv("LOCAL_RANK"))
torch.distributed.init_process_group('nccl', device_id=torch.device(rank))
g = torch.distri…
```
-
# 🐛 Bug
## To Reproduce
Install torch (directly from https://pytorch.org/get-started/locally/):
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/…
-
### 🐛 Describe the bug
I am running the example code shown in https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/auto_parallel with PyTorch 2.0 (because I need to deploy…
wxthu updated
3 months ago
-
### What is the issue?
```
ollama run llama3.1 (is ok)
```
Switch to a different terminal:
```
ollama run yi-coder
Error: llama runner process has terminated: CUDA error
ollama run llama3.1 (is…
```
-
### 🐛 Describe the bug
I get `RuntimeError: Shared memory manager connection has timed out` when I try to train with more than 0 workers.
I'm running the training inside a Docker container.…
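
A quick diagnostic that may help narrow this kind of report down: PyTorch worker processes exchange tensors through shared memory, and Docker gives a container only 64 MB of `/dev/shm` unless `--shm-size` is set. The sketch below reports how much shared memory is free inside the container. The helper name `shm_free_mib` and the `/dev/shm` path are assumptions for illustration, and nothing in the report confirms that shared-memory size is the cause here.

```python
import shutil


def shm_free_mib(path="/dev/shm"):
    """Return the free space at `path` in MiB, or None if it cannot be read.

    `path` defaults to /dev/shm, the usual Linux shared-memory mount
    (an assumption; the original report does not name it).
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError:
        # Covers a missing mount point as well as permission problems.
        return None
    return usage.free / (1024 * 1024)


if __name__ == "__main__":
    free = shm_free_mib()
    if free is None:
        print("/dev/shm is not available in this environment")
    else:
        print(f"/dev/shm free: {free:.1f} MiB")
```

If the reported value is small, restarting the container with a larger `--shm-size` is one thing to try before changing the training code.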
-
### Your current environment
bash-5.1# python collect_env.py
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used t…
-
**What happened**:
I'm running a StatefulSet for my application. It should run a pod on each edge device.
The StatefulSet is configured to run with a ServiceAccount that has the node-reader role.
When th…
-
**What happened**:
I deployed a StatefulSet that should run on the edge devices.
When I want to delete a pod and reschedule a new one in the same statefulset, I run:
`kubectl delete pod stateful-…
-
**What happened**:
HostAliases does not work when the pod is set to use hostNetwork.
**What you expected to happen**:
HostAliases is injected into `/etc/hosts` correctly in hostNetwork mode.
**How…
-
### Your current environment
```text
PyTorch version: 2.0.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 11 (bullseye) (x…
```
yk287 updated
1 month ago