-
### Your current environment information
onediff 1.2.1.dev6
oneflow 0.9.1.dev20240730+cu121
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c)…
-
I am trying to run inference with the mistralai/Mixtral-8x22B-v0.1 model, but it generates random output with an 8-way tensor parallel setup. Below are the details of the configuration and I think the…
-
### 🐛 Describe the bug
Traceback (most recent call last):
File "issue_onnx2torch_004.py", line 20, in <module>
torch_output = torch_model(torch_input)
File "/opt/conda/envs/tf2onnx/lib/python3…
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N…
```
-
### 🐛 Describe the bug
Consider the following piece of code:
```python
# run me with
# torchrun --nproc_per_node=2 repro.py
import os
import torch
from torch import distributed as dist
…
```
-
# 🐛 Bug
Reading CUDA tensors from a multiprocessing queue causes the child (reader) process to hang.
I discovered that the process hangs on any operation (sum, min, max, mean, etc.), in my example code on "…
-
### 1. Issue or feature description
We are trying to configure an OpenShift environment to use NVIDIA vGPU with the NVIDIA GPU Operator. We followed the steps described in this [guide in NVIDIA vgpu docume…
-
### 🐛 Describe the bug
As reproduced below, `dist.barrier()` fails after calls to `torch.distributed.checkpoint.async_save`.
Interestingly enough, this does not happen if we first call `all_reduce…
-
Hi all!
I am trying my best to get Vulkan working under AlmaLinux 9.4 on an AWS instance with a T4 GPU.
The Vulkan SDK and Vulkan-Samples compile successfully.
The error I am facing now is:
> ./build/linux/app/bin/R…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…
```