-
### Your current environment
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC ve…
```
-
It might be useful to be able to use the CPU cache sizes (L1/L2/L3) in the benchmark,
e.g. when deciding which maximal range to use.
I'm not sure how that should be exposed though.
It should be reachable wi…
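On Linux, the sizes are at least readable from sysfs; a minimal sketch of a hypothetical helper (Linux-only paths, not an existing API of the project):

```python
import os

def cpu_cache_sizes(base="/sys/devices/system/cpu/cpu0/cache"):
    """Return e.g. {"L1": "32K", "L2": "1024K", "L3": "36864K"} from Linux sysfs."""
    sizes = {}
    for entry in sorted(os.listdir(base)):
        if not entry.startswith("index"):
            continue
        path = os.path.join(base, entry)
        with open(os.path.join(path, "level")) as f:
            level = f.read().strip()
        with open(os.path.join(path, "size")) as f:
            size = f.read().strip()  # human-readable, e.g. "32K"
        # index0/index1 are both level 1 (data/instruction caches); keep the first.
        sizes.setdefault(f"L{level}", size)
    return sizes

print(cpu_cache_sizes())
```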
-
### 🐛 Describe the bug
The case comes from xpu [triton-benchmark](https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/benchmarks/triton_kernels_benchmark/flash_attention_fwd_benc…
-
### 🐛 Describe the bug
When running the code below, the Python interpreter hangs:
```python
from multiprocessing import get_context
from torch import randn

def worker(i):
    result = randn(1)
p…
```
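For comparison, a spawn-based variant that usually avoids this class of hang; a hypothetical completion, since the original snippet is truncated:

```python
from multiprocessing import get_context
from torch import randn

def worker(i):
    return randn(1)

if __name__ == "__main__":
    # "spawn" starts workers in fresh interpreters instead of forking the
    # parent, so torch's thread state is not inherited mid-flight.
    with get_context("spawn").Pool(2) as pool:
        print(pool.map(worker, range(2)))
```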
-
In opensora/serve/gradio_web_server.py, the following is referenced:
`text_encoder = MT5EncoderModel.from_pretrained("/storage/ongoing/new/Open-Sora-Plan/cache_dir/mt5-xxl", cache_dir=args.cache_dir,
…
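A hedged sketch of the less machine-specific alternative, assuming the local directory mirrors the `google/mt5-xxl` hub checkpoint (with `args.cache_dir` replaced by a literal for self-containment):

```python
from transformers import MT5EncoderModel

# Hypothetical: load by hub id rather than an absolute path that only
# exists on the original author's machine.
text_encoder = MT5EncoderModel.from_pretrained(
    "google/mt5-xxl",         # assumption: hub counterpart of the local dir
    cache_dir="./cache_dir",  # stand-in for args.cache_dir
)
```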
-
### 🐛 Describe the bug
When compiling flex_attention, any backward operation crashes with the error `BypassFxGraphCache: Can't cache HigherOrderOperators.`.
Without compile it's fine, but slow. I …
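A minimal repro sketch of what the report describes (hypothetical shapes; assumes a torch build that ships `torch.nn.attention.flex_attention` and a CUDA device):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

q, k, v = (
    torch.randn(1, 4, 128, 64, device="cuda", requires_grad=True)
    for _ in range(3)
)

compiled = torch.compile(flex_attention)
out = compiled(q, k, v)
# Per the report, the backward pass is what triggers
# `BypassFxGraphCache: Can't cache HigherOrderOperators.`
out.sum().backward()
```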
-
### What is the issue?
I offloaded 47 out of 127 layers of Llama 3.1 405b q2 on an M3 Max with 64 GB of RAM.
When I run inference, the memory usage shows only about 8 GB, while the cached memory…
-
### Description
In https://github.com/dotnet/runtime/issues/48937 it was found that my gen0 budget was 32 MiB. Investigating this further, I believe it may even be as high as 64 MiB, which causes the …
-
### 🐛 Describe the bug
Running torch.divide with 0 as the denominator does not throw ZeroDivisionError on GPU, nor does it result in `inf`. Executing on CPU throws ZeroDivisionError as expected.
…
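A repro sketch consistent with that description (assumes integer dtypes, for which CPU division by zero raises; the truncated report does not show the exact dtypes):

```python
import torch

num = torch.tensor([1], dtype=torch.int64)
den = torch.tensor([0], dtype=torch.int64)

# CPU: raises RuntimeError: ZeroDivisionError
try:
    torch.divide(num, den, rounding_mode="trunc")
except RuntimeError as e:
    print("CPU:", e)

# GPU: per the report, completes silently with neither an error nor inf
if torch.cuda.is_available():
    print("GPU:", torch.divide(num.cuda(), den.cuda(), rounding_mode="trunc"))
```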
-
CI: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/4081#01918cdc-edef-40fc-9a36-c1ec173e5a63
Platform: multiple
Logs:
```
ERROR: Traceback (most recent call last):
Er…
```