large-kernels Search Results

1000+ results
for large-kernels


iree-org/iree #17078

Umbrella tracking bug for compilation/execution time issue o…

As turbine LLM (and the different quantization schemes implemented and lowered as part of it) ramps up, the CPU backend needs to be able to compile and execute sample kernels in a reasonable time frame. Se…

MaheshRavishankar updated 6 months ago
5
rapidsai/build-planning #4

Consider statically linking the CUDA runtime

Currently RAPIDS libraries support static linkage to cudart via a CMake flag `CUDA_STATIC_RUNTIME`. This flag is leveraged by wheel builds and by the Spark-RAPIDS JNI (specifically for cudf), but it i…

vyasr updated 6 months ago
1
pytorch/ao #47

[RFC] Plans for torchao

### Summary Last year, we released [pytorch-labs/torchao](https://github.com/pytorch-labs/ao) to provide acceleration of Generative AI models using native PyTorch techniques. Torchao added support …

supriyar updated 6 months ago
21
Lightning-AI/lightning-thunder #348

Distributed and Bucketing Performance Improvements

## 🐛 Bug This is a lengthy issue/post detailing my observations with our distributed and bucketing performance. Some of these are actionable items and some are just observations to be aware of. …

parthmannan updated 5 months ago
2
MiCode/Xiaomi_Kernel_OpenSource #6932

Kernel flashing on ishtar - A/B partitioned devices with dyn…

Hello, Kernel Developers! I hope that somebody can help. :) I'm trying to flash the kernel I built for ishtar (Xiaomi 13 Ultra). I have fixed a few errors that I got with the Xiaomi GitHub r…

lupo-ch updated 4 months ago
2
tbenthompson/cutde #14

Chunk all the CUDA calls

Currently, the screen will freeze if a user runs a large cutde calculation on the GPU that also drives their monitors. This is avoided by chunking the calculation in `disp_blocks` and `disp_aca`. Chun…

tbenthompson updated 3 years ago
1
halide/Halide #2054

[tutorial] composite compute kernels

It is not clear to me what the recommended way is to combine multiple computation kernels into one larger operation and reuse the code. For example, if I have blur and gradient implemented separately…

palindromoroz updated 7 years ago
4
apache/arrow #38386

[C++][Compute] Support Recordbatch sorting for dictionary ty…

### Describe the enhancement requested Hello. While implementing join operation support for the Dictionary type, I encountered the following message. I am attempting to support the Dictionary ty…

llama90 updated 1 year ago
1
cp2k/dbcsr #795

DBCSR performs very poorly on GH200, when there are large bl…

I am currently testing CP2K on the new CSCS machines with GH200 chips. In most cases, DBCSR behaves well (e.g. with the `benchmarks/QS/H2O-XXX.inp` tests). However, when large block sizes are involved…

abussy updated 4 months ago
14
ammarhakim/gkylzero #395

[DR]: GPU hackathon gk-g0-app work on parallelizing over com…

This design review encompasses the ongoing work in the branch https://github.com/ammarhakim/gkylzero/tree/gk-g0-app-gpu-hack2024 for the GPU hackathon. The main restructuring of the code is a targ…

JunoRavin updated 4 months ago
1
