-
oobabooga text-generation-webui is used as the inference engine (prompts entered directly into the oobabooga UI produce normal results, but chat-ui does something weird, as shown below), MongoDB setup
_**Prom…
-
We want to add support for variable-length sequences to the cuDNN RNN operator, as we cannot 'fake' support via masking for LSTM (the cuDNN operator does not return a history of cell states) and bidir…
-
I am studying the Tensor Core GEMM codegen in IREE, and I notice a big performance gap between IREE and cuBLAS. For example, when [M, N, K] is [1024, 512, 1024], I use the following script to run the GEMM:
``…
-
# Turing
|Brand Name|GPU Architecture|Tensor Core|NVIDIA CUDA® Cores|TensorFLOPS|Single-Precision|Double-Precision|Mixed-Precision(FP16/FP32)|INT8|INT4|GPU Memory|Interconnect Bandwidth|System Interf…
-
When I use `mapslices(f,a,dims)` on a CuArray, a warning appears telling me that scalar operations on the GPU are inefficient.
```julia
a=CUDA.rand(3,4,5)
b=CUDA.rand(2,3)
maps…
-
I tried running the matrix multiplication example from the tutorial. I am using a 1060 GPU with driver version 465.31 and CUDA 11.3.
[log.txt](https://github.com/openai/triton/files/6959248/log.txt)
-
Hi,
I recently read this blog post, https://cloud.google.com/blog/topics/developers-practitioners/building-large-scale-recommenders-using-cloud-tpus, and am very interested in it. I am wondering about the raw performan…
-
Hi all, I'm new to xformers and am studying the `examples/llama_inference/generate.py` file.
I traced it here:
```python
def _memory_efficient_attention_forward(
inp: Inputs, op: Optional[Type…
-
Recently @Hzfengsy brought up a question regarding affine binding and related schedule primitives.
After a brief discussion, I am writing down my thoughts here so we can discuss them further.
## Intro case
```pytho…
-
Hi, really good work, I appreciate it a lot.
I am curious whether Triton can support 1-bit acceleration for MMA. Could it also be applied further to 1-bit GPTQ?
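
For reference, a 1-bit matrix multiply is usually expressed as XOR plus popcount over packed bits. The NumPy sketch below only illustrates that arithmetic on the CPU. It assumes a {-1, +1} encoding packed as {0, 1} bits, and the helper names `pack_signs` and `binary_matmul` are hypothetical, not part of Triton or GPTQ.

```python
import numpy as np

def pack_signs(x):
    """Pack a {-1, +1} matrix row-wise into uint8 words (+1 -> bit 1, -1 -> bit 0)."""
    return np.packbits(x > 0, axis=1)

def binary_matmul(a, b):
    """Compute a @ b.T for {-1, +1} matrices using XOR + popcount on packed bits."""
    k = a.shape[1]
    pa, pb = pack_signs(a), pack_signs(b)
    # XOR sets a bit wherever the two signs differ. Padding bits are 0 in
    # both operands, so they never count as a mismatch.
    diff = pa[:, None, :] ^ pb[None, :, :]
    # Popcount emulated by unpacking the bits and summing them.
    mismatches = np.unpackbits(diff, axis=2).sum(axis=2).astype(np.int64)
    # Each matching position contributes +1 and each mismatch -1.
    return k - 2 * mismatches

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=(4, 37))   # K = 37 is deliberately not a multiple of 8
b = rng.choice([-1, 1], size=(5, 37))
result = binary_matmul(a, b)
reference = a @ b.T                     # plain integer matmul for comparison
```

The packed result should agree with the integer matmul `reference`. A hardware 1-bit MMA would perform the same XOR and popcount on tensor-core fragments instead of NumPy arrays.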