-
Example:
Input: A:{c=12 h=64 w=26}, B:{c=12 h=26 w=64}
Output: C:{c=1 h=64 w=64}
The output is reduced to just A[0] * B[0]. I'd like the channel dimension to be supported; the expected result should be C:{c=12 h=64 w=64}.
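The per-channel behavior being requested matches a batched matrix multiply: for each channel c, C[c] = A[c] @ B[c]. A NumPy sketch of the expected semantics, using the shapes from the example above (NumPy is used here only to illustrate the intended result, not the library under discussion):

```python
import numpy as np

# Shapes from the example: A is (c=12, h=64, w=26), B is (c=12, h=26, w=64).
A = np.random.rand(12, 64, 26).astype(np.float32)
B = np.random.rand(12, 26, 64).astype(np.float32)

# Batched matmul: each channel is multiplied independently,
# C[c] = A[c] @ B[c], giving C with shape (12, 64, 64).
C = np.matmul(A, B)
print(C.shape)  # (12, 64, 64)
```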
-
Running testing/gemm.c with only sgemm (the dgemm code commented out) and larger matrices:

```
int loop = 0;
for (loop = 1; loop < 2; loop++) {
    int M = 10000;
    int N = M;
    …
```
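A quick way to sanity-check sgemm throughput outside the C harness is a NumPy sketch: a float32 matmul dispatches to the sgemm of whichever BLAS NumPy links against. M is reduced from 10000 here so the sketch runs quickly; the GFLOP/s formula assumes the standard 2*M^3 flop count for a square GEMM:

```python
import time
import numpy as np

M = 1000  # smaller than the C harness's 10000 to keep the sketch fast
A = np.random.rand(M, M).astype(np.float32)
B = np.random.rand(M, M).astype(np.float32)

t0 = time.perf_counter()
C = A @ B  # single-precision GEMM via the linked BLAS
dt = time.perf_counter() - t0
print(f"M={M}: {dt:.3f} s, {2.0 * M**3 / dt / 1e9:.1f} GFLOP/s")
```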
-
```
RuntimeError: D:\a\_work\1\s\onnxruntime\python\onnxruntime_pybind_state.cc:743 onnxruntime::python::CreateExecutionProviderInstance CUDA_PATH is set but CUDA wasn't able to be loaded. Please ins…
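When this error appears, a useful first check is whether the installed onnxruntime build exposes the CUDA provider at all, and whether CUDA_PATH points at a real install (a diagnostic sketch; the CPU-only `onnxruntime` and GPU `onnxruntime-gpu` packages are easy to mix up):

```python
import os
import importlib.util

# Is any onnxruntime package importable?
spec = importlib.util.find_spec("onnxruntime")
print("onnxruntime installed:", spec is not None)

if spec is not None:
    import onnxruntime as ort
    # If 'CUDAExecutionProvider' is missing here, the CPU-only package is installed.
    print(ort.get_available_providers())

# The error mentions CUDA_PATH: verify it points at an existing CUDA directory.
cuda_path = os.environ.get("CUDA_PATH")
print("CUDA_PATH =", cuda_path, "exists:", bool(cuda_path and os.path.isdir(cuda_path)))
```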
-
Hi, I want to use the gemx program to compute matrix multiplication on an FPGA.
I'd like to know how to measure the execution time of each of these steps:
1) read data from DDR to FPGA
2) compute …
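One common pattern is to wrap each stage in a wall-clock timer so the transfer and compute times can be reported separately. A generic Python sketch of that pattern follows; the three stage functions are placeholders standing in for the actual GEMX host-code calls, which are not shown in the original post:

```python
import time

def timed(label, fn, *args):
    """Run fn, print its wall-clock time, and return its result."""
    t0 = time.perf_counter()
    out = fn(*args)
    print(f"{label}: {(time.perf_counter() - t0) * 1e3:.2f} ms")
    return out

# Placeholder stages standing in for the real GEMX host-code calls.
def read_ddr_to_fpga(data):   return data       # 1) transfer input
def compute_on_fpga(data):    return sum(data)  # 2) kernel execution
def read_result_back(result): return result     # 3) transfer output

data = list(range(1000))
buf = timed("DDR -> FPGA", read_ddr_to_fpga, data)
res = timed("compute", compute_on_fpga, buf)
out = timed("FPGA -> host", read_result_back, res)
print(out)  # 499500
```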
-
Hello DP,
Good work!
Currently mi-glas has L1 and half of L3. It would be awesome if we unified DBLAS and GLAS.
The advantages of such an integration:
1. A ready-to-use, full-featured BLAS!
2. BLA…
-
On https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul
the example runs fine with the existing small m, n, k, but unfortunately when I change m, n, k to 8192, I get a runti…
-
Is anyone interested in exploiting sparsity to accelerate DNNs?
I am working on the fork https://github.com/wenwei202/caffe/tree/scnn and currently, on average, achieve ~5x CPU and ~3x GPU layer-wi…
-
I use AWQ to quantize Llama 2 70B-chat with:
```
CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" python quantize_llama.py
```
The contents of quantize_llama.py:
```
from awq import AutoAWQForCausalLM
from tr…
```
-
Hi there,
I am checking the `TC - tensor core usage` counter for a standard ResNet-50 model, and although I see Tensor Core kernels being invoked, their corresponding `TC` counter still shows `-`. Am I do…
-
Hello,
I have pretrained a model with Hugging Face and attempted to deploy it using the TRTLLM-Triton Server method as documented [here](https://github.com/k2-fsa/sherpa/blob/master/triton/whisper/mod…