-
Does it make sense to extend tensordot to support more than two input arrays?
The definition and API seem to be amenable to this extension, though I can't say anything about the implementation.
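A multi-input `tensordot` can already be emulated by folding the existing two-array `np.tensordot` over a list, which suggests the API extension is at least well-defined. A minimal sketch, where `tensordot_chain` is a hypothetical helper (not part of NumPy) that contracts adjacent arrays pairwise:

```python
import numpy as np
from functools import reduce

# Hypothetical multi-input tensordot: contract a chain of arrays
# pairwise using the existing two-array np.tensordot.
def tensordot_chain(arrays, axes=1):
    return reduce(lambda x, y: np.tensordot(x, y, axes=axes), arrays)

a = np.random.rand(2, 3)
b = np.random.rand(3, 4)
c = np.random.rand(4, 5)

out = tensordot_chain([a, b, c])
assert out.shape == (2, 5)
# With axes=1 and 2-D inputs this reduces to a chained matrix product.
assert np.allclose(out, a @ b @ c)
```

A real implementation would presumably also want to choose the contraction order for efficiency, as `np.einsum` does with `optimize=True`, rather than always folding left-to-right.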
-
This is supported on the rocMLIR side.
This basically amounts to adding pow to the allowed list of pointwise ops in the fuse_mlir pass of MIGraphX, along with a verify test.
DoD:
* A verify test with pow operator trail…
-
Rename at least the functions that are [exported](https://github.com/libxsmm/libxsmm/blob/main/.abi.txt) and do not adhere to best practices and typical API conventions. For example:
* Replace term "cr…
-
### Describe the issue
We are trying to quantize our proprietary model, based on RetinaNet, using TensorRT's model optimization library. The following warning was raised: **"Please consider running pre…
-
As we start to integrate more advanced hybrid methods on the GPU, we are finding that [most numpy functions]() are not supported on the GPU. I think we have two options here: (1) reimplement all operati…
-
For 2-D inputs, `np.matmul` and `np.dot` are semantically the same, but I've found that in some cases `matmul` can be much slower, even though the documentation for `np.dot` says `matmul` is preferred f…
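A quick way to check this locally is to time both calls on the same 2-D inputs; any gap depends on the NumPy build and the BLAS backend it dispatches to, so the sketch below makes no claim about which one wins:

```python
import timeit
import numpy as np

a = np.random.rand(512, 512)
b = np.random.rand(512, 512)

# For 2-D inputs the two functions compute the same product.
assert np.allclose(np.dot(a, b), np.matmul(a, b))

# Time each call; results vary with the installed BLAS backend.
t_dot = timeit.timeit(lambda: np.dot(a, b), number=20)
t_matmul = timeit.timeit(lambda: np.matmul(a, b), number=20)
print(f"dot: {t_dot:.4f}s  matmul: {t_matmul:.4f}s")
```

Reporting both numbers together with `np.show_config()` output would make it easier to tell whether the slowdown is in `matmul` itself or in the BLAS path it takes.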
-
Hardware environment:
[root@iZ6we55nj5ujtoxm12k2wwZ ~]# nvidia-smi
Fri Sep 27 15:14:37 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 …
-
Hello,
We are using the latest main TensorRT-LLM and a container built with the TensorRT backend to run Mixtral. Generation doesn't stop and runs until max_tokens is reached. Passing "end_id": 2 doesn't help.
…
-
### Problem Description
On the Llama3 70B proxy model, training stalls and the GPUs core-dump. The core dumps are 41 GB per GPU, so I am unable to send them. It is probably easier for y'all to reprod this er…
-
Hello, and thank you for this truly impressive work.
I'm asking because I became curious about a speed benchmark against the marlin kernel.
The [marlin kernel](https://github.com/IST-DASLab/marlin) is one of the 4-bit CUDA kernels and is claimed to be highly optimized.
Could you possibly benchmark against this kernel and compare…