-
The following test currently fails:
```c++
TEST_F(MatmulSchedulerTest, SelfMappingErrorSmemEpilogue1dBias) {
NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(7, 5, 9, 0);
Fusion fusion_obj;
Fusion* fusion = …
-
I'm currently having issues attempting to quantize, save, and then load the model using HF Transformers.
Is there any known working method for quantizing Aria (preferably to 4bit)?
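For reference, a common 4-bit path in Transformers is bitsandbytes quantization via `BitsAndBytesConfig`. Whether this works for Aria specifically is not confirmed here; the model id and save path below are assumptions, and the surrounding calls are a configuration sketch rather than a verified recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes NF4 quantization config; compute in fp16.
# All values here are illustrative, not a tested recipe for Aria.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# "rhymes-ai/Aria" is an assumed model id; adjust to your checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

# Recent transformers/bitsandbytes versions can serialize 4-bit weights;
# the saved directory can then be reloaded with from_pretrained.
model.save_pretrained("aria-4bit")
```

Note that saving 4-bit bitsandbytes models requires reasonably recent versions of both transformers and bitsandbytes; older versions raise an error on `save_pretrained` for quantized models.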
-
1. In some projects, Gemm+AllReduce needs to be used. I would like to know whether Gemm+AllReduce can be implemented, and what the possible methods and issues are. Thanks.
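The math behind Gemm+AllReduce is straightforward: shard the GEMM's inner (K) dimension across ranks, have each rank compute a partial product, and sum the partials with an all-reduce. A NumPy simulation of this on a single process (illustrative shapes, no real communication library involved):

```python
import numpy as np

# Tensor-parallel GEMM + AllReduce, simulated with NumPy.
# The K dimension is split across "ranks"; each rank computes a partial
# GEMM, and an all-reduce (sum) of the partials gives the full result.
rng = np.random.default_rng(0)
M, K, N, world_size = 4, 8, 6, 2

A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))

# Split A's columns and B's rows into per-rank shards.
A_shards = np.split(A, world_size, axis=1)
B_shards = np.split(B, world_size, axis=0)

# Each rank's local GEMM produces a partial (M, N) result.
partials = [a @ b for a, b in zip(A_shards, B_shards)]

# AllReduce(sum): every rank ends up with the full product.
reduced = np.sum(partials, axis=0)

assert np.allclose(reduced, A @ B)
```

In a real multi-GPU setting the final sum would be a library all-reduce (e.g. NCCL); the main practical issues are overlapping the reduction with the GEMM and the numerical effect of summing partial products in a different order.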
-
@efrantar
Awesome work -- always enjoy your research on and implementation of efficient model inference.
I was hoping that you could shed some light on the logic of the [packing](https://github…
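Independent of the exact layout that repository uses, a typical quantized-weight packing scheme stores eight 4-bit values in one 32-bit word via shifts and masks. A minimal NumPy sketch of that general idea (the bit ordering here is an assumption, not necessarily the repo's layout):

```python
import numpy as np

bits = 4
vals_per_word = 32 // bits  # 8 four-bit values per int32

def pack(q):
    # q: 1-D array of ints in [0, 15]; length must be a multiple of 8.
    q = q.astype(np.uint32).reshape(-1, vals_per_word)
    packed = np.zeros(q.shape[0], dtype=np.uint32)
    for i in range(vals_per_word):
        # Place value i in bits [4*i, 4*i + 4) of the word.
        packed |= q[:, i] << (bits * i)
    return packed.astype(np.int32)

def unpack(packed):
    # Reinterpret as unsigned so right shifts are well behaved.
    packed = packed.astype(np.uint32)
    out = np.empty((packed.shape[0], vals_per_word), dtype=np.uint32)
    for i in range(vals_per_word):
        out[:, i] = (packed >> (bits * i)) & 0xF
    return out.reshape(-1).astype(np.int64)

q = np.arange(16) % 16  # values 0..15, two packed words
assert np.array_equal(unpack(pack(q)), q)
```

The kernel-side unpacking then mirrors the same shift-and-mask pattern, which is why the packing order matters for memory-coalesced loads.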
-
Hello, I am currently using auto_scheduler to automatically tune a naive gemm operator. However, after the tuning is completed, I checked the corresponding assembly code and found that the registers r…
-
# Summary
I believe there are some missing `gemm_batch` implementations; looking at the oneMKL docs, it seems this should be supported. A `gemm_batch` with two half matrices as input, a float matrix out, an…
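For reference, the mixed-precision semantics in question (half inputs, float output, accumulation in float) can be modeled with NumPy; the shapes below are illustrative:

```python
import numpy as np

# Reference semantics for a mixed-precision gemm_batch:
# float16 A and B, float32 C, with accumulation done in float32.
rng = np.random.default_rng(0)
batch, M, K, N = 3, 4, 5, 6
A = rng.standard_normal((batch, M, K)).astype(np.float16)
B = rng.standard_normal((batch, K, N)).astype(np.float16)

# Up-convert before the multiply so products accumulate in float32,
# which is what the half-in/float-out GEMM variant is expected to do.
C = np.matmul(A.astype(np.float32), B.astype(np.float32))
assert C.dtype == np.float32
```

This also shows why the variant is useful: the inputs stay compact in fp16 while the accumulator avoids fp16 rounding during the K-dimension sum.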
-
**What is your question?**
Trying to understand the behavior of Gemm with a column-broadcasted bias vector epilogue.
When defining a device `GemmUniversalWithBroadcast` with the following config:
…
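As a plain-math reference for what a column-broadcast bias epilogue computes (independent of any particular CUTLASS configuration): a length-M vector is added across all N columns of the M x N result. A NumPy sketch with illustrative shapes:

```python
import numpy as np

# Column-broadcast bias: a length-M vector added to every column of the
# M x N GEMM result, i.e. D[i, j] = (A @ B)[i, j] + bias[i].
rng = np.random.default_rng(0)
M, K, N = 4, 5, 3
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))
bias = rng.standard_normal(M)

D = A @ B + bias[:, None]  # bias broadcast across columns
assert D.shape == (M, N)
```

A row-broadcast bias would instead be a length-N vector added to every row; mixing up which axis the epilogue broadcasts over is a common source of confusion with these configs.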
-
**What is your question?**
Hey folks,
I am having a hard time understanding the following problem.
I exported a PyTorch extension using the following code:
```python
dtype = torch.int32
type_A = to…
-
I want to use te's comm-gemm-overlap module to perform multi-node training; however, the README says this module only supports a single node. Does te have any plans for multi-node support? And what effort…
-
**Describe the bug**
When converting a TF/keras model trained with F64, tf2onnx warns about a lack of float64 support for GEMM by the runtime:
```
onnx_model, _ = tf2onnx.convert.from_keras(m…