-
In [e2e matmul tests](https://github.com/iree-org/iree/tree/main/tests/e2e/matmul) we have correctness tests for the `SPIRVBaseVectorize` pipeline. We also need to add support for `SPIRVMatmulPromoteVecto…
-
@jasonmayes raises the question of [whether WebGPU exposes the right API surface needed to support ML frameworks interactions with GPUs](https://www.w3.org/2020/06/machine-learning-workshop/talks/oppo…
-
This issue has been fixed downstream:
https://github.com/intel/intel-xpu-backend-for-triton/pull/835
The fix still needs to be upstreamed to Triton.
-
The current state of the specification can be found here: https://github.com/IntelPython/DPPY-Spec/blob/draft/partitioned/Partitioned.md
-
SIMT deadlock example:
A: *mutex = 0;
B: while (atomicCAS(mutex, 0, 1) != 0);  // atomicCAS returns the old value; spin until it was 0 (lock acquired)
C: // critical section
   atomicExch(mutex, 0);  // release the lock
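Why this loop can hang a warp is easier to see in a toy model. The Python sketch below is not GPU code. It assumes a simplified lock-step scheme in which a thread that leaves loop B is parked at the reconvergence point before C until every thread in the warp has left the loop. The function name `simulate_warp` and the iteration cap are hypothetical, chosen only for illustration.

```python
def simulate_warp(num_threads, max_iterations=100):
    """Toy lock-step model of a warp running the spin-lock loop above.

    Assumption: threads that exit loop B wait at the reconvergence point
    (just before C) until all threads of the warp have exited the loop.
    """
    mutex = 0                             # A: *mutex = 0
    spinning = list(range(num_threads))   # threads still inside loop B
    parked = []                           # threads waiting to reconverge

    for _ in range(max_iterations):
        still_spinning = []
        for tid in spinning:              # B: each active thread tries the CAS
            old = mutex
            if old == 0:
                mutex = 1                 # lock acquired, thread leaves the loop
                parked.append(tid)
            else:
                still_spinning.append(tid)  # lock busy, thread loops again
        spinning = still_spinning
        if not spinning:
            # Every thread has left the loop, so the warp reconverges,
            # runs the critical section C, and releases the lock.
            mutex = 0                     # atomicExch(mutex, 0)
            return "completed"
    # The lock holder is parked and never reaches atomicExch,
    # so the remaining threads spin until the iteration cap.
    return "deadlock"


if __name__ == "__main__":
    for n in (1, 2, 32):
        print(n, simulate_warp(n))
```

In this model a single thread finishes, while any warp with two or more threads hits the iteration cap: the thread holding the lock cannot release it until the others stop spinning, and they cannot stop until it releases.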
-
Looking into our [FlashAttention-2](https://github.com/intel/intel-xpu-backend-for-triton/blob/perf_attn/python/tutorials/06-fused-attention.forward.py) benchmark, we see that reduction codegen for Xe…
-
- [Kernel fusion: the "accelerator" of deep learning](https://www.doit.com.cn/p/306236.html)
- [TVM](https://github.com/dmlc/tvm)
- [NNVM-fusion]()
- [XLA Beginner](https://github.com/TensorflowXLABeginner/XLA-Report)
- [Deep learning, all hard…
-
I am reviewing https://github.com/openjournals/joss-reviews/issues/3537 and was unable to test on my Apple M1 (my machine is not set up to compile for x86, even though Rosetta 2 supports executing such bina…
-
Better performance with divergent kernels and more localized data sharing is a strength of modern SIMT hardware, making massively parallel algorithm considerations viable for more fields than before. …
-
Shape (m\*k\*n)
4096\*4096\*4096
8192\*8192\*8192
1024\*28672\*8192
3072\*4096\*3072
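
To compare timings across these shapes, it can help to convert each one into a floating-point operation count. The sketch below assumes the shapes describe matrix multiplications of an m×k matrix by a k×n matrix, and uses the common convention of 2·m·k·n FLOPs (one multiply and one add per inner-product term). The helper names `matmul_flops` and `tflops` are hypothetical, and the `seconds` argument is a measured runtime that you would supply yourself.

```python
# Shapes from the list above, read as (m, k, n).
SHAPES = [
    (4096, 4096, 4096),
    (8192, 8192, 8192),
    (1024, 28672, 8192),
    (3072, 4096, 3072),
]


def matmul_flops(m, k, n):
    """FLOPs for an (m x k) @ (k x n) product, counting 2 per multiply-add."""
    return 2 * m * k * n


def tflops(m, k, n, seconds):
    """Achieved throughput in TFLOP/s for a measured runtime in seconds."""
    return matmul_flops(m, k, n) / seconds / 1e12


if __name__ == "__main__":
    for m, k, n in SHAPES:
        print(f"{m}*{k}*{n}: {matmul_flops(m, k, n) / 1e9:.1f} GFLOP")
```

Because the count depends only on the product m·k·n, swapping the roles of the three dimensions changes the memory access pattern but not the FLOP total.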