-
While running this in Google Colab, I get the following error. I am using the Pro version of Google Colab.
XlaRuntimeError Traceback (most recent call last)
[](https://lo…
-
Andrej Karpathy has just ~upstaged me~ released llm.c, which contains some highly optimised CUDA kernels. If we incorporate these into tricycle, we can probably get a significant performance boost for oper…
-
running integrity checks
No embeddings have been generated for _utils
datafolder is: E:\lh\SpaceRL-KG-master/datasets/COUNTRIES
generating embeddings for dataset COUNTRIES and models ['TransE_l2']
…
-
### Motivation and description
Wondering what kind of speedup can be achieved by writing GPU kernels for optimizers.
Take a look at @pxl-th's implementation of Adam below
https://github.com/Jul…
-
I'd like to add support for `rust-gpu` in the not-so-distant future. I have some questions while I figure out the plan:
1. Would it make sense to have shaders written with `rust-gpu` to be hung off…
-
Hello,
I'm trying to compare training speed between using 1 node and using 2 nodes (one GPU per node).
With 1-node training, back-propagation (calculating gradients and updating parameters) takes abo…
-
## Description
Consider adding an additional FusedCrossEntropyLoss kernel to the FOAK set of kernels, given the improvement seen when using it in earlier tests (see Background below).
Considerati…
-
### Describe the feature
https://github.com/linkedin/Liger-Kernel
Liger Kernel is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU train…
-
Hi,
I've been testing Trilinos and came across a broken kk unit test on H100s with CUDA 12.4. I have not tried to reproduce the broken test standalone, but figured I'd report it. See configuration 1…
-
```
#include
#include "hip/hip_runtime.h"
// 1. If N is set to at most 1024, then the sum is OK.
// 2. Set N past 1024, which exceeds the number of threads per block, and then all iterations of sum resu…