-
I tried to extract a LoRA from `Xwin-LM/Xwin-Math-70B-V1.1` and got this:
```
delta_weight = new_weight - base_weight
~~~~~~~~~~~^~~~~~~~~~~~~
RuntimeError: The size of tensor…
```
-
##### Purpose
Introducing a secondary tensor operation DSL (Domain Specific Language) written in and optimised for the Scala language and its various compilers (the most common of which are the JVM-based scalac 2.…
-
Hi there,
I've used Megatron to train a 13B GPT model on an H100 machine.
Before I enabled the FP8 Transformer Engine, training ran at about 0.34 s/step.
After I enabled the fp8 transformer engi…
-
## 🐛 Bug
Currently, one can construct a CSR tensor that has equal column indices in the same row. In principle, this corresponds to an "uncoalesced CSR tensor", which we are not supposed to have. In…
pearu updated
3 years ago
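
To make the CSR report above concrete, here is a minimal pure-Python sketch of a check for repeated column indices within a row. The function name `has_duplicate_columns` and the plain-list inputs are assumptions for illustration; this is not a PyTorch API, and it only mirrors the usual CSR layout of row pointers plus column indices.

```python
def has_duplicate_columns(crow_indices, col_indices):
    """Return True if any row of a CSR layout repeats a column index.

    crow_indices: row pointers of length n_rows + 1 (hypothetical plain list).
    col_indices: column index of each stored element (hypothetical plain list).
    """
    for row in range(len(crow_indices) - 1):
        start, end = crow_indices[row], crow_indices[row + 1]
        row_cols = col_indices[start:end]
        # A repeated column index makes the set smaller than the slice.
        if len(set(row_cols)) != len(row_cols):
            return True
    return False


# Row 0 stores column 1 twice, so this layout is "uncoalesced".
uncoalesced = has_duplicate_columns([0, 2, 3], [1, 1, 0])
# Row 0 stores columns 0 and 1, row 1 stores column 0: no repeats.
coalesced = has_duplicate_columns([0, 2, 3], [0, 1, 0])
```

A validation step along these lines could reject such input at construction time, though whether to reject or to sum the duplicates is a design decision the sketch does not settle.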
-
## 🐛 Bug
`torch.topk` with `sorted=True` doesn't return a result that is consistent across different values of `k` when dealing with duplicate values. The position of duplicated values in the returned s…
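
As an illustration of what consistency across `k` could look like, here is a pure-Python sketch that breaks ties by the lower index, so the result for a smaller `k` is always a prefix of the result for a larger `k`. The helper `stable_topk` is hypothetical and is not how `torch.topk` is implemented; it only shows the property under discussion.

```python
def stable_topk(values, k):
    """Return (top_values, top_indices) for the k largest numeric values.

    Ties are broken by ascending index, so the output for k is a prefix
    of the output for any larger k.
    """
    # Sort indices by descending value, then by ascending position.
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    top = order[:k]
    return [values[i] for i in top], top


values = [3, 1, 3, 2, 3]
_, idx2 = stable_topk(values, 2)
_, idx3 = stable_topk(values, 3)
# idx2 is a prefix of idx3 because duplicates are ordered deterministically.
prefix_consistent = idx3[:2] == idx2
```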
-
a nitpick: the weights vector is summed up and tested for summing to 1. if you're gonna sum it up anyway, why not allow arbitrary positive weights and normalize the weight vector?
oml/src/lib/stats/…
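
The suggestion above could be sketched as follows, in Python rather than the library's own language. The function name `normalize_weights` is an assumption for illustration and does not come from the referenced code: it accepts arbitrary positive weights and rescales them to sum to 1 instead of requiring that up front.

```python
def normalize_weights(weights):
    """Rescale positive weights so that they sum to 1."""
    if not weights:
        raise ValueError("weights must be non-empty")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must all be positive")
    total = sum(weights)  # the sum is computed anyway, so reuse it
    return [w / total for w in weights]


normalized = normalize_weights([1, 1, 2])
```

Since the sum is already needed for the existing check, normalizing costs one extra division per element.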
-
Supporting unevaluated operations like `Mul(3, 4, evaluate=False)` causes a lot of headaches (for instance issue #5783). I think that the root cause of this is that we try to represent 2 very differ…
rlamy updated
2 years ago
-
**Original report ([archived issue](https://osrf-migration.github.io/sdformat-gh-pages/#!/osrf/sdformat/issues/95)) by John Hsu (Bitbucket: [hsu](https://bitbucket.org/%7B0a186eae-abf0-4514-a951-23db5…
-
[The multi-query attention paper](https://arxiv.org/pdf/1911.02150.pdf) reports up to 10x speed-ups compared to incremental decoding with a multi-head attention model. We've implemented multi-query atte…