-
### Describe the bug
Hi all, while profiling open-llama-3b on an Arc A770, I found that aten::bmm is extremely slow in float16 compared to float32 (111.4 ms vs 22.5 ms). I wonder whether this …
-
After depositing from the mainchain via the sidechains tab to the sidechain address shown in the mainchain > transfer tab, the deposit does not become available or visible at all in the sidechain client, even …
-
I tried to implement the FA loss following your answers, but I ran into some questions about the relation graph.
My test code is:
x = np.random.random((256, 64, 64))
y = np.random.random((256, 64, 64))
y = torch.from…
-
I am trying dynamic quantization for the Hugging Face T5-small model on Graviton3. I have used
``` torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8) ```
In…
-
First question: I don't understand why points are added below:
`pred = torch.add(torch.bmm(model_points, base), points + pred_t)`
The second question is about iterative optimization. Why is the follow…
-
I found log-bmm very useful for linear-chain CRFs, both to save memory and to speed things up. In context-free grammars, however, rules of the form A->BC require large amounts of GPU memory, which is a more serious problem. So it is difficult to incre…
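
For readers unfamiliar with the term, here is a minimal NumPy sketch of what a log-space batched matrix product computes, assuming "log-bmm" means a batched matmul over the log semiring. The function name `log_bmm` is hypothetical and is not taken from any library. This naive version materializes a `(B, N, K, M)` intermediate, which is the kind of memory cost a fused kernel is presumably meant to avoid.

```python
import numpy as np

def log_bmm(log_a, log_b):
    """Naive batched matrix product in log space (illustrative only).

    log_a has shape (B, N, K) and log_b has shape (B, K, M).
    Returns an array of shape (B, N, M) with
        out[b, i, j] = log(sum_k exp(log_a[b, i, k] + log_b[b, k, j])).
    Assumes all entries are finite.
    """
    # Broadcast to (B, N, K, M); this intermediate dominates memory use.
    scores = log_a[:, :, :, None] + log_b[:, None, :, :]
    # Subtract the max over k before exponentiating, for numerical stability.
    m = scores.max(axis=2, keepdims=True)
    return np.log(np.exp(scores - m).sum(axis=2)) + m.squeeze(axis=2)

# Check against an ordinary batched matmul on strictly positive inputs.
rng = np.random.default_rng(0)
a = rng.random((2, 3, 4)) + 0.1
b = rng.random((2, 4, 5)) + 0.1
result = log_bmm(np.log(a), np.log(b))
expected = np.log(np.matmul(a, b))
```

On these inputs `result` should match `expected` up to floating-point error, since the log of an ordinary matrix product equals the log-sum-exp over the inner dimension.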
-
The output of the bandwidth profile is as follows:
size: 0.25 MB, gpu-to-cpu bandwidth: 5.505 GB/s
size: 32.00 MB, gpu-to-cpu bandwidth: 13.220 GB/s
size: 128.00 MB, gpu-to-cpu bandwidth: 13.324 GB/…
xvanQ updated
5 months ago
-
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:1! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
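
This error is raised when the two operands of `torch.bmm` live on different devices. One possible workaround, sketched below under the assumption that moving the second operand is acceptable for the workload, is to put both tensors on the same device before multiplying. The helper name `bmm_same_device` is hypothetical, and the example runs on CPU so it does not need multiple GPUs.

```python
import torch

def bmm_same_device(a, b):
    """Run torch.bmm after moving `b` to the device of `a` (illustrative helper)."""
    # torch.bmm requires both operands on one device; copy `b` over if needed.
    if b.device != a.device:
        b = b.to(a.device)
    return torch.bmm(a, b)

# CPU-only usage example; with GPUs, `a` and `b` could start on cuda:1 and cuda:2.
a = torch.randn(4, 2, 3)
b = torch.randn(4, 3, 5)
out = bmm_same_device(a, b)
```

With model-parallel setups it may be preferable to fix the device placement at the source rather than copying tensors on every call, since each cross-device copy adds transfer cost.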
-
Failure:
```
- RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/sharded/sharded_op.cpp:42: this->grid_size.x grid_size.y
-
What if I have a set of matrices instead of a set of vectors? Is it possible to extend the Set Transformer framework to cover that scenario?
I played around with it a little (including making some …