-
This is for the CUDA version. If the CUDA kernel launch fails, the run fails validation but is still included in the results, so the "best gflop/s" figure will be too large since the kernel time was ve…
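One way to keep a failed launch from skewing the summary is to filter invalid runs before computing the best rate. A minimal Python sketch, with hypothetical names (`BenchResult`, `best_gflops` are illustrative, not the benchmark's actual types):

```python
# Hypothetical sketch: exclude runs that failed validation before reporting
# "best gflop/s". A failed launch returns almost instantly, which yields an
# absurdly high computed rate if it is left in the result set.
from dataclasses import dataclass

@dataclass
class BenchResult:
    gflops: float   # throughput derived from the measured kernel time
    valid: bool     # False when the launch failed / output failed validation

def best_gflops(results):
    """Best GFLOP/s among runs that passed validation, or None."""
    ok = [r.gflops for r in results if r.valid]
    return max(ok) if ok else None

runs = [BenchResult(120.0, True), BenchResult(9_999.0, False), BenchResult(135.5, True)]
print(best_gflops(runs))  # 135.5, not the bogus 9999 from the failed launch
```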
-
Following post #370, and akin to the kernel / likelihood quadrature computation, it would be good to have top-level default methods for these defined on the `Prior` that say what solver and root method we'd like to…
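A possible shape for such defaults, sketched in Python. The method names (`default_solver`, `default_root_method`) and the solver strings are assumptions for illustration, not the library's actual API:

```python
# Hedged sketch: class-level defaults on Prior that subclasses can override,
# so callers query the prior instead of hardcoding a solver choice.
class Prior:
    def default_solver(self) -> str:
        return "gauss_legendre"     # assumed top-level default

    def default_root_method(self) -> str:
        return "brentq"             # assumed top-level default

class HeavyTailedPrior(Prior):
    # A subclass can opt into a root method better suited to its shape,
    # while inheriting the default solver unchanged.
    def default_root_method(self) -> str:
        return "bisect"
```

Call sites would then do `prior.default_solver()` rather than baking a method name into each quadrature routine.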
-
* research if/how these are implemented in each backend
* research the different methodologies to assess scope
- enumerate strategies, e.g. Basis, Amplitude, etc.
- document constraints, e.g. …
-
I will work on this in the branch for #129. This issue documents ideas for asynchronous GPU execution, allowing GPU and CPU computation to run simultaneously.
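The overlap pattern can be illustrated CPU-only, with a background thread standing in for a CUDA stream (the sleep duration and function names are purely illustrative, not real GPU calls): launch asynchronously, do host work, then synchronize.

```python
# CPU-only simulation of the async launch pattern:
#   launch kernel on stream  ->  submit to a worker thread
#   overlapped CPU work      ->  ordinary host computation
#   stream synchronize       ->  future.result()
import time
from concurrent.futures import ThreadPoolExecutor

def device_work():
    time.sleep(0.05)          # stands in for an asynchronously running kernel
    return "gpu-result"

def cpu_work():
    return sum(i * i for i in range(10_000))

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(device_work)   # async "launch", returns immediately
    cpu_out = cpu_work()             # CPU computes while "device" is busy
    gpu_out = fut.result()           # block until the "device" finishes

print(cpu_out, gpu_out)
```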
-
## TODO
- [x] Optimize JIT, fix memory planner #193
- [x] Complete test-suite/test-dynamic-shape.lisp
- [x] More tests on the JIT kernel accuracy (compared to PyTorch, like Multi Head Attention an…
-
I ran the program on an x86 machine using oneDNN as the backend library and on an ARM machine using the default library. The TensorBoard profiling data shows blank waiting times on the ARM machine, wh…
-
## Description
Consider adding an additional FusedCrossEntropyLoss kernel to the FOAK set of kernels, given the improvement seen when using it in earlier tests (see Background below).
Considerati…
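The memory saving behind a fused/chunked cross-entropy can be sketched in plain NumPy. This is a reference implementation of the idea, not the actual FOAK/Triton kernel: the loss is computed chunk by chunk, so the full `[batch, vocab]` logits matrix is never materialized all at once.

```python
import numpy as np

def chunked_ce(hidden, weight, targets, chunk=2):
    """Mean cross-entropy over logits = hidden @ weight.T, computed per row
    chunk so only a [chunk, vocab] slice of logits exists at a time."""
    losses = []
    for s in range(0, hidden.shape[0], chunk):
        logits = hidden[s:s + chunk] @ weight.T        # [chunk, vocab]
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        t = targets[s:s + chunk]
        losses.append(logsumexp - logits[np.arange(len(t)), t])
    return np.concatenate(losses).mean()

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))          # hidden states
w = rng.normal(size=(5, 3))          # output projection (vocab=5)
t = np.array([0, 2, 4, 1])           # target token ids
print(chunked_ce(h, w, t))
```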
-
CUDA events suffer from low accuracy and include the kernel launch overhead. CUPTI, by contrast, provides a more reliable way to obtain consistent timing measurements.
This request asks to add an op…
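The distinction can be illustrated with a toy model (made-up numbers, no real GPU calls): event-bracketed timing measures the host-visible interval, which includes launch overhead, while a device-side activity record like CUPTI's reports the kernel's own start/end timestamps.

```python
# Toy model only: illustrative constants, not measurements.
LAUNCH_OVERHEAD_US = 5.0   # assumed host -> device launch/dispatch cost
kernel_us = 2.0            # actual device execution time of a short kernel

event_measured = LAUNCH_OVERHEAD_US + kernel_us   # events bracket the launch
cupti_measured = kernel_us                        # device-side activity record

# For short kernels the launch cost dominates the event-based number.
error = (event_measured - cupti_measured) / cupti_measured
print(f"overstatement: {error:.0%}")  # 250% for these toy numbers
```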
-
I was just thinking about this idea, so writing it down for future research.
We should be able to fairly easily generate model-specific Metal code that has hardcoded kernels for every single node in …
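A minimal Python sketch of the codegen idea, using a hypothetical template and names: the kernel source is specialized per graph node, so shapes become compile-time constants in the emitted Metal and no shape logic survives to runtime.

```python
# Hypothetical per-node codegen: bake the element count into the source so
# the generated Metal kernel carries no runtime shape checks.
KERNEL_TEMPLATE = """\
kernel void {name}(device float* out, device const float* a,
                   device const float* b, uint i [[thread_position_in_grid]]) {{
    if (i < {n}u) out[i] = a[i] + b[i];   // size {n} baked in at codegen time
}}
"""

def gen_add_kernel(node_id: int, numel: int) -> str:
    """Emit a node-specific elementwise-add kernel with a hardcoded size."""
    return KERNEL_TEMPLATE.format(name=f"node{node_id}_add", n=numel)

src = gen_add_kernel(3, 1024)
print("node3_add" in src and "1024u" in src)  # True
```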
-
I am following the [instructions in the Llama2 README](https://github.com/pytorch/executorch/blob/d9aeca556566104c2594ec482a673b9ec5b11390/examples/models/llama2/README.md#instructions) to test llama m…