kroggen closed this 1 year ago
@kroggen, please check the latest implementation in this branch: https://github.com/ankan-ban/llama2.cu/tree/opt. It fixes the inefficient memory access issue.
Cool!
But it is not so easy to understand.
The main branch is better for learning purposes.
Have you benchmarked the 2 branches?
I ran a test with this code, but the output is not correct.
I suspect this is because it does many more conversions between FP32 and FP16.
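To make the suspicion above concrete, here is a minimal CPU-side sketch in Python of how extra rounding to FP16 can change a result. It is not code from either branch: the helper names (`to_fp16`, `to_fp32`, `dot_fp32_accum`, `dot_fp16_accum`) and the test vectors are made up for illustration, and it does not show that this is what actually breaks the output. Note that converting an FP16 value to FP32 and back is exact, so any loss would come from rounding intermediate results to FP16, for example when accumulating a dot product.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_fp32_accum(a, b):
    """Dot product with FP16 inputs but an FP32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp32(to_fp16(x) * to_fp16(y))
        acc = to_fp32(acc + prod)
    return acc

def dot_fp16_accum(a, b):
    """Same dot product, but every intermediate is rounded back to FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + prod)
    return acc

if __name__ == "__main__":
    # Hypothetical data: many small terms, as in a long reduction.
    a = [0.001] * 4096
    b = [1.0] * 4096
    print("FP32 accumulator:", dot_fp32_accum(a, b))
    print("FP16 accumulator:", dot_fp16_accum(a, b))
    # An FP16 value survives a trip through FP32 unchanged.
    h = to_fp16(0.1)
    print("round trip exact:", to_fp16(to_fp32(h)) == h)
```

If the two accumulators print visibly different sums, the difference comes purely from where the rounding to FP16 happens. Comparing intermediate tensors between the two branches in the same way (layer by layer) may help locate where the outputs start to diverge.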