chris-ch / llama2-haskell-inference

Haskell version of llama2.c
MIT License

Use accelerate, streamly #1

Open · tkvogt opened 1 month ago

tkvogt commented 1 month ago

This is a really nice project. Have you considered using https://github.com/AccelerateHS/accelerate? Another idea would be to stream the model, because loading the file into memory already uses too much memory and crashes for a 4 GB model. I ran into a similar problem where I needed to fill a judy array from a file, and I came to the conclusion that I had to stream the file into the judy array; see https://github.com/tkvogt/streamly-judy. I would like to work on this.
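A minimal sketch of the Accelerate idea: array code is written against `Data.Array.Accelerate`, and a backend (here, as an assumption, accelerate-llvm-native on the CPU) compiles and fuses it into a parallel loop. A dot product is shown since the transformer's hot path is made of exactly these zipWith/fold shapes:

```haskell
import Data.Array.Accelerate (Acc, Scalar, Z (..), (:.) (..))
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.LLVM.Native as CPU

-- zipWith and fold are fused by the backend into a single parallel pass.
dotp :: Acc (A.Vector Float) -> Acc (A.Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (Z :. 1024) [0 ..] :: A.Vector Float
  print (CPU.run (dotp (A.use xs) (A.use xs)))
```

And a sketch of the streaming idea, assuming streamly-core's 0.2 API (the file name is a placeholder): the checkpoint is consumed as a stream and reduced with a `Fold`, so peak residency stays constant regardless of model size, instead of reading 4 GB into memory at once:

```haskell
import qualified Streamly.Data.Fold as Fold
import qualified Streamly.Data.Stream as Stream
import qualified Streamly.FileSystem.Handle as Handle
import System.IO (IOMode (ReadMode), withFile)

-- Count the checkpoint's bytes without ever holding it in memory;
-- a real loader would fold chunks into preallocated weight buffers.
main :: IO ()
main =
  withFile "model.bin" ReadMode $ \h -> do
    nBytes <- Stream.fold Fold.length (Handle.read h)
    putStrLn ("streamed " ++ show nBytes ++ " bytes in constant space")
```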

chris-ch commented 1 month ago

Thanks a lot, indeed there is definitely room for improvement ... I realised too late that Data.Vector.Storable (https://hackage.haskell.org/package/vector-0.13.1.0/docs/Data-Vector-Storable.html) should perform much better when updating the "AttentionKV" state, because I suspect memory allocation/deallocation is wasting too much time. I tried, with no luck, on the branch codespace-potential-bassoon-67rv7g9grw2rrrx: memory still blows up even for relatively small models (100M).
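A minimal sketch of the mutable-vector idea, with a hypothetical flat layout (the names `KVCache`, `newKVCache`, and `writeRow` are illustrative, not the project's actual `AttentionKV` type): keep the whole cache in one `Data.Vector.Storable.Mutable` buffer and overwrite one row per token in place, so the hot loop allocates nothing:

```haskell
import qualified Data.Vector.Storable as V
import qualified Data.Vector.Storable.Mutable as MV

-- One flat buffer of (nLayers * seqLen * dim) Floats for keys (or values).
type KVCache = MV.IOVector Float

newKVCache :: Int -> Int -> Int -> IO KVCache
newKVCache nLayers seqLen dim = MV.replicate (nLayers * seqLen * dim) 0

-- Overwrite the row for (layer, pos) in place: V.copy is a memcpy into
-- a slice, so no fresh vector is built per token.
writeRow :: KVCache -> Int -> Int -> Int -> Int -> V.Vector Float -> IO ()
writeRow cache seqLen dim layer pos =
  V.copy (MV.slice ((layer * seqLen + pos) * dim) dim cache)
```

`V.copy` requires the source row to have exactly length `dim`; since the buffer is allocated once and only mutated afterwards, there is no per-token allocation for the garbage collector to churn through.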

Anyway, I am at the limit of what I can do in Haskell, so if you have time, please go ahead. I really believe we should be able to get close to C performance-wise.