BLAS-like prompt parallelization? (sequence processing mode)

RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model

MIT License


Closed paryska99 closed 1 year ago

paryska99 commented 1 year ago

Is it possible to speed up prompt processing with a GPU, the way cuBLAS or CLBlast do for CPU-hosted LLaMA models and others?

saharNooby commented 1 year ago

It is possible, but it would require implementing a sequence processing mode. Currently only RNN mode is implemented, which processes the prompt token by token.
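To illustrate the difference between the two modes, here is a minimal toy sketch in plain Python. It is not rwkv.cpp code: the function names, the matrix `W`, and the inputs are all hypothetical, and it covers only a stateless weight projection, not RWKV's recurrent state update. In token-by-token mode each prompt token needs its own matrix-vector product, while a sequence mode could stack the tokens and issue one matrix-matrix product, which is the kind of operation BLAS libraries accelerate.

```python
def matvec(W, x):
    """Multiply matrix W (rows x cols) by a single vector x (length cols)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(W, X):
    """Multiply W (rows x cols) by X (cols x n_tokens) in one call."""
    columns = list(zip(*X))  # one column of X per token
    return [[sum(w * xi for w, xi in zip(row, col)) for col in columns]
            for row in W]


def project_token_by_token(W, tokens):
    """RNN-style (hypothetical helper): one matrix-vector product per token."""
    return [matvec(W, x) for x in tokens]


def project_as_sequence(W, tokens):
    """Sequence-style (hypothetical helper): one matrix-matrix product."""
    X = [list(col) for col in zip(*tokens)]   # stack tokens as columns
    Y = matmul(W, X)                          # rows x n_tokens
    return [list(col) for col in zip(*Y)]     # back to one vector per token


# Toy weights and three "token embeddings" (made-up values).
W = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
tokens = [[1.0, 0.0], [0.5, 0.5], [2.0, 1.0]]

# Both modes compute the same projections; only the call pattern differs.
assert project_token_by_token(W, tokens) == project_as_sequence(W, tokens)
```

The results are identical; the gain in a real implementation would come from handing the single large product to an optimized GPU or BLAS kernel instead of looping over tokens. The recurrent part of the model would still need separate handling in any sequence mode.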