RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

[QUESTION] Implementing RNN/LSTM with ggml #136

Closed PABannier closed 1 year ago

PABannier commented 1 year ago

Hi @saharNooby !

I did not know where to post this message to reach you, so I opened an issue :)

We currently have this issue on ggml (https://github.com/ggerganov/ggml/issues/467). I'm trying to implement an LSTM layer with ggml. However, since the computational graph of the LSTM layer grows with the sequence length, I am often limited by the GGML_MAX_NODES constant. I was told rwkv.cpp implements a serial graph which closely resembles the graph an RNN would have.

Diving into your code, I see that you're storing the computational graph on the heap to avoid stack overflows with these very large graphs. What is less clear is how you make sure your computational graph does not exceed GGML_MAX_NODES nodes.

Could you explain to me how you designed your forward pass? Do you build a computational graph per time point of the sequence?

Thanks in advance for your answer!

saharNooby commented 1 year ago

Hi!

I realize that you're storing the computational graph on the heap to avoid stack overflows with these very large graphs

That's correct; note also that other structures (at least cplan) need to be allocated on the heap as well, because they grow with GGML_MAX_NODES just like cgraph does.
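To illustrate why heap allocation matters here, a minimal sketch follows. The struct names and sizes are hypothetical stand-ins, not the real ggml definitions; the point is only that a graph struct whose node array is sized by a compile-time node limit quickly outgrows the default thread stack, so it must be `calloc`'d instead of declared as a local variable.

```c
/* Hypothetical sketch (NOT the real ggml API): a graph struct whose
   size grows linearly with a compile-time node limit. */
#include <stdlib.h>
#include <stddef.h>

#define FAKE_MAX_NODES 80000  /* illustrative: the fork bumps the limit from 4096 to 80K */

struct fake_node {
    void * src0;
    void * src1;
    int    op;
};

/* Stand-in for a cgraph-like struct: its footprint is dominated by the
   fixed-size node array, roughly 1.9 MB at 80K nodes on a 64-bit system,
   which would overflow a typical 1-8 MB thread stack if declared locally. */
struct fake_cgraph {
    int n_nodes;
    struct fake_node nodes[FAKE_MAX_NODES];
};

/* Allocate the graph on the heap, zero-initialized. */
struct fake_cgraph * alloc_graph(void) {
    return calloc(1, sizeof(struct fake_cgraph));
}

size_t graph_size_bytes(void) {
    return sizeof(struct fake_cgraph);
}
```

A cplan-like struct sized by the same constant would need the same treatment, which matches the note above.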

What is less clear is how you're making sure your computational graph is not made up of more than GGML_MAX_NODES nodes

It's simple -- I don't! I just use a fork of ggml with GGML_MAX_NODES changed from 4096 to 80K. This allows inference of 14B models (the maximum available RWKV size at the moment) with sequence length 64 (@LoganDark verified experimentally that beyond this length the performance gains are minimal).

See this single commit in my fork of ggml -- aside from raising GGML_MAX_NODES, some APIs also need to be changed to support heap allocation.

Since ggml is already pinned to a specific commit when used in rwkv.cpp, introducing a fork with one small commit doesn't add much development overhead.

Do you build a computational graph per time point of the sequence?

I don't really know; the sequence eval code was contributed by @LoganDark. As far as I know, for some operations the ggml tensor count grows with the sequence length, while for others it stays constant (for example, there is only one op for the head matmul).


I'll close the issue so it doesn't take up space in Open, but feel free to continue the conversation here.

LoganDark commented 1 year ago

Do you build a computational graph per time point of the sequence?

The computation graph is only built the first time a given sequence length is seen. After that, the existing sequence-mode computation graph is reused as long as the sequence length stays the same.
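The reuse strategy described above can be sketched roughly as follows. This is a hypothetical mock, not the actual rwkv.cpp code: the `graph` struct and function names are invented for illustration, and a `builds` counter is included only to make the caching behavior observable.

```c
/* Hypothetical sketch of graph reuse keyed on sequence length:
   rebuild only when the sequence length changes. */
#include <stdlib.h>

struct graph {
    int seq_len;
    /* ... nodes would live here in a real implementation ... */
};

static struct graph * cached = NULL;
static int builds = 0;  /* counts actual rebuilds, for illustration only */

/* Stand-in for the expensive graph construction step. */
static struct graph * build_graph(int seq_len) {
    struct graph * g = malloc(sizeof *g);
    g->seq_len = seq_len;
    builds++;
    return g;
}

/* Return a graph for seq_len, rebuilding only when the length changes. */
struct graph * get_graph(int seq_len) {
    if (cached == NULL || cached->seq_len != seq_len) {
        free(cached);
        cached = build_graph(seq_len);
    }
    return cached;
}

int build_count(void) {
    return builds;
}
```

Under this scheme, evaluating many sequences of the same length pays the graph-construction cost only once.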

LoganDark commented 1 year ago

What is less clear is how you're making sure your computational graph is not made up of more than GGML_MAX_NODES nodes.

we simply increase that constant before building :)

PABannier commented 1 year ago

@saharNooby Alright, thanks for the detailed answer.