Closed PABannier closed 1 year ago
Hi!
I realize that you're storing the computational graph on the heap to avoid stack overflows with these very large graphs.
That's correct; please also note that other structures (cplan at least) also need to be allocated on the heap, because they grow with GGML_MAX_NODES just like cgraphs do.
What is less clear is how you're making sure your computational graph is not made up of more than GGML_MAX_NODES nodes.
It's simple -- I don't! I just use a fork of ggml with GGML_MAX_NODES changed from 4096 to 80K. This allows inference of 14B models (the largest RWKV size available at the moment) at sequence length 64 (it was experimentally verified by @LoganDark that beyond this length the performance gains are minimal).

See this single commit in my fork of ggml -- aside from raising GGML_MAX_NODES, some APIs also need to be changed to support heap allocation.
Since ggml is already pinned to a specific commit when used in rwkv.cpp, introducing a fork with one small commit doesn't add much development overhead.
Do you build a computational graph per time point of the sequence?
I don't really know; the sequence eval code was contributed by @LoganDark. As far as I know, the ggml tensor count grows with sequence length for some operations, while for others it stays constant (for example, there is only one op for the head matmul).
I'll close the issue so it's not taking up space in the Open list, but feel free to continue the conversation here.
Do you build a computational graph per time point of the sequence?
The computation graph is only rebuilt when the sequence length changes; the existing sequence-mode computation graph is reused as long as the sequence length stays the same.
What is less clear is how you're making sure your computational graph is not made up of more than GGML_MAX_NODES nodes.
We simply increase that constant before building :)
@saharNooby Alright, thanks for the detailed answer.
Hi @saharNooby !
I did not know where to post this message to reach you, so I opened an issue :)
We currently have this issue on ggml (https://github.com/ggerganov/ggml/issues/467). I'm trying to implement an LSTM layer with ggml. However, since the computational graph of the LSTM layer grows with the sequence length, I am often limited by the GGML_MAX_NODES constant. I was told rwkv.cpp implemented a serial graph which closely resembles the graph an RNN would have.

Diving into your code, I realize that you're storing the computational graph on the heap to avoid stack overflows with these very large graphs. What is less clear is how you're making sure your computational graph is not made up of more than GGML_MAX_NODES nodes.

Could you explain to me how you designed your forward pass? Do you build a computational graph per time point of the sequence?

Thanks in advance for your answer!