One of the major demons I fought while working on https://github.com/saharNooby/rwkv.cpp/pull/74 is ggml's mysterious computation graph work tensor, which is allocated the first time ggml_graph_compute is called. I was trying to estimate the memory usage of the graph exactly, so I manually counted objects and calls to ggml functions while the graph was being built. But once I had the memory usage down to the last byte, ggml_graph_compute still tried to allocate a seemingly arbitrary amount of memory.
If ggml provided a library function to estimate the size of the computation graph work tensor, then instead of guessing I could call that function and allocate a new scratch buffer to contain it. That's slightly less optimal than doing it during context construction, but at that point I don't have a context or a graph yet, and can't create one because doing so itself requires memory (go figure).
It would also be nice if I could tell ggml to allocate that work tensor early without having to actually do any graph computation.
I agree - the current creation of the "work" tensor by ggml_graph_compute() is a bad design decision.
I also had trouble with it recently. Will fix this.
I didn't want to over-estimate for smaller models, or especially under-estimate for larger models. It took a while to debug which tensor was the culprit (the one backing the largest mat-mul). As a workaround I hardcoded the dimensions of this tensor to estimate an upper bound for the computation graph work tensor, but this is not a great solution.