PABannier opened 1 year ago
Great work! Have you considered rwkv.cpp, which is also a form of RNN? From my limited knowledge, I think it's superior to LSTM.
Maybe we should refactor `ggml_cgraph` to allow nodes to be allocated dynamically. Something like this:
```c
struct ggml_cgraph {
    int n_nodes;
    int n_leafs;

    struct ggml_tensor * nodes_stack[GGML_MAX_NODES];
    struct ggml_tensor * grads_stack[GGML_MAX_NODES];
    struct ggml_tensor * leafs_stack[GGML_MAX_NODES];

    // by default we work on the stack
    // (initializers shown for brevity; in C they would be set in an init function)
    struct ggml_tensor ** nodes = nodes_stack;
    struct ggml_tensor ** grads = grads_stack;
    struct ggml_tensor ** leafs = leafs_stack;

    void * visited_hash_table[GGML_GRAPH_HASHTABLE_SIZE];

    // performance
    int perf_runs;
    int64_t perf_cycles;
    int64_t perf_time_us;
};
```
When we allocate the graph on the stack, we initialize `nodes`, `grads`, and `leafs` to point to the stack arrays. When we allocate it on the heap, we allow an arbitrary number of nodes to be passed, `malloc` the arrays, and redirect `nodes`, `grads`, and `leafs` to them.

The reason I am thinking about something like this is that I want to keep the option of having the graph fully allocated on the stack when it fits within `GGML_MAX_NODES`.
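The stack-default / heap-fallback initialization described above could be sketched roughly like this (a minimal illustration; `cgraph`, `cgraph_init`, and `cgraph_init_heap` are hypothetical names, not the real ggml API, and only the `nodes` array is shown):

```c
#include <stdlib.h>

#define GGML_MAX_NODES 16  /* tiny value, just for illustration */

struct ggml_tensor;  /* opaque here */

struct cgraph {
    int n_nodes;
    struct ggml_tensor * nodes_stack[GGML_MAX_NODES];
    struct ggml_tensor ** nodes;  /* points into nodes_stack or a heap array */
};

/* default: point at the inline stack storage, no allocation */
static void cgraph_init(struct cgraph * g) {
    g->n_nodes = 0;
    g->nodes   = g->nodes_stack;
}

/* heap variant: allocate room for an arbitrary node count */
static int cgraph_init_heap(struct cgraph * g, int max_nodes) {
    g->n_nodes = 0;
    if (max_nodes <= GGML_MAX_NODES) {
        g->nodes = g->nodes_stack;  /* fits inline, reuse the stack arrays */
        return 0;
    }
    g->nodes = malloc((size_t) max_nodes * sizeof(*g->nodes));
    return g->nodes ? 0 : -1;
}
```

A caller would pair `cgraph_init_heap` with a matching `free(g->nodes)` only when `g->nodes != g->nodes_stack`.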
Any other suggestions? cc @slaren
This is probably the best we can do for now. An improvement would be to move the storage to the end of the struct, so that the space reserved for the stack can still be used when allocated on the heap. I.e.:
```c
struct ggml_cgraph {
    int n_nodes;
    int n_leafs;

    // by default we work on the stack
    struct ggml_tensor ** nodes = storage + 0;
    struct ggml_tensor ** grads = storage + GGML_MAX_NODES;
    struct ggml_tensor ** leafs = storage + GGML_MAX_NODES * 2;
    void ** visited_hash_table  = (void **) (storage + GGML_MAX_NODES * 3);

    // performance
    int perf_runs;
    int64_t perf_cycles;
    int64_t perf_time_us;

    // may be larger than this when allocated on the heap
    struct ggml_tensor * storage[GGML_MAX_NODES * 3 + GGML_GRAPH_HASHTABLE_SIZE];
};
```
The hash table must also grow to hold all the nodes and leafs, and the size should be a prime number, otherwise the number of collisions will be huge and the performance will be affected.
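A common way to satisfy that sizing requirement is to scan upward from roughly twice the entry count until a prime is found, keeping the load factor around 0.5. A sketch (the helper names are hypothetical, not ggml functions; trial division is fine at these sizes):

```c
#include <stdbool.h>
#include <stddef.h>

/* trial-division primality test; O(sqrt(n)), cheap for table-sized n */
static bool is_prime(size_t n) {
    if (n < 2) return false;
    for (size_t d = 2; d * d <= n; d++) {
        if (n % d == 0) return false;
    }
    return true;
}

/* smallest prime >= 2 * n_entries + 1, so the load factor stays near 0.5 */
static size_t hash_table_size(size_t n_entries) {
    size_t n = 2 * n_entries + 1;
    while (!is_prime(n)) n++;
    return n;
}
```

For example, a graph with 100 nodes and leafs would get a 211-slot table.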
In the long term, I hope we will be able to remove the node limits from the computation graphs and calculate the required memory automatically, but that's probably going to require a larger redesign.
Looking at the rwkv.cpp repo, I think the easiest solution for now is to move the computational graph of Encodec to the heap. I have skimmed through the code and it seems like they build a computational graph per time point.
You can take a look at my work. Basically, the computation graph (a GRU in this case) includes only the forward pass of one cell. To loop through the entire sequence, at each timestep we reset the input and call `ggml_graph_compute_with_ctx()`:
https://github.com/datduonguva/ggml-experiments/blob/master/rnn_text_gen/rnn_text_generation.cpp
Hope this helps! (I am very new to the project so my approach might not be correct).
Hello everyone!
I'm currently enhancing the GGML implementation of an LSTM network. My main focus is avoiding scalability issues with the computational graph.
Currently I'm setting GGML_MAX_NODES to a very high value (100k): https://github.com/PABannier/bark.cpp/blob/main/ggml.h#L206C32-L206C38
This is due to the fact that for the LSTM network the number of nodes in the computational graph grows with the sequence length: https://github.com/PABannier/bark.cpp/blob/main/encodec.cpp#L81C25-L81C36
I wanted a quick fix in order to have a first POC of bark.cpp. Now that we want to clean things up, I'm wondering what the best solution is.
I was wondering whether we should create one ggml context and computational graph per time point to avoid these scalability issues in the graph. Each subgraph would run the forward pass for that time point and produce the output of one cell. It feels hacky and quite costly, considering the overhead of building the graph, copying tensors from one context to another, etc.
What do you think would be the best solution?