ggerganov / ggml

Tensor library for machine learning
MIT License

Computational graph grows with the sequence length #467

Open PABannier opened 1 year ago

PABannier commented 1 year ago

Hello everyone!

I'm currently enhancing the GGML implementation of an LSTM network. My main goal is to avoid scalability issues with the computational graph.

Currently I'm setting GGML_MAX_NODES to a very high value (100k): https://github.com/PABannier/bark.cpp/blob/main/ggml.h#L206C32-L206C38

This is because, for the LSTM network, the number of nodes in the computational graph grows with the sequence length: https://github.com/PABannier/bark.cpp/blob/main/encodec.cpp#L81C25-L81C36 .

I wanted a quick fix in order to get a first POC of bark.cpp. Now that we want to clean things up, I'm wondering: what's the best solution?

I was wondering whether we should create one ggml context and computational graph per time point to avoid these scalability issues in the graph: one subgraph would be created per time point, run the forward pass, and produce the result of one cell. It feels hacky and quite costly, though, considering the overhead of building the graph, copying tensors from one context to another, etc.

What do you think would be the best solution?

lin72h commented 1 year ago

Great work! Have you considered rwkv.cpp? It is also a form of RNN, and from my limited knowledge I think it's superior to LSTM.

ggerganov commented 1 year ago

Maybe we should refactor ggml_cgraph to allow nodes to be allocated dynamically. Something like this:

    struct ggml_cgraph {
        int n_nodes;
        int n_leafs;

        struct ggml_tensor * nodes_stack[GGML_MAX_NODES];
        struct ggml_tensor * grads_stack[GGML_MAX_NODES];
        struct ggml_tensor * leafs_stack[GGML_MAX_NODES];

        // by default we work on the stack
        struct ggml_tensor ** nodes = nodes_stack;
        struct ggml_tensor ** grads = grads_stack;
        struct ggml_tensor ** leafs = leafs_stack;

        void * visited_hash_table[GGML_GRAPH_HASHTABLE_SIZE];

        // performance
        int     perf_runs;
        int64_t perf_cycles;
        int64_t perf_time_us;
    };

When we allocate the graph on the stack, we initialize nodes, grads and leafs to point to the stack arrays. When we allocate it on the heap, we allow passing an arbitrary number of nodes, which we malloc, and redirect nodes, grads and leafs to them.
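Since ggml is plain C (which has no default member initializers as written in the sketch above), the "point at the stack arrays by default, redirect to the heap when needed" idea would in practice become a pair of helper functions. A minimal standalone sketch of the pattern, with all names (cgraph, cgraph_init, cgraph_grow, MAX_NODES) hypothetical stand-ins rather than ggml's actual API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 4 /* tiny stand-in for GGML_MAX_NODES */

/* simplified graph: fixed embedded arrays plus pointers that
 * normally alias them, but can be redirected to the heap */
struct cgraph {
    int     n_nodes;
    void *  nodes_stack[MAX_NODES];
    void ** nodes; /* == nodes_stack, or a larger heap array */
};

/* small-graph case: work on the embedded storage */
static void cgraph_init(struct cgraph * g) {
    g->n_nodes = 0;
    g->nodes   = g->nodes_stack;
}

/* large-graph case: redirect to a heap array of n slots,
 * preserving any nodes recorded so far */
static int cgraph_grow(struct cgraph * g, int n) {
    if (n <= MAX_NODES) {
        return 0; /* embedded storage already suffices */
    }
    void ** heap = calloc((size_t) n, sizeof(void *));
    if (heap == NULL) {
        return -1;
    }
    memcpy(heap, g->nodes, (size_t) g->n_nodes * sizeof(void *));
    if (g->nodes != g->nodes_stack) {
        free(g->nodes); /* previous heap allocation */
    }
    g->nodes = heap;
    return 0;
}
```

The grads and leafs arrays would follow the same pattern; callers that stay under MAX_NODES never touch the allocator.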

The reason I am thinking about something like this is that I want to keep the option of having the graph fully allocated on the stack when it fits in GGML_MAX_NODES.

Any other suggestions? cc @slaren

slaren commented 1 year ago

This is probably the best we can do for now. An improvement would be to move the storage to the end of the struct, so that the space reserved for the stack can still be used when the graph is allocated on the heap. I.e.:

    struct ggml_cgraph {
        int n_nodes;
        int n_leafs;

        // by default we work on the stack
        struct ggml_tensor ** nodes = storage + 0;
        struct ggml_tensor ** grads = storage + GGML_MAX_NODES;
        struct ggml_tensor ** leafs = storage + GGML_MAX_NODES * 2;

        void ** visited_hash_table = storage + GGML_MAX_NODES * 3;

        // performance
        int     perf_runs;
        int64_t perf_cycles;
        int64_t perf_time_us;

        // may be larger than this when allocated on the heap
        struct ggml_tensor * storage[GGML_MAX_NODES * 3 + GGML_GRAPH_HASHTABLE_SIZE];
    };
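With the storage at the end of the struct, the heap case becomes a single over-sized allocation whose trailing bytes extend the embedded array (the classic C "struct hack"). A simplified, standalone sketch of that allocation, with hypothetical names (cgraph, cgraph_alloc, MAX_NODES) and the hash table omitted for brevity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 4 /* tiny stand-in for GGML_MAX_NODES */

/* simplified version of the proposed layout: storage last,
 * so a heap allocation can simply make it longer */
struct cgraph {
    int     n_nodes;
    void ** nodes; /* -> storage + 0        */
    void ** grads; /* -> storage + capacity */
    void *  storage[2 * MAX_NODES];
};

/* allocate a graph whose storage holds `capacity` nodes; when
 * capacity <= MAX_NODES the embedded array is already enough */
static struct cgraph * cgraph_alloc(int capacity) {
    int    cap   = capacity > MAX_NODES ? capacity : MAX_NODES;
    size_t extra = (size_t)(cap - MAX_NODES) * 2 * sizeof(void *);

    struct cgraph * g = calloc(1, sizeof(*g) + extra);
    if (g == NULL) {
        return NULL;
    }
    g->n_nodes = 0;
    g->nodes   = g->storage;       /* storage + 0   */
    g->grads   = g->storage + cap; /* storage + cap */
    return g;
}
```

A stack-allocated graph would instead just set the pointers into its own storage array, with no malloc at all.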

The hash table must also grow to hold all the nodes and leafs, and its size should be a prime number; otherwise the number of collisions will be huge and performance will suffer.
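Picking a prime table size at runtime only needs a tiny helper; trial division is cheap at these sizes since it runs once per allocation. A sketch, with both function names (is_prime, next_prime) hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* trial division; fine for one-off sizing decisions */
static bool is_prime(size_t n) {
    if (n < 2) {
        return false;
    }
    for (size_t d = 2; d * d <= n; d++) {
        if (n % d == 0) {
            return false;
        }
    }
    return true;
}

/* smallest prime >= n, e.g. for sizing the visited hash
 * table after the required node count is known */
static size_t next_prime(size_t n) {
    while (!is_prime(n)) {
        n++;
    }
    return n;
}
```

Alternatively, a small hard-coded table of precomputed primes (one per power of two) avoids even this loop.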

In the long term, I hope we will be able to remove the node limits from the computation graphs and calculate the required memory automatically, but that's probably going to require a larger redesign.

PABannier commented 1 year ago

Looking at the rwkv repo, I think the easiest solution for now is to move the computational graph of Encodec to the heap. I have skimmed through the code, and it seems they build a computational graph per time point.

datduonguva commented 1 year ago

You can take a look at my work. Basically, the computation graph (a GRU in this case) includes only the forward pass of one cell. To loop through the entire sequence, at each timestep we reset the input and call ggml_graph_compute_with_ctx().

https://github.com/datduonguva/ggml-experiments/blob/master/rnn_text_gen/rnn_text_generation.cpp

Hope this helps! (I am very new to the project, so my approach might not be correct.)