RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

Sequence mode prototype #89

Closed · LoganDark closed this 1 year ago

LoganDark commented 1 year ago

This is a prototype of sequence mode.

Load model ... 1.318s
Serial mode to process 30 tokens ... 2.116s
Sequence mode to process 30 tokens ... 0.509s
Logits total diff = 0.00000
Logits identical = TRUE

This is only for testing. It runs into precision and capacity limits at large sequence lengths. The goal is to support sequences of up to 25k tokens.

It is also likely that the dedicated single-token functions should be brought back. Again, this is only a prototype.

LoganDark commented 1 year ago

This PR currently depends on https://github.com/ggerganov/ggml/pull/229 due to relying on a single computation graph. This dependency will most likely go away.

LoganDark commented 1 year ago

No implementation advice is needed yet; there is still some low-hanging fruit left for me to take care of.

This is just to get the news out that this is being worked on.

saharNooby commented 1 year ago

Just to be sure -- at line 910 (struct ggml_tensor * x = ggml_get_rows(ctx, model.emb, token_index);) we will get a huge matrix of size sequence_length × n_embed?

LoganDark commented 1 year ago

Just to be sure -- at line 910 (struct ggml_tensor * x = ggml_get_rows(ctx, model.emb, token_index);) we will get a huge matrix of size sequence_length × n_embed?

Yes, width n_embed and height sequence_len.
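
For context, a minimal sketch of the gather step under discussion. ctx, model.emb, and the x = ggml_get_rows(...) line are from the PR; the token_index setup around them is illustrative:

```c
// Sketch (not the exact PR code): ggml_get_rows gathers one embedding row
// per token ID, producing the whole input matrix as a single graph node.
struct ggml_tensor * token_index = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, sequence_len);
// ... fill token_index->data with the sequence's token IDs ...

// Result: x->ne[0] == n_embed (width), x->ne[1] == sequence_len (height).
struct ggml_tensor * x = ggml_get_rows(ctx, model.emb, token_index);
```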

saharNooby commented 1 year ago

Huge matmul should be fast! (so some say; I myself don't know)

LoganDark commented 1 year ago

Huge matmul should be fast! (so some say; I myself don't know)

OP includes a benchmark that shows a 4x speedup on processing 30 tokens :)

LoganDark commented 1 year ago
Load model ... 1.397s
Serial mode to process 30 tokens ... 6.971s
Sequence mode to process 30 tokens ... 0.696s
Logits total diff = 0.00000
Logits identical = TRUE
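
For reference, a harness for this kind of serial-vs-sequence comparison might look like the sketch below. It assumes the single-token rwkv_eval API and a sequence-mode entry point named rwkv_eval_sequence; the exact signatures that landed should be checked against rwkv.h.

```c
// Hypothetical timing harness; function signatures assumed, verify against rwkv.h.
#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include "rwkv.h"

double bench(struct rwkv_context * ctx, const uint32_t * tokens, size_t n,
             float * state, float * logits, bool sequence_mode) {
    clock_t start = clock();
    if (sequence_mode) {
        // One call builds a single graph over the whole sequence.
        rwkv_eval_sequence(ctx, tokens, n, NULL, state, logits);
    } else {
        // Serial mode: feed tokens one at a time, carrying the state forward.
        rwkv_eval(ctx, tokens[0], NULL, state, logits);
        for (size_t i = 1; i < n; i++) {
            rwkv_eval(ctx, tokens[i], state, state, logits);
        }
    }
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```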

Also, tests are passing? A pleasant surprise (I guess I no longer use the faulty ggml function).

LoganDark commented 1 year ago

(screenshots: CI test failures after enabling cuBLAS)

Looks like enabling cuBLAS is what causes the issue, even if no tensors are offloaded/uploaded to the GPU.

Offloading tensors doesn't solve the problem:

(screenshot: test failure with tensors offloaded to the GPU)

This looks like a ggml issue. Maybe I will try to debug ggml and open a bug report.

LoganDark commented 1 year ago

The macOS failure is probably a false positive; it works fine on my Mac, but I need someone on Apple Silicon to test.

LoganDark commented 1 year ago

I think this pull request is ready for a review (not merge yet)

There are still probably some touch-ups I need to do, but the code is almost clean enough to be production ready, I think

It does not support anywhere near 25k tokens (in fact, going too far above 64 will probably crash upstream builds of ggml), but it can be improved in the future.

LoganDark commented 1 year ago

The macOS runner is still being deterministically terrible. Sigh.

saharNooby commented 1 year ago

Great PR! Aside from some nits, concerns are:

saharNooby commented 1 year ago

I'll test sanitizer builds on a new temporary branch, logan-dark-sequence-mode, since I have no permission to push workflow changes directly into this PR.

saharNooby commented 1 year ago

Lol, just adding -DRWKV_SANITIZE_ADDRESS=ON to the build command line fixes the issue, and gives literally zero useful info.

LoganDark commented 1 year ago

Lol, just adding -DRWKV_SANITIZE_ADDRESS=ON to the build command line fixes the issue, and gives literally zero useful info.

I guess we may as well leave it on, since it seems to work fine on my Mac. Real Macs probably have AVX2 and FMA.

saharNooby commented 1 year ago

I guess we may as well leave it on

That's an option, but I have another idea I will try tomorrow: I will literally insert a macro at each line that prints the line number and see on which line it stops executing. I've done that before; it's a crude approach, but it works.

Then, once the line is known, I can try to rewrite that part.
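
A minimal sketch of that trick (the macro name is mine, not from the codebase):

```c
#include <stdio.h>

// Print the current file and line, flushing immediately so the output
// survives a crash. Sprinkle TRACE(); between statements in the suspect
// function; the last line printed narrows down the faulty statement.
#define TRACE() do { \
    fprintf(stderr, "reached %s:%d\n", __FILE__, __LINE__); \
    fflush(stderr); \
} while (0)
```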

saharNooby commented 1 year ago

@LoganDark I've tried to debug it, but I'm out of both ideas and the willingness to go further. Looks like a compiler bug, IDK...

Please change the macOS build command in build.yml to cmake -DRWKV_AVX2=OFF -DRWKV_FMA=OFF -DRWKV_SANITIZE_ADDRESS=ON .., also adding a comment above it: Sanitizer is enabled to work around issues discovered when testing #89. It needs to be disabled as soon as possible (that is, once master can be built on the macOS GitHub runner again). I'll approve the workflow.

saharNooby commented 1 year ago

Oops, closed accidentally.

@LoganDark You've said the PR is ready for review, but not merge; is it still not ready to be merged?

LoganDark commented 1 year ago

is it still not ready to be merged?

I'll perform another pass over it personally and make sure. My main concern is that long sequence lengths build computation graphs that are so large that they cannot fit inside the ggml_cgraph struct; you'll hit ggml asserts at runtime that crash the program due to exceeding the node limit. I don't know if there's any way to fix this as long as ggml doesn't support iteration in cgraphs.
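
For context, a back-of-the-envelope sketch of that constraint. The op counts below are made-up placeholders, not measurements; the real numbers depend on the RWKV graph. The fixed limit itself (GGML_MAX_NODES, 4096 in ggml builds of that era) is what makes long sequences crash:

```c
#include "ggml.h" // for GGML_MAX_NODES

// Illustrative estimate only: the WKV recurrence cannot be batched into one
// matmul, so some ops are unrolled per token and the node count grows
// linearly with sequence length until the fixed-size cgraph overflows.
static bool fits_in_cgraph(int n_layer, int sequence_len) {
    const int prologue_ops           = 8;  // assumed: embedding, final norm, head
    const int fixed_ops_per_layer    = 24; // assumed: matmuls, mixes, layer norms
    const int unrolled_ops_per_token = 2;  // assumed: per-token WKV update steps
    const int estimated = prologue_ops
        + n_layer * (fixed_ops_per_layer + sequence_len * unrolled_ops_per_token);
    return estimated < GGML_MAX_NODES;
}
```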

LoganDark commented 1 year ago

Yeah, 85238418d50d03d9b4f680641fd7d0057dd6be5a is what I missed, lmao!!

LoganDark commented 1 year ago

I think I'd be okay merging this since further improvements can be made in the future.

I am experimenting with making performance graphs that compare sequence mode to serial mode. Here is one in log scale:

(graph: sequence mode vs. serial mode performance, log scale)

That drop around 32 tokens is so not an outlier:

(graph: zoomed view of the drop around 32 tokens)

I don't know what in the world cuBLAS is doing to achieve this, but I'm interested and also scared. lol
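
One plausible explanation (my assumption, worth checking against the ggml source of that era): ggml only hands a matmul off to BLAS/cuBLAS once the operand dimensions are large enough, with a threshold of 32, which would line up with a sharp change right at sequence length 32:

```c
#include "ggml.h"

// Paraphrased/simplified from ggml's BLAS dispatch heuristic of that era
// (see ggml.c for the exact condition): the plain CPU matmul is used until
// every relevant dimension reaches 32. With activations shaped
// [n_embed, sequence_len], the sequence length is the last dimension to
// cross the threshold, so cuBLAS kicks in right around 32 tokens.
static bool mul_mat_use_blas(const struct ggml_tensor * src0,
                             const struct ggml_tensor * src1) {
    return ggml_is_contiguous(src0)
        && ggml_is_contiguous(src1)
        && src0->ne[1] >= 32   // output rows
        && src1->ne[1] >= 32   // sequence length crosses 32 here
        && src1->ne[0] >= 32;  // n_embed
}
```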

saharNooby commented 1 year ago

I have one last comment (assuming nothing else changes in this PR) about the sequence.c file, and then I will be ready to merge.

BTW,

you'll hit ggml asserts at runtime that crash the program

The macOS build was failing because of random ggml asserts. Maybe something overflowed here too.

LoganDark commented 1 year ago

The macOS build was failing because of random ggml asserts. Maybe something overflowed here too.

Those are unrelated

LoganDark commented 1 year ago

For the record, it doesn't make me very happy to discard the files I used while developing this branch. Doing so deprives future developers of the tests they would need to redo the work I've done. But there isn't a framework for those kinds of tests (except maybe the tests directory), since they can't use the tiny 660K models; they need to be run against real, large models.