LoganDark closed this pull request 1 year ago.
This PR currently depends on https://github.com/ggerganov/ggml/pull/229 due to relying on a single computation graph. This dependency will most likely go away.
No implementation advice is needed yet, there are still some low hanging fruit left for me to take care of.
This is just to get the news out that this is being worked on.
Just to be sure -- at line 910
struct ggml_tensor * x = ggml_get_rows(ctx, model.emb, token_index);
we will get a huge matrix of size sequence_length, n_embed?
Yes, width n_embed and height sequence_len
Huge matmul should be fast! (some say, I myself don't know)
OP includes a benchmark that shows 4x speedup on processing 30 tokens :)
Load model ... 1.397s
Serial mode to process 30 tokens ... 6.971s
Sequence mode to process 30 tokens ... 0.696s
Logits total diff = 0.00000
Logits identical = TRUE
Also, tests are passing? Pleasant surprise (I guess I no longer use the faulty ggml function).
Looks like enabling cuBLAS is what causes the issue, even if no tensors are offloaded/uploaded to the GPU.
Offloading tensors doesn't solve the problem.
This looks like a ggml issue. Maybe I will try to debug ggml and open a bug report.
The macOS fail is probably a false positive; it works fine on my Mac, but I need someone on Apple Silicon to test.
I think this pull request is ready for a review (not merge yet)
There are still probably some touch-ups I need to do, but the code is almost clean enough to be production-ready, I think.
It does not support anywhere near 25k tokens (in fact, going too far above 64 will probably crash upstream builds of ggml), but that can be improved in the future.
The macOS runner is still being deterministically terrible. Sigh.
Great PR! Aside from some nits, concerns are:
I'll test sanitizer builds on a new temporary branch, logan-dark-sequence-mode, since I have no permission to push workflow changes directly into this PR.
Lol, just adding -DRWKV_SANITIZE_ADDRESS=ON to the build command line fixes the issue, and gives literally zero useful info.
I guess I may as well leave it on since it seems to work fine on my Mac. Real Macs probably have AVX2 and FMA.
That's an option, but I have another idea I will try tomorrow. I will literally insert a macro at each line that prints the line number, and see on what line it stops executing. I did that before; it is a crude way, but it works.
Then, when the line is known, I can try to rewrite that part.
@LoganDark I've tried to debug it, but I have no ideas and/or willingness to go further. Looks like a compiler bug, IDK...
Please change the macOS build command in build.yml to cmake -DRWKV_AVX2=OFF -DRWKV_FMA=OFF -DRWKV_SANITIZE_ADDRESS=ON .., also adding a comment above it: "Sanitizer is enabled to fix issues discovered when testing #89. It needs to be disabled as soon as possible (that is, as soon as master can be built on the macOS GitHub runner again)". I'll approve the workflow.
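A sketch of what the resulting build commands might look like; only the cmake flags and the comment text come from the request above, the surrounding steps are assumptions about how build.yml invokes cmake:

```shell
# Hypothetical macOS build step for .github/workflows/build.yml.
mkdir build
cd build
# Sanitizer is enabled to fix issues discovered when testing #89. It needs
# to be disabled as soon as possible (that is, as soon as master can be
# built on the macOS GitHub runner again).
cmake -DRWKV_AVX2=OFF -DRWKV_FMA=OFF -DRWKV_SANITIZE_ADDRESS=ON ..
cmake --build . --config Release
```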
Oops, closed accidentally.
@LoganDark You've said the PR is ready for review, but not merge; is it still not ready to be merged?
I'll perform another pass over it personally and make sure. My main concern is that long sequence lengths build computation graphs so large that they cannot fit inside the ggml_cgraph struct; you'll hit ggml asserts at runtime that crash the program due to exceeding the node limit. I don't know if there's any way to fix this as long as ggml doesn't support iteration in cgraphs.
Yeah, 85238418d50d03d9b4f680641fd7d0057dd6be5a is what I missed, lmao!!
I think I'd be okay merging this since further improvements can be made in the future.
I am experimenting with making performance graphs that compare sequence mode to serial mode. Here is one in log scale
That drop around 32 tokens is so not an outlier:
I don't know what in the world cuBLAS is doing to achieve this, but I'm interested and also scared. lol
I have one last comment (if nothing more changes in this PR) about the sequence.c file, and then I will be ready to merge.
BTW,
you'll hit ggml asserts at runtime that crash the program
MacOS build was failing because of random ggml asserts. Maybe something overflowed here too.
Those are unrelated.
For the record, it doesn't make me very happy to discard the files I used while developing this branch. Doing so deprives future developers of the tests they would need to do the same work I've done. But there isn't a framework for those kinds of tests (except maybe the tests directory), since they can't use the tiny 660K models; they need to be tested on real, large models.
This is a prototype of sequence mode.
Load model ... 1.318s
Serial mode to process 30 tokens ... 2.116s
Sequence mode to process 30 tokens ... 0.509s
Logits total diff = 0.00000
Logits identical = TRUE
This is only for testing. It runs into precision and capacity limits at large lengths. The goal is to support sequences of up to 25k tokens.
It is also likely that the dedicated single-token functions should be brought back. Again, this is only a prototype.