RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

file parsing and memory usage optimization #74

Closed LoganDark closed 1 year ago

LoganDark commented 1 year ago

This PR reworks the file parsing and the memory usage estimates to make the library more efficient in those respects, and to pave the way towards more interesting things like new kinds of loading (mmap, quantize on load, load from PTH). It also reworks the type system a bit so that future file-format updates are simpler to make.

The computation graph creation has been separated out, and memory estimates for it have been added so that exactly the right amount of memory can be allocated each time. The previously mysterious computation graph work tensor is now understood well enough to give it an accurate upper bound.

This is because I plan to implement sequence mode, and a tight understanding and control of memory usage is crucial for implementing it properly. I don't want to be wasteful on large or small models, or undershoot and crash the application.
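
As a rough illustration of the measure-then-allocate idea (a sketch only; this PR computes its estimates up front, and the buffer size below is an arbitrary placeholder):

```c
// Sketch: build the graph once inside a deliberately oversized ggml context,
// ask ggml how much of that context was actually used, and use that number
// as the exact allocation size from then on.
struct ggml_init_params measure_params = {
    /*.mem_size   =*/ 64u * 1024 * 1024, // deliberately generous, measuring only
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ false,
};
struct ggml_context * measure_ctx = ggml_init(measure_params);

// ... build the computation graph here, exactly as it will be built for real ...

size_t needed = ggml_used_mem(measure_ctx); // exact memory the graph required
ggml_free(measure_ctx);
// `needed` can now be used as the mem_size of the context that is kept around.
```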

Also, the quantization has been optimized a bit, and a separate cuBLAS offloading function has been added in order to preserve backwards compatibility: an existing function's signature cannot change without breaking callers, so adding a new parameter to rwkv_init_from_file would be a bad idea.
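
For illustration, the backwards-compatible pattern looks roughly like this; the offloading call's exact name and signature are an assumption here, not a quote from the PR:

```c
// Sketch: the existing entry point keeps its signature, and GPU offloading is
// exposed as a separate call instead of a new parameter on rwkv_init_from_file.
struct rwkv_context * ctx = rwkv_init_from_file("model.bin", 4 /* n_threads */);

#ifdef GGML_USE_CUBLAS
// Hypothetical helper: offload the first 12 layers to the GPU via cuBLAS.
rwkv_gpu_offload_layers(ctx, 12);
#endif
```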

This PR works with both the Raven family of models and the brand-new World family, which honestly makes me really excited.

Try to keep up :)

saharNooby commented 1 year ago

If you expect something to break, I strongly suggest adding new tests for new behaviors. tiny-rwkv can be used as a test model; it's already in the repo.

LoganDark commented 1 year ago

so maintaining the fstat, fseek, etc. defines is getting weird. Microsoft is special and refuses to offer library functions that aren't prefixed with an underscore; Apple is special and only offers fstat64, which works with the width of off_t (which you can only make 64-bit globally), except they removed all of that with Apple Silicon and only offer regular old fstat, which is 64-bit now. Linux is at least straightforward, and has about 3 different methods (with no documentation as to which one is current) to enable what they call "Large File Support".

I have the resources to look into this: I have multiple versions of Windows, multiple installations of Linux, and an Intel Mac (and I can make up for the lack of Apple Silicon Macs easily). But it's mildly infuriating. I will just put in the effort now and try to get it all in one go.
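
For context, the kind of shim being wrestled with looks roughly like this (macro names are illustrative, not the ones in rwkv.cpp):

```c
// Sketch of a 64-bit file offset shim across the platforms discussed above.
#if defined(_WIN32)
    // MSVC only exposes the 64-bit variants with an underscore prefix.
    #define rwkv_fseek(file, offset, origin) _fseeki64(file, offset, origin)
    #define rwkv_ftell(file)                 _ftelli64(file)
#else
    // On macOS, and on Linux built with -D_FILE_OFFSET_BITS=64 ("Large File
    // Support"), fseeko/ftello operate on a 64-bit off_t.
    #define rwkv_fseek(file, offset, origin) fseeko(file, offset, origin)
    #define rwkv_ftell(file)                 ftello(file)
#endif
```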

LoganDark commented 1 year ago

I'm super proud of how smooth it was to update ggml with the new type system, definitely seems to be working as intended.

OK, so here is what appears to be working as intended in this PR:

I am working on Q8_1 quantization right now (it is a somewhat special case; I hope it will be worth it)

LoganDark commented 1 year ago

I am working on Q8_1 quantization right now (it is a somewhat special case; I hope it will be worth it)

So I successfully implemented a quantizer for it, but GGML actually does not support vec_dot_q for Q8_1 tensors, which means trying to run inference on such a model results in a NULL pointer dereference at runtime. https://github.com/ggerganov/ggml/issues/200

Anyway, I removed the data type as it's not usable, but left the quantizer in place so that the type can be re-enabled once GGML supports the proper operators to perform inference on a Q8_1 quantized model.

ggerganov commented 1 year ago

Q8_1 is not supported because it does not improve the accuracy compared to Q8_0 in any measurable way

LoganDark commented 1 year ago

Q8_1 is not supported because it does not improve the accuracy compared to Q8_0 in any measurable way

good to know, thanks. So there really is no reason to support it as an alternative to Q8_0?

ggerganov commented 1 year ago

Yup, no point. The existing Q8_1 in the code base is actually the same as Q8_0 but with precomputed sums in the s member as an optimization technique. It is used only for intermediate quantizations during inference. But it is not used to store model data
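
For concreteness, a sketch of the two block layouts as they looked in the ggml of that period (reconstructed from memory; treat exact field names and types as approximate):

```c
#define QK8_0 32
typedef struct {
    float  d;          // scale
    int8_t qs[QK8_0];  // quantized values
} block_q8_0;

#define QK8_1 32
typedef struct {
    float  d;          // scale
    float  s;          // d * sum(qs[i]), precomputed for the dot product
    int8_t qs[QK8_1];  // quantized values
} block_q8_1;
```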

LoganDark commented 1 year ago

Yup, no point. The existing Q8_1 in the code base is actually the same as Q8_0 but with precomputed sums in the s member as an optimization technique. It is used only for intermediate quantizations during inference. But it is not used to store model data

is it safe to assume that any quantized data type not implemented in ggml_quantize_chunk will never be usable for models?

saharNooby commented 1 year ago

@LoganDark Would be great to have summary of the changes in the PR desc, for easier review (it's okay to add it way later, when PR is out of draft status tho)

LoganDark commented 1 year ago

@LoganDark Would be great to have summary of the changes in the PR desc, for easier review (it's okay to add it way later, when PR is out of draft status tho)

the most recent changes are preparing to add the seq mode. should that be in this PR or a separate one?

this PR might be ready; I just have to check the most recent commit (verify that the memory usage is correct and that it is compatible with all the different model sizes)

saharNooby commented 1 year ago

@LoganDark

should that be in this PR or a separate one?

I would very much prefer a separate one :) Reviewing large PRs is difficult and leads to procrastination on them...

LoganDark commented 1 year ago

large PRs

sorry about refactoring so much of your code each PR, it's just so filled with opportunity!! (lol)

saharNooby commented 1 year ago

it's just so filled with opportunity!! (lol)

I see what you mean lol :) C/C++ is not my native language, so yeah, not the best code in the world...

LoganDark commented 1 year ago

looks like this is exactly why I needed to test this PR: quantized models were failing to load because the memory estimate was too low (it did not account for ggml using a larger intermediate format for matrix multiplies)
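
A sketch of the kind of accounting this implies (the helper below is illustrative; ggml_type_size and ggml_blck_size are real ggml functions):

```c
#include <stdint.h>
#include "ggml.h"

// Illustrative: bytes needed to hold n elements of a given ggml type.
static size_t rwkv_tensor_bytes(enum ggml_type type, int64_t n) {
    return ggml_type_size(type) * (size_t) (n / ggml_blck_size(type));
}

// When sizing scratch memory for a matrix multiply involving a quantized
// tensor, ggml may use a larger intermediate quantized format (Q8_1 in the
// ggml of that era, depending on the weight type) than the tensor's own type,
// so the estimate has to be based on the intermediate format's size rather
// than the tensor's element size, e.g.
// rwkv_tensor_bytes(GGML_TYPE_Q8_1, n_rows * n_cols).
```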

LoganDark commented 1 year ago

wait, I did that merge very wrong, ugh, stupid CLI typo

LoganDark commented 1 year ago

ok, it looks like the new function with GPU layers either has to be renamed, OR I can just add a new function that offloads the specified layers onto the GPU. going to do the latter

LoganDark commented 1 year ago

2: Difference sum: nan

excuse me what

edit: oh. quantization issue, gimme a bit

LoganDark commented 1 year ago

going to mark this as ready for review now, I think it is ok

LoganDark commented 1 year ago

it looks like this even supports loading and inference on the new World models (with cuBLAS); that is funny and cool

LoganDark commented 1 year ago

The original post has been edited with a description, which should show up in the commit message if you squash when merging.

saharNooby commented 1 year ago

Overall, I have no functional concerns. My concerns are about formatting and code style:

LoganDark commented 1 year ago

error messages must start with upper-case letter

this is not documented in the code style :( I thought it was the complete opposite

keep still relevant comments that were in previous code instead of removing them

I forgot to add them back, I didn't mean to remove them, I am sorry

saharNooby commented 1 year ago

Name the state parts

It's great to see more features, but let's keep this PR as is to merge it more quickly

LoganDark commented 1 year ago

Name the state parts

It's great to see more features, but let's keep this PR as is to merge it more quickly

what do you mean, merge it more quickly? you are reviewing it now, right?

saharNooby commented 1 year ago

you are reviewing it now, right?

Yes :) Looks like it's ready to be merged

LoganDark commented 1 year ago

Phew, what a relief~ The macOS build errors were making me anxious because I knew the fix was in here lol