Closed LoganDark closed 1 year ago
If you expect something to break, I strongly suggest adding new tests for new behaviors. tiny-rwkv
can be used as a test model, it's already in the repo.
so maintaining the fstat, fseek etc. defines is getting weird. Microsoft is special and refuses to offer library functions that aren't prefixed by an underscore. Apple is special and only offers fstat64, which works with the width of off_t (which you can only make 64-bit globally), but they removed all of that with Apple Silicon and only offer regular old fstat, which is 64-bit now. And Linux is at least straightforward, but has about 3 different methods (with no documentation as to which one is current) to enable what they call "Large File Support".
I have the resources to look into this, I have multiple versions of Windows, multiple installations of Linux and an Intel Mac (and I can make up for the ASi Macs easily). But it's mildly infuriating. I will just put in the effort now and try to get it all in one go
I'm super proud of how smooth it was to update ggml with the new type system, definitely seems to be working as intended.
OK, so here is what appears to be working as intended in this PR:
I am working on Q8_1 quantization right now (it is a somewhat special case; I hope it will be worth it)
So I successfully implemented a quantizer for it, but GGML actually does not support vec_dot_q for Q8_1 tensors, which means trying to run inference on such a model results in a NULL pointer dereference at runtime. https://github.com/ggerganov/ggml/issues/200
Anyway, I removed the data type as it's not usable, but left the quantizer in place so that the type can be re-enabled once GGML supports the proper operators to perform inference on a Q8_1 quantized model.
Q8_1 is not supported because it does not improve the accuracy compared to Q8_0 in any measurable way
Good to know, thanks. So there really is no reason to support it as an alternative to Q8_0?
Yup, no point. The existing Q8_1 in the code base is actually the same as Q8_0 but with precomputed sums in the s member as an optimization technique. It is used only for intermediate quantizations during inference. But it is not used to store model data
is it safe to assume that any quantized data type not implemented in ggml_quantize_chunk will never be usable for models?
@LoganDark Would be great to have summary of the changes in the PR desc, for easier review (it's okay to add it way later, when PR is out of draft status tho)
the most recent changes are preparing to add the seq mode. should that be in this PR or a separate one?
this PR might be ready, just have to verify the most recent commit (I need to verify the memory usage is correct and that it is compatible with all the different model sizes)
@LoganDark
should that be in this PR or a separate one?
I would very much prefer it :) Reviewing large PRs is difficult and leads to procrastination on them...
large PRs
sorry about refactoring so much of your code each PR, it's just so filled with opportunity!! (lol)
it's just so filled with opportunity !! (lol)
I see what you mean lol :) C/C++ is not my native language, so yeah, not the best code in the world...
looks like that is exactly the kind of reason I needed to test this PR (quantized models were failing to load because the memory estimate was too low; it did not account for ggml using a larger format for matrix multiplies)
wait I did that merge very wrongly ugh stupid cli typo
ok it looks like the new function with gpu layers has to be renamed OR I can just add a new function that offloads specified layers onto gpu. going to do the latter
2: Difference sum: nan
excuse me what
edit: oh. quantization issue, gimme a bit
going to mark this as ready for review now, I think it is ok
it looks like this even supports loading and inference on the new world models (with cuBLAS), that is funny and cool
The original post has been edited with a description which should show up in the commit if you squash when merging
Overall, I have no functional concerns. My concerns are about formatting and code style:
error messages must start with upper-case letter
this is not documented in the code style :( I thought it was the complete opposite
keep still relevant comments that were in previous code instead of removing them
I forgot to add them back, I didn't mean to remove them, I am sorry
It's great to see more features, but let's keep this PR as is to merge it more quickly
what do you mean merge it more quickly? you are reviewing it now, right?
you are reviewing it now, right?
Yes :) Looks like it's ready to be merged
Phew, what a relief~ The macOS build errors were making me anxious because I knew the fix was in here lol
This PR reworks the file parsing and the memory usage estimates to make the library a bit more efficient in those respects and also pave the way towards more interesting things like new types of loading (mmap, quantize on load, load from PTH). It also reworks the type system a bit to simplify changing it up if the file format ever needs another update.
The computation graph creation has been separated out and memory estimates for it have been added so that the perfect amount of memory can be allocated each time. And the mysterious computation graph work tensor has been understood well enough to add an accurate upper bound.
This is because I plan to implement sequence mode and having a tight understanding and control of memory usage is crucial for implementing it properly. I don't want to be wasteful on large or small models, or undershoot and crash the application.
Also, the quantization has been optimized a bit, and a cuBLAS offloading function has been added to preserve backwards compatibility: the signature of an existing function cannot change, so adding a new parameter to rwkv_init_from_file is a bad idea. This PR works with both the Raven family of models as well as the brand new World family, which honestly makes me really excited.
Try to keep up :)