NolanoOrg / cformers

SoTA Transformers with C-backend for fast inference on your CPU.

Towards a C++ library #36

Open · A2va opened this issue 1 year ago

A2va commented 1 year ago

From the development roadmap ideas:

Restructure the codebase for reuse. Switch to Pybind11 rather than Subprocess - expected speedup: 3-4x

I'm particularly interested in this project becoming a C++ library, which would let multiple projects reuse the code. It's worth mentioning CTranslate2 here, a C++ library aimed mainly at translation transformers that also does text generation with BLOOM or OPT.

Anyway, I find the current project structure not very practical for this. So I propose moving all the C files into src and include directories at the root of the repo, which would simplify using and compiling the C backend; a sketch of the layout is below.
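
Something like this (the directory names come from the proposal; the exact file split is illustrative):

```
cformers/
├── include/    # public headers for downstream projects
├── src/        # C/C++ implementation files
└── quantize/   # standalone tools, e.g. quantize_bloom.cpp
```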

Currently, the model is loaded every time a prompt is submitted, which slows down the process. So instead of invoking an executable program, an API could be exposed through pybind11 to improve performance. That API might look something like this:

```
bloom_load(model_path)  -> load the weights, return a pointer to the model context
bloom_eval(ctx_pointer) -> run inference, return the output
bloom_free(ctx_pointer) -> free the model context
```
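
As a rough sketch of how those three calls could be exposed through pybind11 (the module name, context type, and stub bodies below are my assumptions, not existing cformers code):

```cpp
// Minimal pybind11 sketch of the proposed API (illustrative only).
#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Stand-in for the real model context; in cformers this would hold
// the loaded weights and the ggml state.
struct bloom_ctx {
    std::string model_path;
};

bloom_ctx *bloom_load(const std::string &model_path) {
    // Real code would read/mmap the weight file here.
    return new bloom_ctx{model_path};
}

std::string bloom_eval(bloom_ctx *ctx, const std::string &prompt) {
    // Real code would tokenize the prompt and run the transformer.
    return "output for: " + prompt;
}

void bloom_free(bloom_ctx *ctx) {
    delete ctx;  // real code would also free any ggml buffers
}

PYBIND11_MODULE(cformers_cpp, m) {
    py::class_<bloom_ctx>(m, "BloomContext");
    m.def("bloom_load", &bloom_load, py::return_value_policy::reference,
          "Load the weights once and return a model context");
    m.def("bloom_eval", &bloom_eval,
          "Run inference on a prompt with an already-loaded context");
    m.def("bloom_free", &bloom_free, "Free the model context");
}
```

From Python, the model would then be loaded once with `ctx = cformers_cpp.bloom_load("model.bin")` and `bloom_eval(ctx, prompt)` could be called repeatedly, paying the loading cost only once.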
Ayushk4 commented 1 year ago

I 100% agree with this. This is also what I intend for this project to be.

Loading once should be enough, and there must be an option to keep the key & value cache to avoid re-computation in multi-turn (chat-style) mode.
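
To illustrate the reuse (a hypothetical helper, not cformers code; llama.cpp expresses the same idea through the `n_past` argument of its eval function): in a chat, each turn extends the previous token sequence, so only the tokens after the longest shared prefix need to be evaluated.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical helper (assumption, not cformers code): tokens that match
// the previous turn's prefix can keep their cached keys/values; only the
// tokens after that prefix need to be re-evaluated.
int reusable_prefix(const std::vector<int> &prev, const std::vector<int> &cur) {
    size_t n = std::min(prev.size(), cur.size());
    size_t i = 0;
    while (i < n && prev[i] == cur[i]) ++i;
    return static_cast<int>(i);  // pass this as the "already evaluated" count
}

int main() {
    std::vector<int> turn1 = {1, 42, 7, 99};        // tokens of the first turn
    std::vector<int> turn2 = {1, 42, 7, 99, 5, 12}; // the chat only grows
    std::printf("tokens reusable from the cache: %d\n",
                reusable_prefix(turn1, turn2));     // prints 4
}
```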

A2va commented 1 year ago

A lot has happened on the llama.cpp repo in the meantime.

Currently, the cformers repo has only one Makefile for building, which works only on POSIX systems. We could add a CMakeLists.txt like in the llama.cpp repo, but that means maintaining separate build files for different OSes, which I don't find very practical. I have used XMake for some time; it's an alternative to CMake with Lua scripting.

Small example (already working with cformers code):

```lua
add_rules("mode.debug", "mode.release")
set_languages("cxx11", "c11")

-- the library target; "$(kind)" lets the user pick static or shared at configure time
target("cformers")
    set_kind("$(kind)")
    set_default(true)

    add_files("src/**.cpp")
    add_files("src/**.c")

    if is_plat("linux") then
        add_syslinks("pthread")
        add_cflags("-D_POSIX_C_SOURCE=199309L")
    end

    add_headerfiles("include/**.h")
    add_includedirs("include", {public = true})

-- a standalone binary linked against the library above
target("quantize_bloom")
    set_kind("binary")
    add_files("quantize/quantize_bloom.cpp")
    add_deps("cformers")
```

The quantization program can then be run with `xmake run quantize_bloom arg1 arg2`; you do not need to invoke it from the executable's location.

It has another advantage: it can consume packages (700+ on xrepo). I noticed that llama.cpp supports OpenBLAS, so with xmake that could look like this:

```lua
-- fetch OpenBLAS from xrepo and link it into the target
add_requires("openblas")

target("ggml")
    set_kind("static")
    add_packages("openblas")
```

I know it can be difficult to start with a new tool, but I feel it's easier to get started with than CMake, and it's really a pleasure to work with. I have a 170-line setup script that downloads some models, converts them to their C++ format, installs Python, ...

What do you think?

mallorbc commented 1 year ago

This project seems to use Python bindings so that the model does not have to be loaded into memory each time. Taking inspiration from the work there may be a good idea: https://github.com/nomic-ai/pyllamacpp