UbiquitousLearning / mllm

Fast Multimodal LLM on Mobile Devices
https://ubiquitouslearning.github.io/mllm_website
MIT License

Xnnpack backend support #159

Closed chenghuaWang closed 3 weeks ago

chenghuaWang commented 1 month ago

!!! Do not merge until the xnnpack backend llama is runnable !!!

How to use xnnpack backend in mllm

The xnnpack backend in MLLM provides a convenience wrapper, wrap2xnn, that converts a standard CPU-based MLLM module into one that runs on the xnnpack backend. It takes inputs_nums and outputs_nums, followed by any arguments required to construct the wrapped module (here, LinearModule). The example below illustrates the usage:

E.g.:

// A plain CPU-style MLLM module; wrap2xnn converts it to an xnnpack-backed one.
class LinearModule : public Module {
    Layer linear;

public:
    LinearModule() {
        // in_features=1024, out_features=2048, bias=true
        linear = Linear(1024, 2048, true, "linear");
    }

    vector<Tensor> Forward(vector<Tensor> inputs, vector<std::any> args) override {
        auto x = inputs[0];
        auto out = linear(x);
        return {out};
    }
};

TEST(XpLinearTest, LinearModule) {
    mllm::xnnpack::Log::log_level = mllm::xnnpack::Log::ERROR;

    // Wrap the CPU module into an xnnpack-backed one: 1 input, 1 output.
    auto model = ::mllm::xnnpack::wrap2xnn<LinearModule>(1, 1);
    model.setNoLoadWeightsDtype(DataType::MLLM_TYPE_F32);

    EXPECT_TRUE(Backend::global_backends[MLLM_XNNPACK] != nullptr);

    // Input tensor of shape [batch=1, head=1, sequence=256, dimension=1024],
    // allocated on the xnnpack backend.
    Tensor x(1, 1, 256, 1024, Backend::global_backends[MLLM_XNNPACK], true);
    x.setTtype(TensorType::INPUT_TENSOR);

    for (int i = 0; i < 256 * 1024; ++i) {
        *(x.hostPtr<float>() + i) = 1024.f;
    }

    auto out = model({x})[0];

    // No weights are loaded, so every output element is expected to be (near) zero.
    for (int i = 0; i < 256 * 2048; ++i) {
        EXPECT_TRUE(*(out.hostPtr<float>() + i) < 1e-18);
    }

    out.printShape();
}

Unlike MLLM's dynamic graph mode, xnnpack operates on a static graph, so a mechanism is needed to convert the dynamic graph into a static one. The xnnpack backend wrapper in MLLM adds several layers around the LinearModule to register external input and external output tensors. The final wrapped module is shown in the following pseudocode:

Layer: Direct(Direct::ExternalInput)
Module: LinearModule()
Layer: Direct(Direct::ExternalOutput)
Layer: Dispatch()
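
For intuition, below is a minimal sketch of what the wrapper generated by wrap2xnn might look like. Only the layer ordering mirrors the pseudocode above; the constructor signatures for Direct and Dispatch are illustrative assumptions, not mllm's actual API:

// Sketch only: illustrates the layer ordering from the pseudocode above.
template <typename OriginModule>
class XnnWrapperModule : public Module {
    Layer input_direct;   // marks inputs as xnnpack external-input tensors
    OriginModule origin;  // the user's CPU-style module, e.g. LinearModule
    Layer output_direct;  // marks results as xnnpack external-output tensors
    Layer dispatch;       // kicks off execution of the built static subgraph

public:
    template <typename... Args>
    explicit XnnWrapperModule(Args &&...args) :
        origin(std::forward<Args>(args)...) {
        // Direct/Dispatch construction is assumed for illustration.
        input_direct = Direct(Direct::ExternalInput, "wrapper.input_direct");
        output_direct = Direct(Direct::ExternalOutput, "wrapper.output_direct");
        dispatch = Dispatch("wrapper.dispatch");
    }

    vector<Tensor> Forward(vector<Tensor> inputs, vector<std::any> args) override {
        auto x = input_direct(inputs[0]);
        x = origin.Forward({x}, args)[0];
        x = output_direct(x);
        x = dispatch(x);
        return {x};
    }
};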

You can find more use cases at https://github.com/chenghuaWang/mllm/blob/main/test/xnnpack/

How are the operators in MLLM's xnnpack backend implemented?

Take the XpAdd op as an example:

XpAdd's reshape function is identical to CPUAdd's; the main differences lie in the setUp and execute functions.

When execute is called, XpAdd inserts a static graph node into the xnnpack subgraph. During setUp, however, XpAdd performs no work: the setUp stage is reserved for the XpDirect op to determine whether each tensor is an external input, an external output, or a regular tensor.

ErrorCode XpAdd::execute(vector<shared_ptr<Tensor>> inputs, vector<shared_ptr<Tensor>> outputs) {
    auto xpb = (XnnpackBackend *)backend();

    // Register all tensors in the xnnpack subgraph (external input/output or regular).
    tryDefineAllXpTensors(xpb, inputs);
    tryDefineAllXpTensors(xpb, outputs);

    // Define the xnnpack add node; tensors are referenced by their subgraph uuids.
    auto status = xnn_define_binary(
        xpb->getXnnSubgraph(),
        xnn_binary_add,
        /*params=*/nullptr,
        inputs[0]->uuid(),
        inputs[1]->uuid(),
        outputs[0]->uuid(),
        /*flags=*/0);

    if (status != xnn_status_success) {
        Log::error("XpAdd::execute Error");
        exit(-1);
    }

    return MLLM_NO_ERROR;
}
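
For completeness, tryDefineAllXpTensors ultimately has to call xnnpack's xnn_define_tensor_value for each tensor not yet part of the subgraph. Below is a minimal sketch of defining one f32 tensor: the helper name defineXpTensor, the dimension accessors, and setUuid are assumptions about mllm's internals, while xnn_define_tensor_value and the XNN_VALUE_FLAG_* flags are xnnpack's real API:

// Sketch: register a single f32 tensor in the xnnpack subgraph.
void defineXpTensor(XnnpackBackend *xpb, shared_ptr<Tensor> &t,
                    bool is_external_input, bool is_external_output) {
    uint32_t flags = 0;
    if (is_external_input) flags |= XNN_VALUE_FLAG_EXTERNAL_INPUT;
    if (is_external_output) flags |= XNN_VALUE_FLAG_EXTERNAL_OUTPUT;

    // Assumes a 4-D mllm tensor laid out as [batch, head, sequence, dimension].
    size_t dims[4] = {(size_t)t->batch(), (size_t)t->head(),
                      (size_t)t->sequence(), (size_t)t->dimension()};

    uint32_t uuid = XNN_INVALID_VALUE_ID;
    auto status = xnn_define_tensor_value(
        xpb->getXnnSubgraph(),
        xnn_datatype_fp32,
        4, dims,
        /*data=*/nullptr,                     // external data is bound at runtime
        /*external_id=*/XNN_INVALID_VALUE_ID, // let xnnpack assign the value id
        flags,
        &uuid);

    if (status != xnn_status_success) {
        Log::error("defineXpTensor Error");
        exit(-1);
    }

    t->setUuid(uuid); // assumed setter; XpAdd later reads it via t->uuid()
}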