After downloading, place the model into the assets folder.
Remember to decompress the *.so zip file stored in the libs/arm64-v8a folder.
The demo models were converted from HuggingFace or ModelScope and were code-optimized for maximum execution speed.
As a result, the inputs & outputs of the demo models differ slightly from those of the original models.
To better adapt to ONNX Runtime on Android, the export does not use dynamic axes. As a result, the exported ONNX model may not be optimal for x86_64 systems.
The tokenizer.cpp and tokenizer.hpp files originated from the mnn-llm repository.
To export the model yourself, go to the 'Export_ONNX' folder, follow the comments to set the folder paths, and then run the ***_Export.py script. Next, quantize / optimize the ONNX model yourself.
If you use onnxruntime.tools.convert_onnx_models_to_ort to convert the model to the *.ort format, note that it automatically adds Cast operators that change fp16 multiplication to fp32.
The quantization methods for the models can be found in the folder "Do_Quantize".
The q4 (uint4) quantization method is currently not recommended, because the "MatMulNBits" operator in ONNX Runtime performs poorly.