NexaAI / Awesome-LLMs-on-device

Awesome LLMs on Device: A Comprehensive Survey
MIT License

🚀 Awesome LLMs on Device: A Must-Read Comprehensive Hub by Nexa AI

[![Discord](https://dcbadge.limes.pink/api/server/thRu2HaK4D?style=flat&compact=true)](https://discord.gg/thRu2HaK4D)

[On-device Model Hub](https://model-hub.nexa4ai.com/) / [Nexa SDK Documentation](https://docs.nexaai.com/)
Summary of On-device LLMs’ Evolution

🌟 About This Hub

Welcome to the ultimate hub for on-device Large Language Models (LLMs)! This repository is your go-to resource for all things related to LLMs designed for on-device deployment. Whether you're a seasoned researcher, an innovative developer, or an enthusiastic learner, this comprehensive collection of cutting-edge knowledge is your gateway to understanding, leveraging, and contributing to the exciting world of on-device LLMs.

🚀 Why This Hub is a Must-Read

📚 What's Inside Our Hub

Foundations and Preliminaries

Evolution of On-Device LLMs

LLM Architecture Foundations

On-Device LLMs Training

Limitations of Cloud-Based LLM Inference and Advantages of On-Device Inference

The Performance Indicator of On-Device LLMs

Efficient Architectures for On-Device LLMs

| Model | Performance | Computational Efficiency | Memory Requirements |
|---|---|---|---|
| MobileLLM | High accuracy, optimized for sub-billion parameter models | Embedding sharing, grouped-query attention | Reduced model size due to deep and thin structures |
| EdgeShard | Up to 50% latency reduction, 2× throughput improvement | Collaborative edge-cloud computing, optimal shard placement | Distributed model components reduce individual device load |
| LLMCad | Up to 9.3× speedup in token generation | Generate-then-verify, token tree generation | Smaller LLM for token generation, larger LLM for verification |
| Any-Precision LLM | Supports multiple precisions efficiently | Post-training quantization, memory-efficient design | Substantial memory savings with versatile model precisions |
| Breakthrough Memory | Up to 4.5× performance improvement | PIM and PNM technologies enhance memory processing | Enhanced memory bandwidth and capacity |
| MELTing Point | Provides systematic performance evaluation | Analyzes impacts of quantization, efficient model evaluation | Evaluates memory and computational efficiency trade-offs |
| LLMaaS on device | Reduces context switching latency significantly | Stateful execution, fine-grained KV cache compression | Efficient memory management with tolerance-aware compression and swapping |
| LocMoE | Reduces training time per epoch by up to 22.24% | Orthogonal gating weights, locality-based expert regularization | Minimizes communication overhead with group-wise All-to-All and recompute pipeline |
| EdgeMoE | Significant performance improvements on edge devices | Expert-wise bitwidth adaptation, preloading experts | Efficient memory management through expert-by-expert computation reordering |
| JetMoE | Outperforms Llama2-7B and 13B-Chat with fewer parameters | Reduces inference computation by 70% using sparse activation | 8B total parameters, only 2B activated per input token |
| Pangu-$\pi$ Pro | Neural architecture, parameter initialization, and optimization strategy for billion-level parameter models | Embedding sharing, tokenizer compression | Reduced model size via architecture tweaking |
| Zamba2 | 2× faster time-to-first-token, a 27% reduction in memory overhead, and 1.29× lower generation latency compared to Phi3-3.8B | Hybrid Mamba2/Attention architecture and shared transformer block | 2.7B parameters, fewer KV-states due to reduced attention |
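
The JetMoE row above mentions sparse activation, where only a fraction of the total parameters run for each input token. A minimal sketch of generic top-k expert routing can make that idea concrete. This is not JetMoE's actual implementation: the function names (`top_k_route`, `moe_forward`), the toy scalar "experts", and the choice of k = 2 are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Pick the k experts with the highest gate scores.

    Returns (expert_index, weight) pairs; weights are renormalized
    over the selected experts only, so they sum to 1.
    """
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs.

    Experts that are not selected are never called, which is where
    the compute saving of sparse activation comes from.
    """
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Hypothetical toy experts: each one just scales its input.
experts = [lambda x: 1 * x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
output = moe_forward(1.0, experts, gate_logits=[0.1, 2.0, 0.3, 2.0], k=2)
```

With four experts and k = 2, half of the expert parameters stay idle for this token. Real MoE layers apply the same routing per token over feed-forward blocks rather than scalar functions.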

Model Compression and Parameter Sharing

Collaborative and Hierarchical Model Approaches

Memory and Computational Efficiency

Mixture-of-Experts (MoE) Architectures

Hybrid Architectures

General Efficiency and Performance Improvements

Model Compression and Optimization Techniques for On-Device LLMs

Quantization
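
As a rough illustration of the idea behind this section, the sketch below shows symmetric per-tensor int8 quantization in plain Python. It is a simplified assumption-laden example, not the method of any specific paper listed in this hub: the function names and the single shared scale are illustrative choices, and production schemes typically use per-channel or group-wise scales.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.03, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight is stored in 8 bits instead of 32, and the round-trip error per weight is bounded by about half the scale.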

Pruning
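
One common form of pruning is unstructured magnitude pruning, which zeroes the weights with the smallest absolute values. The sketch below is a minimal illustration under that assumption. The function name and the flat-list representation are hypothetical, and the papers in this hub may use structured or more elaborate criteria.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights.

    `sparsity` is the fraction of weights to remove, between 0 and 1.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


pruned = magnitude_prune([0.5, -0.1, 0.02, -0.9], sparsity=0.5)
```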

Knowledge Distillation
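
A widely used formulation of knowledge distillation trains a small student to match the temperature-softened output distribution of a larger teacher. The sketch below computes that soft-label loss for a single example. It is an illustrative assumption rather than the recipe of any specific work in this hub, and the function names and default temperature of 2.0 are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across
    temperatures, following the common convention.
    """
    teacher = softmax_with_temperature(teacher_logits, temperature)
    student = softmax_with_temperature(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2


matched = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
mismatched = distillation_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice it is usually combined with a standard cross-entropy term on the ground-truth labels.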

Low-Rank Factorization
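
Low-rank factorization replaces an m × n weight matrix W with the product of an m × r matrix A and an r × n matrix B, where r is much smaller than m and n. The sketch below counts the parameter savings and applies the factored form to a vector. The helper names and the 4096 × 4096, rank 64 example are assumptions chosen for illustration only.

```python
def dense_params(m, n):
    """Parameter count of a full m x n weight matrix."""
    return m * n


def low_rank_params(m, n, r):
    """Parameter count of A (m x r) plus B (r x n)."""
    return r * (m + n)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def apply_low_rank(a, b, x):
    """Compute A @ (B @ x) without ever forming the full matrix A @ B."""
    return matvec(a, matvec(b, x))


ratio = dense_params(4096, 4096) / low_rank_params(4096, 4096, 64)

# Tiny rank-1 example: A is 2 x 1, B is 1 x 2.
y = apply_low_rank([[1.0], [2.0]], [[3.0, 4.0]], [1.0, 1.0])
```

For the 4096 × 4096 layer at rank 64, the factored form stores 524,288 parameters instead of 16,777,216. How much accuracy survives depends on how well W is approximated at that rank.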

Hardware Acceleration and Deployment Strategies

Popular On-Device LLMs Framework

Hardware Acceleration

Applications

Model Reference

| Model | Institute | Paper |
|---|---|---|
| Gemini Nano | Google | Gemini: A Family of Highly Capable Multimodal Models |
| Octopus series model | Nexa AI | Octopus v2: On-device language model for super agent<br>Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent<br>Octopus v4: Graph of language models<br>Octopus: On-device language model for function calling of software APIs |
| OpenELM and Ferret-v2 | Apple | OpenELM is a significant large language model integrated within iOS to enhance application functionalities.<br>Ferret-v2 significantly improves upon its predecessor, introducing enhanced visual processing capabilities and an advanced training regimen. |
| Phi series | Microsoft | Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone |
| MiniCPM | Tsinghua University | A GPT-4V Level Multimodal LLM on Your Phone |
| Gemma2-9B | Google | Gemma 2: Improving Open Language Models at a Practical Size |
| Qwen2-0.5B | Alibaba Group | Qwen Technical Report |

Tutorials and Learning Resources

🀝 Join the On-Device LLM Revolution

We believe in the power of community! If you're passionate about on-device AI and want to contribute to this ever-growing knowledge hub, here's how you can get involved:

  1. Fork the repository
  2. Create a new branch for your brilliant additions
  3. Make your updates and push your changes
  4. Submit a pull request and become part of the on-device LLM movement

⭐ Star History ⭐

Star History Chart

📖 Cite Our Work

If our hub fuels your research or powers your projects, we'd be thrilled if you could cite our paper here:

@article{xu2024device,
  title={On-Device Language Models: A Comprehensive Review},
  author={Xu, Jiajun and Li, Zhiyuan and Chen, Wei and Wang, Qun and Gao, Xin and Cai, Qi and Ling, Ziyuan},
  journal={arXiv preprint arXiv:2409.00088},
  year={2024}
}

📄 License

This project is open-source and available under the MIT License. See the LICENSE file for more details.

Don't just read about the future of AI – be part of it. Star this repo, spread the word, and let's push the boundaries of on-device LLMs together! 🚀🌟