VincyZhang / intel-extension-for-transformers

Extends the Hugging Face transformers APIs for Transformer-based models and improves the productivity of inference deployment. With extremely compressed models, the toolkit can greatly improve inference efficiency on Intel platforms.
Apache License 2.0
Intel® Extension for Transformers
===========================

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

[![](https://dcbadge.vercel.app/api/server/Wxk3J3ZJkU?compact=true&style=flat-square)](https://discord.gg/Wxk3J3ZJkU) [![Release Notes](https://img.shields.io/github/v/release/intel/intel-extension-for-transformers)](https://github.com/intel/intel-extension-for-transformers/releases)

[🏭Architecture](./docs/architecture.md) | [💬NeuralChat](./intel_extension_for_transformers/neural_chat) | [😃Inference on CPU](https://github.com/intel/neural-speed/tree/main) | [😃Inference on GPU](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu) | [💻Examples](./docs/examples.md) | [📖Documentation](https://intel.github.io/intel-extension-for-transformers/latest/docs/Welcome.html)



## 🏃Installation
### Quick Install from PyPI
```bash
pip install intel-extension-for-transformers
```
> For more installation methods, please refer to the [Installation Page](./docs/installation.md).

## 🌟Introduction
Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the following key features and examples:

* Seamless user experience of model compression on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor)

* Advanced software optimizations and a unique compression-aware runtime (released with the NeurIPS 2022 papers [Fast Distilbert on CPUs](https://arxiv.org/abs/2211.07715) and [QuaLA-MiniLM: a Quantized Length Adaptive MiniLM](https://arxiv.org/abs/2210.17114), and the NeurIPS 2021 paper [Prune Once for All: Sparse Pre-Trained Language Models](https://arxiv.org/abs/2111.05754))

* Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa)

* [NeuralChat](intel_extension_for_transformers/neural_chat), a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of [plugins](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/advanced_features.md) such as [Knowledge Retrieval](./intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md), [Speech Interaction](./intel_extension_for_transformers/neural_chat/pipeline/plugins/audio/README.md), [Query Caching](./intel_extension_for_transformers/neural_chat/pipeline/plugins/caching/README.md), and [Security Guardrail](./intel_extension_for_transformers/neural_chat/pipeline/plugins/security/README.md). This framework supports Intel Gaudi2/CPU/GPU.

* [Inference](https://github.com/intel/neural-speed/tree/main) of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels for Intel CPU and Intel GPU (TBD), supporting [GPT-NEOX](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox), [LLAMA](https://github.com/intel/neural-speed/tree/main/neural_speed/models/llama), [MPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/mpt), [FALCON](https://github.com/intel/neural-speed/tree/main/neural_speed/models/falcon), [BLOOM-7B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/bloom), [OPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/opt), [ChatGLM2-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/chatglm), [GPT-J-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptj), and [Dolly-v2-3B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox). Supports the AMX, VNNI, AVX512F, and AVX2 instruction sets. We've boosted the performance of Intel CPUs, with a particular focus on the 4th generation Intel Xeon Scalable processor, codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html).

## 🔓Validated Hardware

| Hardware | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| Intel Gaudi2 | ✔ | ✔ | WIP (FP8) | - |
| Intel Xeon Scalable Processors | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Xeon CPU Max Series | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Data Center GPU Max Series | WIP | WIP | WIP (INT8) | ✔ (INT4) |
| Intel Arc A-Series | - | - | WIP (INT8) | ✔ (INT4) |
| Intel Core Processors | - | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |

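
The Introduction names AMX, VNNI, AVX512F, and AVX2 as the instruction sets the CPU kernels support. As a rough way to see which of these a Linux machine reports, the sketch below parses the `flags` line of `/proc/cpuinfo`. It is not part of the toolkit: `detect_isa` and `ISA_FLAGS` are illustrative names, and the mapping assumes the flag spellings used by the Linux kernel.

```python
from pathlib import Path

# Assumed mapping from instruction set to Linux /proc/cpuinfo flag names.
ISA_FLAGS = {
    "AMX": {"amx_tile", "amx_int8", "amx_bf16"},
    "VNNI": {"avx512_vnni", "avx_vnni"},
    "AVX512F": {"avx512f"},
    "AVX2": {"avx2"},
}


def detect_isa(cpuinfo_text):
    """Return {instruction set: True if any of its flags is listed}."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # x86 kernels list CPU features on lines such as "flags : fpu vme ..."
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return {isa: bool(names & flags) for isa, names in ISA_FLAGS.items()}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for isa, present in detect_isa(cpuinfo.read_text()).items():
            print(f"{isa}: {'yes' if present else 'no'}")
    else:
        print("No /proc/cpuinfo found; this sketch only covers Linux.")
```

A reported flag only shows what the CPU and kernel expose; it does not by itself confirm that a given kernel path is used at runtime.
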
> In the table above, "-" means not applicable or not started yet.

## Validated Software

| Software | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| PyTorch | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) |
| Intel® Extension for PyTorch | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu |
| Transformers | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) |
| Synapse AI | 1.13.0 | 1.13.0 | 1.13.0 | 1.13.0 |
| Gaudi2 driver | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 |
| intel-level-zero-gpu | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 |

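
To compare a local Python environment against the versions in the table, a small script can print what is installed. This is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper, and the distribution names in `PACKAGES` are assumed to match the names used on PyPI; system-level components such as drivers are not covered.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the Python packages in the table.
PACKAGES = ["torch", "intel-extension-for-pytorch", "transformers"]


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```
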
> Please refer to the detailed requirements in [CPU](intel_extension_for_transformers/neural_chat/requirements_cpu.txt), [Gaudi2](intel_extension_for_transformers/neural_chat/requirements_hpu.txt), [Intel GPU](https://github.com/intel/intel-extension-for-transformers/blob/main/requirements-gpu.txt).

## 🌱Getting Started

### Chatbot
Below is sample code to create your chatbot. See more [examples](intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).

#### Serving (OpenAI-compatible RESTful APIs)
NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for OpenAI APIs.
You can start the NeuralChat server with either a shell command or Python code.

```shell
# Shell Command
neuralchat_server start --config_file ./server/config/neuralchat.yaml
```

```python
# Python Code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")
```

The NeuralChat service is accessible through the [OpenAI client library](https://github.com/openai/openai-python), `curl` commands, and the `requests` library. See more in [NeuralChat](intel_extension_for_transformers/neural_chat/README.md).

#### Offline

```python
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```

### Transformers-based extension APIs
Below is sample code to use the extended Transformers APIs. See more [examples](https://github.com/intel/neural-speed/tree/main).

#### INT4 Inference (CPU)
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit=True requests a 4-bit model at load time
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
```

You can also load a low-bit model quantized with the GPTQ, AWQ, RTN, or AutoRound algorithm.
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

# Download a Hugging Face GPTQ/AWQ model or use a locally quantized model
model_name = "PATH_TO_MODEL"  # local path to model
woq_config = WeightOnlyQuantConfig(use_gptq=True)  # use_awq=True for AWQ; use_autoround=True for AutoRound
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)
outputs = model.generate(inputs)
```

#### INT4 Inference (GPU)
```python
import torch  # needed for torch.float16 below
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device_map = "xpu"
model_name = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             device_map=device_map, load_in_4bit=True)

model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)

output = model.generate(inputs)
```

> Note: Please refer to the [example](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu) and [script](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py) for more details.

### Langchain-based extension APIs
Below is sample code to use the extended Langchain APIs. See more [examples](intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md).

```python
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma

retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
```

## 🎯Validated Models
You can find the validated models, along with their accuracy and performance, in the [release data](./docs/release_data.md) or the [Medium blog](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176).

## 📖Documentation

| Category | Documents |
|---|---|
| Overview | NeuralChat, Neural Speed |
| NeuralChat | Chatbot on Intel CPU, Chatbot on Intel GPU, Chatbot on Gaudi, Chatbot on Client, More Notebooks |
| Neural Speed | Neural Speed, Streaming LLM, Low Precision Kernels, Tensor Parallelism |
| LLM Compression | SmoothQuant (INT8), Weight-only Quantization (INT4/FP4/NF4/INT8), QLoRA on CPU |
| General Compression | Quantization, Pruning, Distillation, Orchestration, Neural Architecture Search, Export, Metrics, Objectives, Pipeline, Length Adaptive, Early Exit, Data Augmentation |
| Tutorials & Results | Tutorials, LLM List, General Model List, Model Performance |

## 🙌Demo

* LLM Infinite Inference (up to 4M tokens)

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

* LLM QLoRA on Client CPU

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

## 📃Selected Publications/Events

* CES 2024: [CES 2024 Great Minds Keynote: Bringing the Limitless Potential of AI Everywhere: Intel Hybrid Copilot demo](https://youtu.be/70J3uO3eLZA?t=1348) (Jan 2024)
* Blog published on Medium: [Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace Open AI Function Calling](https://medium.com/11tensors/connect-an-ai-agent-with-your-api-intel-neural-chat-7b-llm-can-replace-open-ai-function-calling-242d771e7c79) (Dec 2023)
* NeurIPS'2023 on Efficient Natural Language and Speech Processing: [Efficient LLM Inference on CPUs](https://arxiv.org/abs/2311.00502) (Nov 2023)
* Blog published on Hugging Face: [Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance](https://huggingface.co/blog/Andyrasika/neural-chat-intel) (Nov 2023)
* Blog published on VMware: [AI without GPUs: A Technical Brief for VMware Private AI with Intel](https://core.vmware.com/resource/ai-without-gpus-technical-brief-vmware-private-ai-intel#section6) (Nov 2023)

> View [Full Publication List](./docs/publication.md)

## Additional Content

* [Release Information](./docs/release.md)
* [Contribution Guidelines](./docs/contributions.md)
* [Legal Information](./docs/legal.md)
* [Security Policy](SECURITY.md)
* [Apache License](./LICENSE)

## Acknowledgements

* Excellent open-source projects: [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [FastChat](https://github.com/lm-sys/FastChat), [fastRAG](https://github.com/IntelLabs/fastRAG), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [llama.cpp](https://github.com/ggerganov/llama.cpp), [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [peft](https://github.com/huggingface/peft), [trl](https://github.com/huggingface/trl), [streamingllm](https://github.com/mit-han-lab/streaming-llm) and many others.

* Thanks to all the [contributors](./docs/contributors.md).

## 💁Collaborations

We welcome any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach [us](mailto:itrex.maintainers@intel.com), and we look forward to collaborating on Intel Extension for Transformers!