ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Request Support for XVERSE-MoE-A4.2B #6567

Closed aspwow closed 2 months ago

aspwow commented 3 months ago

Prerequisites

Feature Description

XVERSE-MoE-A4.2B is a multilingual large language model independently developed by Shenzhen Yuanxiang Technology, built on a Mixture-of-Experts (MoE) architecture. The model has 25.8 billion parameters in total, of which 4.2 billion are activated per token. The model released this time is the base model XVERSE-MoE-A4.2B. Its key features are as follows:

- Model Structure: XVERSE-MoE-A4.2B uses the mainstream decoder-only Transformer architecture and extends the FFN layer of dense models into expert layers. Unlike traditional MoE models, where each expert has the same size as a standard FFN (such as Mixtral 8x7B), it uses more fine-grained experts, each 1/4 the size of a standard FFN. It includes shared and non-shared experts: shared experts are always activated during computation, while non-shared experts are selectively activated through a router (see the sketch after this list).
- Training Data: The model has been thoroughly trained on a diverse, high-quality dataset of 3.2 trillion tokens covering more than 40 languages, including Chinese, English, Russian, and Spanish. The sampling ratios of the different data types are carefully tuned, giving excellent performance in Chinese and English while still accounting for other languages. The model is trained on samples of length 8k.
- Training Framework: We carried out deeply customized optimization of the expert routing and weight computation logic unique to the MoE model and developed an efficient fused operator to improve computational efficiency. To address the MoE model's high memory consumption and communication volume, we designed a scheme that overlaps computation, communication, and CPU offload to increase overall throughput.
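To make the expert layout above concrete, here is a minimal PyTorch sketch of an FFN layer with shared plus routed fine-grained experts. This is not the XVERSE implementation and not llama.cpp code; the class names, layer widths, expert counts, top-k value, and router normalization are illustrative assumptions, chosen only to show the mechanism (each expert at 1/4 the standard FFN width, shared experts always applied, non-shared experts selected per token by a router).

```python
# Illustrative sketch only: sizes, expert counts, and top-k are assumptions,
# not the actual XVERSE-MoE-A4.2B configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedExpert(nn.Module):
    """One fine-grained expert: a SwiGLU-style FFN at 1/4 of the standard FFN width."""

    def __init__(self, d_model: int, d_ff_standard: int):
        super().__init__()
        d_expert = d_ff_standard // 4                        # 1/4 the size of a standard FFN
        self.gate = nn.Linear(d_model, d_expert, bias=False)
        self.up = nn.Linear(d_model, d_expert, bias=False)
        self.down = nn.Linear(d_expert, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class SharedPlusRoutedMoE(nn.Module):
    """Shared experts always run; non-shared experts are picked per token by a top-k router."""

    def __init__(self, d_model: int = 1024, d_ff_standard: int = 4096,
                 n_shared: int = 2, n_routed: int = 16, top_k: int = 4):
        super().__init__()
        self.shared = nn.ModuleList(FineGrainedExpert(d_model, d_ff_standard) for _ in range(n_shared))
        self.routed = nn.ModuleList(FineGrainedExpert(d_model, d_ff_standard) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        out = sum(expert(x) for expert in self.shared)       # shared experts: always activated
        scores = F.softmax(self.router(x), dim=-1)           # (n_tokens, n_routed)
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)   # per-token expert choices + weights
        for e_idx, expert in enumerate(self.routed):
            hit = (topk_idx == e_idx)                        # (n_tokens, top_k) selection mask
            token_mask = hit.any(dim=-1)
            if token_mask.any():
                w = (topk_w * hit).sum(dim=-1, keepdim=True) # router weight for this expert
                out[token_mask] += w[token_mask] * expert(x[token_mask])
        return out


if __name__ == "__main__":
    layer = SharedPlusRoutedMoE()
    tokens = torch.randn(8, 1024)
    print(layer(tokens).shape)                               # torch.Size([8, 1024])
```

The masked per-expert loop above favors clarity over speed; real implementations gather tokens per expert and use fused kernels, which is what the Training Framework note above alludes to.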

HuggingFace: https://huggingface.co/xverse/XVERSE-MoE-A4.2B

GitHub: https://github.com/xverse-ai/XVERSE-MoE-A4.2B

OpenXLab: https://openxlab.org.cn/models/detail/xverse/XVERSE-MoE-A4.2B



Motivation

The performance of XVERSE-MoE-A4.2B is really good. It would be great if it were also supported in llama.cpp. Thanks a lot!

qxpf666 commented 3 months ago

That's a great suggestion

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.