dragonflyoss / Dragonfly

This repository has been archived and moved to the new repository https://github.com/dragonflyoss/Dragonfly2.
https://d7y.io
Apache License 2.0

Case Study: Dragonfly adoption in DCOS of China Mobile Group Zhejiang Co., Ltd #324

Closed allencloud closed 5 years ago

allencloud commented 5 years ago

In November 2018, Dragonfly, a cloud-native image distribution system from Alibaba, was showcased at KubeCon Shanghai, and it has since become a CNCF sandbox project. Dragonfly mainly solves image distribution problems in Kubernetes-based distributed application orchestration systems. Open-sourced in 2017, it became one of Alibaba's core infrastructure technologies. In this article, we discuss the production practices of the Dragonfly-based unified file distribution platform in the DCOS of China Mobile Group Zhejiang Co., Ltd.

image

In the year since it was open-sourced, Dragonfly has been adopted in a variety of industries. DCOS is the container cloud platform at China Mobile Group Zhejiang Co., Ltd. Currently, 185 application systems are running on this platform, including core systems such as the China Mobile service mobile app and the CRM application. This article mainly describes how Dragonfly was implemented in the container cloud platform (DCOS) at China Mobile Group Zhejiang Co., Ltd to resolve problems in the large-scale cluster scenario, such as low distribution efficiency, a low success rate, and difficult network bandwidth control. In addition, based on feedback from the DCOS platform to the community, Dragonfly's features were upgraded and a high-availability deployment was established.

Challenges Faced by the DCOS Container Cloud in the Production Environment

As the DCOS container cloud platform continuously improves and hosts more and more applications (nearly 10,000 running containers), it has become increasingly difficult for distribution service systems using traditional C/S (client-server) architecture to meet requirements in scenarios such as publishing code packages and transmitting files in large-scale distributed applications due to the following reasons:

What Is Dragonfly?

Before we describe Dragonfly, let's quickly recap some basic concepts in computer networking. P2P (peer-to-peer) is a node-to-node network technology that connects individual nodes and distributes resources and services in networks among individual nodes. Information transmission and service implementation are carried out directly across nodes to avoid single-node performance bottlenecks that may otherwise occur in traditional C/S architecture.

image

Dragonfly is a CNCF open-source file distribution service solution based on the P2P and CDN technologies and suitable for distributing container images and files. Dragonfly can efficiently resolve low file and image distribution efficiency, low success rate, and network bandwidth control problems in an enterprise's large-scale cluster scenarios. Core components of Dragonfly:

Dragonfly distribution principle (taking image distribution as an example): Unlike ordinary files, container images consist of multiple storage layers. Container images are therefore downloaded layer by layer rather than as a single file. Each layer can be divided into data blocks, and these blocks serve as seeds. After the layers are downloaded, each layer's unique ID and its sha256 digest are used to assemble the layers into the complete image, which ensures consistency throughout the download process.
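The block-and-digest idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dragonfly's actual code: the piece size, the function names, and the in-memory "layer" are all hypothetical, and real pieces are far larger than four bytes.

```python
import hashlib

PIECE_SIZE = 4  # hypothetical piece size in bytes, tiny on purpose for the demo


def split_into_pieces(layer: bytes, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """Split one layer's bytes into fixed-size pieces (the last may be shorter)."""
    return [layer[i:i + piece_size] for i in range(0, len(layer), piece_size)]


def reassemble_and_verify(pieces: dict[int, bytes], expected_digest: str) -> bytes:
    """Join pieces by index and check the result against the layer's sha256 digest."""
    layer = b"".join(pieces[i] for i in sorted(pieces))
    actual = "sha256:" + hashlib.sha256(layer).hexdigest()
    if actual != expected_digest:
        raise ValueError(f"digest mismatch: expected {expected_digest}, got {actual}")
    return layer


layer = b"example layer content"
digest = "sha256:" + hashlib.sha256(layer).hexdigest()
pieces = split_into_pieces(layer)

# Pieces may arrive from different peers in any order, so key them by index.
received = {i: p for i, p in reversed(list(enumerate(pieces)))}
restored = reassemble_and_verify(received, digest)
```

Because the digest is checked on the reassembled layer, a corrupted or missing piece from any peer is detected before the layer is used.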

image

The following diagram shows how images are downloaded in Dragonfly.

image

  1. The dfget-proxy intercepts the image download request (docker pull) from the docker client and converts it into the dfget download request targeting the SuperNode.
  2. The SuperNode downloads images from the source image registry and divides them into multiple seed data blocks.
  3. The dfget downloads data blocks and shares the blocks it has downloaded with other nodes. The SuperNode records which blocks each node has downloaded and directs subsequent requests to fetch those blocks from other nodes in a P2P manner.
  4. The Docker daemon uses its image pull mechanism to combine image files into complete images.
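The scheduling role of the SuperNode in steps 2 and 3 can be illustrated with a toy model. The class and method names below are hypothetical, and the real scheduler weighs far more than "first peer that has the piece"; the sketch only shows why later downloaders stop hitting the central node.

```python
from collections import defaultdict


class SuperNode:
    """Toy scheduler: tracks which peers hold which pieces of a task."""

    def __init__(self) -> None:
        self.holders: dict[int, list[str]] = defaultdict(list)

    def pick_source(self, piece: int, requester: str) -> str:
        """Return a peer that already has the piece, else the SuperNode itself."""
        for peer in self.holders[piece]:
            if peer != requester:
                return peer
        return "supernode"

    def report_done(self, piece: int, peer: str) -> None:
        """Record that a peer finished a piece and can now serve it to others."""
        if peer not in self.holders[piece]:
            self.holders[piece].append(peer)


def download(node: SuperNode, peer: str, piece_count: int) -> list[str]:
    """Fetch every piece for one peer and return where each piece came from."""
    sources = []
    for piece in range(piece_count):
        sources.append(node.pick_source(piece, peer))
        node.report_done(piece, peer)
    return sources


node = SuperNode()
first = download(node, "host-a", 3)   # nobody has the pieces yet
second = download(node, "host-b", 3)  # host-a can now serve them
```

In this model the first host pulls every piece from the SuperNode, while the second host is directed to the first host for all of them.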

Based on the preceding Dragonfly characteristics and the actual production conditions, China Mobile Group Zhejiang Co., Ltd decided to introduce Dragonfly into its container cloud platform to reform its existing code package publishing model, relieve the bandwidth bottleneck of a single file server by spreading transfers across a P2P network, and ensure the consistency of image files throughout the publishing process.

Solution: Unified Distribution Platform

Functional Architecture Design

Based on the Dragonfly technology and the production practices of China Mobile Group Zhejiang Co., Ltd, the unified distribution platform has the following overall design objectives:

Based on these objectives, the overall architecture design is as follows:

image

The P2P network layer is a distribution network that consists of multiple computing nodes and allows different heterogeneous clusters (host clusters, K8s clusters, and Mesos clusters) to be connected.

As the core of the entire unified distribution system, the distribution service layer consists of functional modules and storage modules. The user access authentication module provides system login verification. The distribution control module, built on Dragonfly, distributes tasks in a P2P manner. The traffic control module enables tenants to configure bandwidth for different tasks. The configuration info database records basic information, such as target clusters in the network layer and task status. The status query module enables users to closely monitor the progress of distribution tasks.

The user action layer consists of any number of interface-based clients.
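The article does not say how the traffic control module enforces a per-task bandwidth setting. One common technique for this kind of limit is a token bucket, sketched below as an assumption rather than a description of the platform; the class name, rates, and explicit clock argument are all illustrative.

```python
class TokenBucket:
    """Allow up to `rate` bytes per second on average, with bursts up to `burst` bytes."""

    def __init__(self, rate: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.last = now

    def try_consume(self, nbytes: float, now: float) -> bool:
        """Refill for the elapsed time, then spend tokens if enough are available."""
        if now > self.last:
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Hypothetical per-task limit: 100 bytes/s with a 200-byte burst.
bucket = TokenBucket(rate=100, burst=200)
sent_burst = bucket.try_consume(200, now=0.0)      # the full burst fits
sent_extra = bucket.try_consume(50, now=0.0)       # bucket is empty, must wait
sent_after_wait = bucket.try_consume(50, now=0.5)  # 0.5 s refills 50 bytes
```

Passing the clock in explicitly keeps the example deterministic; a real limiter would read a monotonic clock and sleep instead of returning False.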

Technical Architecture Implementation

According to the preceding platform design objectives and architecture analyses, the DCOS container cloud team conducted secondary development of the platform features based on the open-source components, including the following:

image

Technical Characteristics

df-client is packaged as a container image. This lightweight container deployment improves networking efficiency: cluster host nodes that are newly added to the network layer can start P2P Agent nodes in a few seconds by downloading and starting the image.

The core interface layer (Docktrans) hides the underlying dfget command-line details and provides interface-based features to simplify user operations. Distributing to multiple P2P task nodes through unified remote calls eliminates the need for users to run download operations such as dfget node by node, and simplifies the "one-to-many" task launching model.

Core Functional Modules: Interaction Process of Distribution Control Interfaces

The following figure shows how the core modules of the unified distribution platform distribute tasks.

  1. A user uses the client to create an image or file distribution task.
  2. The distribution module checks whether the user has the distribution permission by using the authentication feature provided by the API service gateway (Edgetrans).
  3. After the user passes authentication, sets parameters for the distribution task, and provides the cluster ID, the platform reads cluster configuration information from the MySQL database to implement the self-discovery of the cluster nodes. The user can also specify multiple node IPs as custom cluster parameters.
  4. Depending on the distribution type, the distribution module in the core service layer (Docktrans) converts front-end distribution requests into dfget (for files) or docker pull (for images) commands, and sends the commands to the df-clients on multiple nodes for processing by remotely calling the Docker Service.
  5. During the process of performing the task, task progress and task transaction logs are written to the Redis database and the MySQL database, respectively, to enable users to query task status.
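Step 4 amounts to mapping one front-end request onto one command per target node. The following sketch shows that mapping only; the `DistributionTask` type, the function names, the registry host, and the IP addresses are hypothetical, and the `--url`/`--output` flags are assumed from the Dragonfly 1.x dfget CLI. Remote execution, authentication, and progress logging are left out.

```python
import shlex
from dataclasses import dataclass


@dataclass
class DistributionTask:
    kind: str         # "file" or "image"
    source: str       # file URL or image reference
    output: str = ""  # destination path, used for files only


def build_command(task: DistributionTask) -> list[str]:
    """Translate a distribution request into the command a node would run."""
    if task.kind == "file":
        # Flag names assumed from the Dragonfly 1.x dfget CLI.
        return ["dfget", "--url", task.source, "--output", task.output]
    if task.kind == "image":
        return ["docker", "pull", task.source]
    raise ValueError(f"unknown distribution type: {task.kind}")


def fan_out(task: DistributionTask, node_ips: list[str]) -> dict[str, str]:
    """Produce one shell-safe command line per target node (one-to-many launch)."""
    command_line = shlex.join(build_command(task))
    return {ip: command_line for ip in node_ips}


image_task = DistributionTask(kind="image", source="registry.example.com/app:1.0")
plan = fan_out(image_task, ["10.0.0.1", "10.0.0.2"])
```

Keeping command construction separate from dispatch means the same plan can be logged, reviewed, or replayed before anything is sent to the nodes.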

image

Production Environment Reformation Results

Over 200 business systems and over 1,700 application modules running in the production environment have now been switched to the image publishing model. Both publishing time and the publishing success rate have improved significantly. Since the P2P image publishing method was adopted, the monthly success rate of publishing multiple applications at a time has held steady at 98%.

image

After April, the container cloud platform began using the P2P image publishing method in place of the code package publishing model of traditional distribution systems. Since this change, publishing multiple applications at once takes significantly less time (67% less on average).

image

In addition, the container cloud platform team selected multiple application clusters to test the efficiency of publishing a single application's images over P2P after the transformation. As we can see, the time taken to publish a single application is significantly lower (81.5% less on average) than on the platform before the change.

image

Subsequent Utilization Plans

The unified file distribution platform has resolved the efficiency and consistency problems that China Mobile Group Zhejiang Co., Ltd faced when publishing code on its DCOS platform, and it has become a key component of the platform. It also supports efficient file distribution in larger-scale clusters. Going forward, the platform can be applied to batch-distribute cluster installation media and batch-update cluster configuration files.

Community Co-Construction: Interface Function Display

Community Requirements Resulting from Directly Introducing Dragonfly

The interface-based client is nearly complete and is now in production testing and deployment. The four planned core features of the distribution platform are Task Management, Target Management, Permission Management, and System Analysis. The first three are currently available.

Permission Management

Permission Management (namely, user management) is designed to provide customized permission management features targeting different users, as listed below:

image

image

Target Management

Target Management enables users to manage target cluster nodes when distributing tasks and manage P2P cluster networking, as well as cluster node status and health, as described below:

image

image

Task Management

Task Management enables users to create, delete, and stop file or image distribution tasks and perform other operations, as detailed below:

image

image

System Analysis (coming soon)

The system analysis feature is expected to be released later to provide platform administrators and users with statistical graphs showing information such as task distribution time consumption, success rate, and task execution efficiency and facilitate platform intelligence via data statistics and prediction.

Community Co-Construction: High-Availability Deployment of Production

Disaster tolerance is provided by an active-standby mirrored database: mirror synchronization keeps the data in the active and standby databases consistent.

image

Dragonfly Community Sharing

Tai Yun, a contributor in the Dragonfly community, said during a Dragonfly Meetup, "Dragonfly is now a CNCF sandbox project with 2700+ stars. Many enterprises are using Dragonfly to resolve various problems they have encountered when distributing images and files. We will continuously improve Dragonfly to provide a more powerful and simpler distribution tool for cloud-native applications. I look forward to working with you to make Dragonfly a CNCF 'graduated' project as soon as possible."

Dragonfly Roadmap

We currently plan to contribute interface feature displays to the CNCF Dragonfly community to further enrich community content. We hope that more people join and help to improve the community.

Authors:
Chen Yuanzheng, Cloud Computing Architect at China Mobile Group Zhejiang Co., Ltd
Wang Miaoxin, Cloud Computing Architect at China Mobile Group Zhejiang Co., Ltd

To learn more about Dragonfly, visit https://developer.alibabacloud.com/opensource/project/dragonfly
Official GitHub page: https://github.com/dragonflyoss/Dragonfly

starnop commented 5 years ago

@allencloud We have a blog here https://d7y.io/zh-cn/blog/china-mobile-practice.html. What do you think of closing it now?

allencloud commented 5 years ago

@allencloud We have a blog here https://d7y.io/zh-cn/blog/china-mobile-practice.html. What do you think of closing it now?

Do we have an English version of this? If we do, then I think we could close this issue. @Starnop

ForgetMe17 commented 5 years ago

how can i post a blog on https://d7y.io/zh-cn/blog

allencloud commented 5 years ago

Hi, @ForgetMe17

how can i post a blog on https://d7y.io/zh-cn/blog

You could submit a PR to https://github.com/dragonflyoss/website/tree/master/blog/zh-cn.

In addition, have you read the issue https://github.com/dragonflyoss/Dragonfly/issues/219? If you are using Dragonfly, please leave a comment. 🌞 🎴 🐯 🎖

starnop commented 5 years ago

Make a record here.

The English version of this blog has been added in the PR https://github.com/dragonflyoss/website/pull/37

starnop commented 5 years ago

Has been added in the repo dragonflyoss/website. So close it now.