CI/CD Pipeline Using GitLab and Rancher #2

Open Hujun opened 6 years ago

We have been talking about CI/CD for years. Like other "everyone knows about it but no one knows how to do it" concepts, it has produced plenty of practices, procedures, talks and slides, but few of them are useful. I don't want to complain that so many people use these concepts before they really know what they are and how they work. In most cases, I would say, they simply failed to find the appropriate tools and "savoir-faire" to solve the problem.

In this post, I want to introduce a CI/CD pipeline solution based on GitLab and Rancher v1.6. It may not be the best one, but I hope it can inspire a solution for your own needs.

Overview

[Diagram: gitlab_rancher]

As the diagram above shows, the whole procedure can be split into two consecutive parts, CI and CD. The CI part mainly relies on GitLab CI, and the CD part lives within Rancher. GitLab CI covers the whole develop-build-deploy life cycle with sophisticated workflow settings, while Rancher takes care of deployment-related issues such as environment management, hardware resource management, service composition, monitoring and service scalability.

Evaluation Standards

Before getting into more detail, I want to state my criteria for a CI/CD pipeline clearly, because my considerations may not fit your circumstances. To avoid unnecessary arguments and wasted time, I suggest you read all of these standards before continuing to the following paragraphs.

Following Occam's razor, I want to keep the system as simple as possible. I don't want to use too many tools and components.
The integration of components in the pipeline should be based on triggers (HTTP requests, messages etc.) rather than on configuration.
Reuse existing best practices as much as possible.
Offer user-friendly procedures. Most operations and configurations should be done in a GUI, because a good CI/CD solution should involve all stakeholders of software development, even those who do not know much about programming, shell manipulation or DevOps tools.

Why GitLab CI

I believe hardly anyone uses a version control system other than Git nowadays. Among the three mainstream Git platforms (GitHub, Bitbucket and GitLab), GitLab is so far the only one with a fully integrated CI function, while the other two can only use webhooks to integrate third-party CI tools, which means a more complex system and more complex procedures.


I have to say that GitLab CI is quite complete in functionality and easy to use. Even in the Community Edition, you are free to use these CI functions, and they come with well-written documentation. I guess it is a commercial strategy of GitLab to compete with GitHub. Obviously, it did a good job.

Why Containerization (docker)

Containers are not a necessary condition for CI/CD. But without containerization (Docker), it costs much more to reach the same level of:

deployment & configuration
fine grained service control
a thriving ecosystem of tools and methodologies
cost of hardware/software resources
service scalability

I will not go into detail here about the benefits and features of Docker. I believe it has become basic know-how for every developer living in the modern software world.

Of course, you may be a master of Ansible or SaltStack and able to implement a CI/CD pipeline with them effortlessly. I'm not against this kind of solution, and I have to admit that in some scenarios it is more efficient. In my view, though, it is more suitable for system initialization and the deployment of fundamental components than for services. CI/CD without Docker is out of the scope of this post.

Why Rancher

Here comes the "vs" part. The question "why xxx?" is equivalent to "why not yyy?". I list the competitors of Rancher below (please remind me if I missed any):

AliCloud Container Service is now simply Kubernetes on AliCloud, so there is no need to discuss it in detail.

AWS ECS is perfect for managing your services, but only if all of them run on AWS EC2. In other words, the most significant flaw of ECS is its lack of cross-datacenter (cross-cloud) capability. Take my own case as an example: most of my services run on AliCloud. To coordinate resources on both AWS and AliCloud, I would have to introduce another component (e.g. Terraform) and a more complex workflow and procedures, which means a heavy workload and potential maintenance effort. Furthermore, orchestration features in ECS such as auto-scaling, secret management and automatic load balancing are not as good as those in Rancher. All of this makes ECS a solution suitable only for small-scale services on EC2 hosts.

Nomad is an elegantly designed system, and I'm happy to use it in prototype systems. Its simplicity brings beauty, and drawbacks as well. The most obvious shortcoming of Nomad is the lack of a UI. In my view, a good CI/CD solution should involve all the roles in development, including developers, testers, PMs and so on. Not all of them can understand these "magic" tools and operations well, so a friendly UI is sometimes critical for introducing a CI/CD solution into a company. Nomad also has no native service discovery and no real configuration management. You have to make up for that with other tools such as Consul and etcd.

Similar drawbacks can also be found in Docker Swarm, although Swarm has the advantage of native support in Docker Engine. If you only want to cluster your containers, it is no doubt your No. 1 option. But if we evaluate it from the viewpoint of the whole CI/CD workflow, the missing pieces of UI, service discovery and configuration management cannot be ignored. Nevertheless, thanks to its native interoperability with Docker Engine, it works very well as the infrastructure layer of an orchestration tool.

DC/OS, which is based on Mesos and Marathon, is very similar to Rancher. The UI is better, in my eyes. In functionality, it has almost everything Rancher has, except rich authentication support. Unfortunately, the community edition of DC/OS is very limited in function compared to the enterprise edition. I regard DC/OS as a luxury edition of Mesos, and its community edition is just for demos.

Now we finally come to the most famous one, Kubernetes. I have to admit that Kubernetes has every feature you can imagine. Theoretically, you can use it to do everything in CD: containerization, service scaling, smart clustering, configuration management, logging and monitoring, high availability of services, automatic scale-out and so on. Kubernetes is now the best solution for infrastructure management, or, I could say, there is no second option. But in the end I leave it in the infrastructure layer, not in the service management layer of CD. The reason was mentioned above in my comment on Nomad. Compared to the UI operations of Rancher, Kubernetes pod configuration is far more DevOps-oriented, and I don't want to expose infrastructure configs and mix them with service deployment configs. Someone may argue that infrastructure and resource config and service deployment config are essentially the same thing. But from my personal point of view, service deployment config is more about mapping business logic, and I'm not willing to muddle so many things together in one place, especially when the procedure concerns dozens of departments and hundreds of users in different roles. Fortunately, Rancher supports a Kubernetes template, which means we can create and manage Kubernetes clusters directly in Rancher. That is really cool. I always see Kubernetes as a perfect resource pool management tool, and Rancher allows a perfect usage of it.

Conclusion:

AWS ECS is limited to the Amazon ecosystem and is not good at service orchestration.
Nomad and Swarm have no UI, no service discovery and no configuration management.
DC/OS is good, but only in the enterprise edition.
Kubernetes is perfect for infrastructure management but not easy to use for the service config of CD.
Rancher provides full-featured service configuration for deployment, scaling and clustering, and has a nice UI and a powerful API with detailed documentation. It also supports Kubernetes, Swarm and Mesos as the underlying layer.

Please note that all these conclusions and comments follow from the evaluation standards mentioned above. Starting from different considerations, you may arrive at totally different opinions.

Implement Step by Step

I will not repeat the installation instructions for GitLab, the CI runner and Rancher here. I believe you can easily find this information on their official websites. I will only explain the tricky parts and introduce my own CI workflow (which may not be suitable for you).

Rancher behind Nginx Proxy in Docker

The official documentation gives a good example of an nginx config file for a proxy deployment. But if you try to deploy Rancher and nginx together using Docker and have difficulty with the configuration files, you can find a complete example here.

.gitlab-ci.yml

My CI config is as follows:

variables:
  DOCKER_REGISTRY_URL: "your_private_docker_registry_url"
  # Rancher server API endpoint
  # must include the scheme (http:// or https://)
  RANCHER_ENDPOINT_URL: "your_rancher_api_endpoint_url"

stages:
  - build
  - test
  - staging
  - deploy

before_script:
  - 'echo "PROJECT NAME : $CI_PROJECT_NAME"'
  - 'echo "PROJECT ID : $CI_PROJECT_ID"'
  - 'echo "PROJECT URL : $CI_PROJECT_URL"'
  - 'echo "ENVIRONMENT URL : $CI_ENVIRONMENT_URL"'
  - 'echo "DOCKER REGISTRY URL : $DOCKER_REGISTRY_URL"'
  - 'export PATH=$PATH:/usr/bin'

# after_script:

build_image:
  stage: build
  only:
    - master
    - develop
    - staging
  when: manual
  allow_failure: false
  script:
    - 'echo "Job $CI_JOB_NAME triggered by $GITLAB_USER_NAME ($GITLAB_USER_ID)"'
    - 'echo "Build on $CI_COMMIT_REF_NAME"'
    - 'echo "HEAD commit SHA $CI_COMMIT_SHA"'
    # docker repo name must be lowercase
    - 'PROJECT_NAME_LOWERCASE=$(tr "[:upper:]" "[:lower:]" <<< $CI_PROJECT_NAME)'
    - 'IMAGE_REPO=$DOCKER_REGISTRY_URL/$PROJECT_NAME_LOWERCASE/$CI_COMMIT_REF_NAME'
    - 'IMAGE_TAG=$IMAGE_REPO:$CI_COMMIT_SHA'
    - 'IMAGE_TAG_LATEST=$IMAGE_REPO:latest'
    - 'docker build -t $IMAGE_TAG -t $IMAGE_TAG_LATEST .'
    - 'OLD_IMAGE_ID=$(docker images --filter="before=$IMAGE_TAG" $IMAGE_REPO -q)'
    - '[[ -z $OLD_IMAGE_ID ]] || docker rmi -f $OLD_IMAGE_ID'
    - 'docker push $IMAGE_TAG'
    - 'docker push $IMAGE_TAG_LATEST'

deploy_test:
  stage: test
  only:
    - develop
  when: manual
  environment:
    name: test
  variables:
    CI_RANCHER_ACCESS_KEY: $CI_RANCHER_ACCESS_KEY_TEST
    CI_RANCHER_SECRET_KEY: $CI_RANCHER_SECRET_KEY_TEST
    CI_RANCHER_STACK: $CI_RANCHER_STACK
    CI_RANCHER_SERVICE: $CI_RANCHER_SERVICE
    CI_RANCHER_ENV: $CI_RANCHER_ENV_TEST
  script:
    - 'echo "Deploy for test"'
    - 'ceres rancher_deploy --rancher-url=$RANCHER_ENDPOINT_URL --rancher-key=$CI_RANCHER_ACCESS_KEY --rancher-secret=$CI_RANCHER_SECRET_KEY --service=$CI_RANCHER_SERVICE --stack=$CI_RANCHER_STACK --rancher-env=$CI_RANCHER_ENV'

deploy_validate:
  stage: staging
  only:
    - staging
  when: manual
  environment:
    name: staging
  variables:
    CI_RANCHER_ACCESS_KEY: $CI_RANCHER_ACCESS_KEY_TEST
    CI_RANCHER_SECRET_KEY: $CI_RANCHER_SECRET_KEY_TEST
    CI_RANCHER_STACK: $CI_RANCHER_STACK
    CI_RANCHER_SERVICE: $CI_RANCHER_SERVICE
    CI_RANCHER_ENV: $CI_RANCHER_ENV_STAGING
  script:
    - 'echo "Deploy for validation"'
    - 'ceres rancher_deploy --rancher-url=$RANCHER_ENDPOINT_URL --rancher-key=$CI_RANCHER_ACCESS_KEY --rancher-secret=$CI_RANCHER_SECRET_KEY --service=$CI_RANCHER_SERVICE --stack=$CI_RANCHER_STACK --rancher-env=$CI_RANCHER_ENV'

deploy_production:
  stage: deploy
  only:
    - master
  when: manual
  environment:
    name: production
  variables:
    CI_RANCHER_ACCESS_KEY: $CI_RANCHER_ACCESS_KEY_TEST
    CI_RANCHER_SECRET_KEY: $CI_RANCHER_SECRET_KEY_TEST
    CI_RANCHER_STACK: $CI_RANCHER_STACK
    CI_RANCHER_SERVICE: $CI_RANCHER_SERVICE
    CI_RANCHER_ENV: $CI_RANCHER_ENV_PROD
  script:
    - 'echo "Deploy for production"'
    - 'ceres rancher_deploy --rancher-url=$RANCHER_ENDPOINT_URL --rancher-key=$CI_RANCHER_ACCESS_KEY --rancher-secret=$CI_RANCHER_SECRET_KEY --service=$CI_RANCHER_SERVICE --stack=$CI_RANCHER_STACK --rancher-env=$CI_RANCHER_ENV'

You will find more information about GitLab CI configuration here if needed. Once you are familiar with GitLab CI configuration, the workflow defined in the YAML file should be easy to follow.
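The image naming done by the shell lines of the build_image job can be sketched in Python to make the resulting names explicit. This is only an illustration: the function name `image_tags` and the sample registry, project and commit values are made up, and the real job does this work in shell.

```python
def image_tags(registry_url: str, project_name: str, ref_name: str, commit_sha: str) -> dict:
    """Mirror the repository and tag names built in the build_image job."""
    # Docker repository names must be lowercase, so the project name is normalised
    # (the job uses `tr` for the same purpose).
    repo = "{}/{}/{}".format(registry_url, project_name.lower(), ref_name)
    return {
        "repo": repo,
        # one tag pinned to the commit, one moving `latest` tag per branch
        "sha_tag": "{}:{}".format(repo, commit_sha),
        "latest_tag": "{}:latest".format(repo),
    }


if __name__ == "__main__":
    # hypothetical values, for illustration only
    tags = image_tags("registry.example.com", "MyService", "develop", "3f2a9c1")
    print(tags["sha_tag"])
    print(tags["latest_tag"])
```

Each branch therefore gets its own repository path, with one immutable tag per commit and a `latest` tag that always points at the most recent build of that branch.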

Deploy using Rancher

Each deployment job defined in the GitLab CI config file above contains only one real command, which starts with "ceres rancher_deploy". What does it mean? "ceres" is a command (a Python package) installed in the GitLab CI runner environment, and "rancher_deploy" is a subcommand of "ceres". The full documentation of the "rancher_deploy" command is:

Usage: ceres rancher_deploy [OPTIONS]

  Deploy using rancher API

Options:
  --rancher-url TEXT              rancher server API endpoint URL  [required]
  --rancher-key TEXT              rancher account or environment API access
                                  key  [required]
  --rancher-secret TEXT           rancher account or environment API secret
                                  corresponding to the access key  [required]
  --rancher-env TEXT              used to specify environment if account key
                                  is provided
  --stack TEXT                    stack name defined in rancher  [required]
  --service TEXT                  service name defined in rancher  [required]
  --batch-size INTEGER            number of containers to upgrade at once
  --batch-interval INTEGER        interval (in second) between upgrade batches
  --sidekicks / --no-sidekicks    upgrade sidekicks services at the same time
  --start-before-stopping / --no-start-before-stopping
                                  start new containers before stopping the old
                                  ones
  --help                          Show this message and exit.
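To show how the options listed above combine, here is a minimal sketch that assembles the argument list from the same environment variables the CI jobs use. The helper name `ceres_deploy_args` and the sample values are hypothetical; only the flags themselves come from the help text above.

```python
from typing import List, Mapping


def ceres_deploy_args(env: Mapping[str, str], batch_size: int = 1,
                      batch_interval: int = 2, start_first: bool = False) -> List[str]:
    """Build the argv for `ceres rancher_deploy` from CI variables."""
    args = [
        "ceres", "rancher_deploy",
        "--rancher-url={}".format(env["RANCHER_ENDPOINT_URL"]),
        "--rancher-key={}".format(env["CI_RANCHER_ACCESS_KEY"]),
        "--rancher-secret={}".format(env["CI_RANCHER_SECRET_KEY"]),
        "--stack={}".format(env["CI_RANCHER_STACK"]),
        "--service={}".format(env["CI_RANCHER_SERVICE"]),
        "--batch-size={}".format(batch_size),
        "--batch-interval={}".format(batch_interval),
    ]
    # --rancher-env is optional: it is only needed when an account key is used
    if env.get("CI_RANCHER_ENV"):
        args.append("--rancher-env={}".format(env["CI_RANCHER_ENV"]))
    # boolean options come as an on/off flag pair
    args.append("--start-before-stopping" if start_first else "--no-start-before-stopping")
    return args


if __name__ == "__main__":
    # hypothetical values, for illustration only
    sample_env = {
        "RANCHER_ENDPOINT_URL": "http://rancher.example.com/v2-beta",
        "CI_RANCHER_ACCESS_KEY": "key",
        "CI_RANCHER_SECRET_KEY": "secret",
        "CI_RANCHER_STACK": "web",
        "CI_RANCHER_SERVICE": "api",
        "CI_RANCHER_ENV": "test",
    }
    print(" ".join(ceres_deploy_args(sample_env, batch_size=2, start_first=True)))
```

The returned list could be handed to `subprocess.run(..., check=True)` if the command had to be launched from Python rather than from the job script.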

The source code of the rancher_deploy command is as follows:

import json

import click

# `cli` is the click group of the ceres package and `RancherClient` is the class
# listed further below; both are assumed to be defined or imported in this module.


@cli.command()
@click.option('--rancher-url', required=True, help='rancher server API endpoint URL')
@click.option('--rancher-key', required=True, help='rancher account or environment API access key')
@click.option('--rancher-secret', required=True, help='rancher account or environment API secret corresponding to the access key')
@click.option('--rancher-env', default=None, help='used to specify environment if account key is provided')
@click.option('--stack', required=True, help='stack name defined in rancher')
@click.option('--service', required=True, help='service name defined in rancher')
@click.option('--batch-size', default=1, help='number of containers to upgrade at once')
@click.option('--batch-interval', default=2, help='interval (in second) between upgrade batches')
@click.option('--sidekicks/--no-sidekicks', default=False, help='upgrade sidekicks services at the same time')
@click.option('--start-before-stopping/--no-start-before-stopping', default=False,
              help='start new containers before stopping the old ones')
def rancher_deploy(rancher_url, rancher_key, rancher_secret, rancher_env, stack, service,
                   batch_size, batch_interval, sidekicks, start_before_stopping):
    """Deploy using rancher API"""
    rancher_cli = RancherClient(rancher_url, rancher_key, rancher_secret)
    env_id = rancher_cli.environment_id(rancher_env)
    if not env_id:
        click.secho('Environment {} not found in rancher'.format(rancher_env), fg='red')
        # click.Abort is an exception: it has to be raised to stop the command
        raise click.Abort()
    service_info = rancher_cli.service_info(env_id, stack, service)
    if not service_info:
        click.secho('Service {} not found in rancher'.format(service), fg='red')
        raise click.Abort()
    service_id = service_info['id']
    click.secho('Check and finish service upgrade')
    service_info = rancher_cli.service_finish_upgrade(env_id, service_id)
    click.secho('Service info:')
    click.secho(json.dumps(service_info))

    # do upgrade
    rancher_cli.service_upgrade(env_id, service_id, batch_size, batch_interval, sidekicks, start_before_stopping)
    click.secho('Waiting for upgrade finish')
    service_info = rancher_cli.service_finish_upgrade(env_id, service_id)

    click.secho('Service {} deploy complete on {}'.format(service_info['name'], rancher_url))
    return service_info

# -*- coding: utf8 -*-

import json
from typing import Dict
from time import sleep
from copy import deepcopy

import requests

class RancherClient(object):
    def __init__(self, endpoint_url: str, key: str, secret: str):
        """
        Args:
            endpoint_url (str): rancher server API endpoint URL
            key (str): rancher account or environment API access key
            secret (str): rancher account or environment API secret corresponding to the access key
        """
        self.endpoint_url = endpoint_url
        self.s = requests.Session()
        self.s.auth = (key, secret)
        # timeout in seconds when polling for a service state change
        self.timeout = 60

    def environment_id(self, name: str=None) -> str:
        """
        Get rancher environment ID. If using account key, return the environment ID specified by `name`.

        Args:
            name (str): name for the environment requested (only useful for account key)

        Returns:
            environment ID in string
        """
        if not name:
            r = self.s.get('{}/projects'.format(self.endpoint_url), params={'limit': 1000})
        else:
            r = self.s.get('{}/projects'.format(self.endpoint_url), params={'limit': 1000, 'name': name})
        r.raise_for_status()
        data = r.json()['data']
        if data:
            return data[0]['id']
        return None

    def service_info(self, environment_id: str, stack_name: str, service_name: str) -> Dict:
        """
        Get rancher service info by given environment id and service name.

        Args:
            environment_id (str): defined environment id in rancher
            stack_name (str): defined stack name in rancher
            service_name (str): defined service name in rancher

        Returns:
            service info in json
        """
        if not environment_id:
            raise Exception('Empty rancher environment ID')
        r = self.s.get('{}/projects/{}/stacks'.format(self.endpoint_url, environment_id),
                       params={'limit': 1000, 'name': stack_name})
        r.raise_for_status()
        data = r.json()['data']

        if not data:
            # stack not found
            raise Exception('Stack {} not found'.format(stack_name))

        stack_info = deepcopy(data[0])

        r = self.s.get('{}/projects/{}/services'.format(self.endpoint_url, environment_id),
                       params={'name': service_name})
        r.raise_for_status()
        data = r.json()['data']
        if not data:
            # service not found
            return None
        for service_info in data:
            if service_info['stackId'] == stack_info['id']:
                return service_info
        return None

    def service_finish_upgrade(self, environment_id: str, service_id: str) -> Dict:
        """
        Finish service upgrade when service is in `upgraded` state.

        Args:
            environment_id (str): defined environment id in rancher
            service_id (str): defined service id in rancher

        Returns:
            service info in json
        """
        r = self.s.get('{}/projects/{}/services/{}'.format(self.endpoint_url, environment_id, service_id))
        r.raise_for_status()
        data = r.json()
        if data.get('type') == 'error':
            raise Exception(json.dumps(data))
        if data['state'] == 'active':
            return data

        if data['state'] == 'upgrading':
            retry = 0
            while data['state'] != 'upgraded':
                sleep(2)
                retry += 2
                if retry > self.timeout:
                    raise Exception('Timeout of rancher finish upgrade service {}'.format(service_id))
                r = self.s.get('{}/projects/{}/services/{}'.format(self.endpoint_url, environment_id, service_id))
                r.raise_for_status()
                data = r.json()

        if data['state'] != 'upgraded':
            raise Exception('Unable to finish upgrade service in state of {}'.format(data['state']))
        r = self.s.post('{}/projects/{}/services/{}/'.format(self.endpoint_url, environment_id, service_id),
                        params={'action': 'finishupgrade'})
        r.raise_for_status()

        # wait till service finish upgrading
        retry = 0
        while data['state'] != 'active':
            sleep(2)
            retry += 2
            if retry > self.timeout:
                raise Exception('Timeout of rancher finish upgrade service {}'.format(service_id))
            r = self.s.get('{}/projects/{}/services/{}'.format(self.endpoint_url, environment_id, service_id))
            r.raise_for_status()
            data = r.json()

        return data

    def service_upgrade(self, environment_id: str, service_id: str, batch_size: int=1,
                        batch_interval: int=2, sidekicks: bool=False, start_before_stopping: bool=False) -> Dict:
        """
        Upgrade service

        Args:
            environment_id (str): defined environment id in rancher
            service_id (str): defined service id in rancher
            batch_size (int): number of containers to upgrade at once
            batch_interval (int): interval (in second) between upgrade batches
            sidekicks (bool): upgrade sidekicks services at the same time
            start_before_stopping (bool): start new containers before stopping the old ones

        Returns:
            service info in json
        """
        r = self.s.get('{}/projects/{}/services/{}'.format(self.endpoint_url, environment_id, service_id))
        r.raise_for_status()
        data = r.json()
        if data.get('type') == 'error':
            raise Exception(json.dumps(data))
        if data['state'] != 'active':
            raise Exception('Service {} in state of {}, cannot upgrade'.format(service_id, data['state']))

        upgrade_input = {'inServiceStrategy': {
            'batchSize': batch_size,
            'intervalMillis': batch_interval * 1000,
            'startFirst': start_before_stopping,
            'launchConfig': data['launchConfig'],
            'secondaryLaunchConfigs': [],
        }}
        if sidekicks:
            upgrade_input['inServiceStrategy']['secondaryLaunchConfigs'] = data['secondaryLaunchConfigs']

        r = self.s.post('{}/projects/{}/services/{}/'.format(self.endpoint_url, environment_id, service_id),
                        params={'action': 'upgrade'}, json=upgrade_input)
        r.raise_for_status()

        return r.json()
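The two wait loops in service_finish_upgrade follow the same pattern: fetch the service, check its state, sleep, and give up after a timeout. A minimal sketch of that pattern as a standalone helper is shown below. The name `wait_for_state` and the injectable `fetch` and `sleep` callables are assumptions made for illustration and testing; they are not part of the original package, and the timeout check is slightly simplified compared to the listing above.

```python
import time
from typing import Callable, Dict


def wait_for_state(fetch: Callable[[], Dict], target: str, timeout: float = 60,
                   interval: float = 2,
                   sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Call fetch() until the returned dict has state == target.

    Raises TimeoutError once `timeout` seconds of waiting have been used up.
    """
    waited = 0.0
    data = fetch()
    while data["state"] != target:
        if waited >= timeout:
            raise TimeoutError("state {!r} not reached after {}s (last state: {!r})".format(
                target, timeout, data["state"]))
        sleep(interval)
        waited += interval
        data = fetch()
    return data


if __name__ == "__main__":
    # Fake service that becomes "upgraded" on the third poll; no real API call is made.
    states = iter(["upgrading", "upgrading", "upgraded"])
    result = wait_for_state(lambda: {"state": next(states)}, "upgraded", sleep=lambda s: None)
    print(result["state"])
```

In the client, `fetch` would be a small function performing the GET on the service URL and returning `r.json()`, so both loops could share one implementation.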

Note that I use the v2-beta version of the Rancher API here. If you want to use Rancher API v1, some modifications are needed.
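The request body that service_upgrade sends can be isolated into a pure function, which makes it easy to inspect without a Rancher server. This is a sketch only: the field names are copied from the service_upgrade method above, while the function name `build_upgrade_input` and the sample service dict are hypothetical.

```python
import json
from typing import Any, Dict


def build_upgrade_input(service: Dict[str, Any], batch_size: int = 1, batch_interval: int = 2,
                        sidekicks: bool = False,
                        start_before_stopping: bool = False) -> Dict[str, Any]:
    """Build the in-service upgrade body from the current service description."""
    # Sidekick launch configs are only included when sidekicks should be upgraded too.
    secondary = service.get("secondaryLaunchConfigs", []) if sidekicks else []
    return {
        "inServiceStrategy": {
            "batchSize": batch_size,
            # the CLI takes seconds, the request body uses milliseconds
            "intervalMillis": batch_interval * 1000,
            "startFirst": start_before_stopping,
            # the existing launch config is reused as-is
            "launchConfig": service["launchConfig"],
            "secondaryLaunchConfigs": secondary,
        }
    }


if __name__ == "__main__":
    # made-up service description, for illustration only
    sample_service = {
        "launchConfig": {"imageUuid": "docker:registry.example.com/myservice/develop:latest"},
        "secondaryLaunchConfigs": [{"name": "sidekick"}],
    }
    print(json.dumps(build_upgrade_input(sample_service, batch_size=2, batch_interval=5), indent=2))
```

Because the launch config is sent back unchanged, the upgrade relies on the image tag it already points to, such as the `latest` tag pushed by the build job.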

The benefit of encapsulating the Rancher trigger logic in a Python package (command) is that it hides the complexity of the scripts from the GitLab CI config. More importantly, it decouples the CI workflow from the CD execution details. Otherwise, every change in the Rancher deployment logic would require a change to the .gitlab-ci.yml file in every project that follows the CI/CD workflow. That would be a disaster.

An elegant solution is to build the Python package into the GitLab Runner Docker image, and to register the runner in Docker mode in GitLab.

Conclusion

This post introduced a CI/CD pipeline solution based on GitLab CI and Rancher. It explained why I chose GitLab and Rancher to implement the whole workflow, and then walked through the details of the implementation and design.

~~I would open source the python package used as deployment cli in the post with more features.~~ See the "Youtiao" repo for more details about the Rancher deployment command-line tool.

Since v2.0.x, Rancher has provided an easier, built-in pipeline function for similar operations. More details can be found in the Rancher pipeline documentation.
