2lambda123/huawei-noah-xingtian

Introduction

XingTian (刑天) is a componentized library for the development and verification of reinforcement learning algorithms. It supports multiple algorithms, including DQN, DDPG, PPO, and IMPALA etc, which could training agents in multiple environments, such as Gym, Atari, Torcs, StarCraftII and so on. To meet users' requirements for quick verification and solving RL problems, four modules are abstracted: Algorithm, Model, Agent, and Environment. They work in a similar way as the combination of `Lego' building blocks. For details about the architecture, please see the Architecture introduction.

Dependencies

# ubuntu 18.04
sudo apt-get install python3-pip libopencv-dev -y
pip3 install opencv-python

# run with tensorflow 1.15.0 or tensorflow 2.3.1
pip3 install zmq h5py gym[atari] tqdm imageio matplotlib==3.0.3 Ipython pyyaml tensorflow==1.15.0 pyarrow lz4 fabric2 absl-py psutil tensorboardX setproctitle

or, using pip3 install -r requirements.txt

If your want to used PyTorch as the backend, please install it by yourself. Ref Pytorch

Installation

# cd PATH/TO/XingTian 
pip3 install -e .

After installation, you could use import xt; print(xt.__Version__) to check whether the installation is successful.

In [1]: import xt

In [2]: xt.__version__
Out[2]: '0.3.0'

Quick Start

Setup configuration

Follow's configuration shows a minimal example with Cartpole environment. More detailed description with the parameters of agent, algorithm and environment could been find in the User guide .

alg_para:
  alg_name: PPO
  alg_config:
    process_num: 1
    save_model: True  # default False
    save_interval: 100

env_para:
  env_name: GymEnv
  env_info:
    name: CartPole-v0
    vision: False

agent_para:
  agent_name: PPO
  agent_num : 1
  agent_config:
    max_steps: 200
    complete_step: 1000000
    complete_episode: 3550

model_para:
  actor:
    model_name: PpoMlp
    state_dim: [4]
    action_dim: 2
    input_dtype: float32
    model_config:
      BATCH_SIZE: 200
      CRITIC_LOSS_COEF: 1.0
      ENTROPY_LOSS: 0.01
      LR: 0.0003
      LOSS_CLIPPING: 0.2
      MAX_GRAD_NORM: 5.0
      NUM_SGD_ITER: 8
      SUMMARY: False
      VF_SHARE_LAYERS: False
      activation: tanh
      hidden_sizes: [64, 64]

env_num: 10

In addition, your could find more configuration sets in examples directory.

Start training task

python3 xt/main.py -f examples/cartpole_ppo.yaml -t train

Evaluate local trained model

Set benchmark.eval.model_path for evaluation within the YOUR_CONFIG_FILE.yaml

benchmark:
  eval:
    model_path: /YOUR/PATH/TO/EVAL/models
    gap: 10           # index gap of eval model
    evaluator_num: 1  # the number of evaluator instance

# run command
python3 xt/main.py -f examples/cartpole_ppo.yaml -t evaluate

NOTE: XingTian start with -t train as default.

Run with CLI

# Could replace `python3 xt/main.py` with `xt_main` command!
xt_main -f examples/cartpole_ppo.yaml -t train

# train with evaluate
xt_main -f examples/cartpole_ppo.yaml -t train_with_evaluate

Develop with Custom case

Write custom module, and register it. More detail guidance on custom module can be found in the Developer Guide
Add YOUR-CUSTOM-MODULE name into your_train_configure.yaml
Start training with xt_main -f path/to/your_train_configure.yaml :)

Reference Results

Episode Reward Average

DQN Reward after 10M time-steps (40M frames).

env XingTian Basic DQN RLlib Basic DQN Hessel et al. DQN

BeamRider 6706 2869 ~2000

Breakout 352 287 ~150

QBert 14087 3921 ~4000

SpaceInvaders 947 650 ~500
PPO Reward after 10M time-steps (40M frames).

env XingTian PPO RLlib PPO Baselines PPO

BeamRider 4877 2807 ~1800

Breakout 341 104 ~250

QBert 14771 11085 ~14000

SpaceInvaders 1025 671 ~800
IMPALA Reward after 10M time-steps (40M frames).

env XingTian IMPALA RLlib IMPALA

BeamRider 2313 2071

Breakout 334 385

QBert 12205 4068

SpaceInvaders 742 719

env	XingTian Basic DQN	RLlib Basic DQN	Hessel et al. DQN
BeamRider	6706	2869	~2000
Breakout	352	287	~150
QBert	14087	3921	~4000
SpaceInvaders	947	650	~500

env	XingTian PPO	RLlib PPO	Baselines PPO
BeamRider	4877	2807	~1800
Breakout	341	104	~250
QBert	14771	11085	~14000
SpaceInvaders	1025	671	~800

env	XingTian IMPALA	RLlib IMPALA
BeamRider	2313	2071
Breakout	334	385
QBert	12205	4068
SpaceInvaders	742	719

Throughput

DQN

env XingTian Basic DQN RLlib Basic DQN

BeamRider 129 109

Breakout 117 113

QBert 111 90

SpaceInvaders 115 100
PPO

env XingTian PPO RLlib PPO

BeamRider 2422 1618

Breakout 2497 1535

QBert 2436 1617

SpaceInvaders 2438 1608
IMPALA

env XingTian IMPALA RLlib IMPALA

BeamRider 8756 3637

Breakout 8814 3525

QBert 8249 3471

SpaceInvaders 8463 3555

env	XingTian Basic DQN	RLlib Basic DQN
BeamRider	129	109
Breakout	117	113
QBert	111	90
SpaceInvaders	115	100

env	XingTian PPO	RLlib PPO
BeamRider	2422	1618
Breakout	2497	1535
QBert	2436	1617
SpaceInvaders	2438	1608

env	XingTian IMPALA	RLlib IMPALA
BeamRider	8756	3637
Breakout	8814	3525
QBert	8249	3471
SpaceInvaders	8463	3555

Experiment condition： 72 Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz with single Tesla V100

Ray's reward data come from https://github.com/ray-project/rl-experiments, and Throughout from ray 0.8.6 with the same machine condition.

Acknowledgement

XingTian refers to the following projects: DeepMind/scalable_agent, baselines, ray.

License

The MIT License(MIT)