Qihoo360 / hbox

AI on Hadoop
Apache License 2.0



We have renamed the repository from XLearning to hbox.

If you have a local clone of the repository, please update your remote URL:

git remote set-url origin https://github.com/Qihoo360/hbox.git

Hbox is a convenient and efficient scheduling platform that combines big data and artificial intelligence and supports a variety of machine learning and deep learning frameworks. Hbox runs on Hadoop YARN and integrates frameworks and tools such as TensorNet, TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, XGBoost, Horovod, Open MPI, and tensor2tensor. It also supports GPU resource scheduling, running in Docker, and a RESTful API management interface. Hbox offers good scalability and compatibility.

Documentation in Chinese

Architecture

There are three essential components in Hbox:

Functions

1 Support Multiple Deep Learning Frameworks

In addition to the distributed mode of TensorFlow and MXNet, Hbox supports the standalone mode of all deep learning frameworks, such as Caffe, Theano, and PyTorch. Hbox also allows custom versions of a framework and multiple versions side by side.

2 Unified Data Management Based On HDFS

Training data and model results are saved to HDFS (S3 is also supported). Hbox lets you specify the input strategy for the data passed via --input by setting the --input-strategy parameter or the hbox.input.strategy configuration property. Hbox supports three ways to read HDFS input data:

As with the input strategy, Hbox lets you specify the output strategy for the data passed via --output by setting the --output-strategy parameter or the hbox.output.strategy configuration property. There are two result output modes:

For more details, see data management.

3 Visualization Display

The application interface can be divided into four parts:


4 Compatible With The Code At Native Frameworks

Apart from the automatic construction of the ClusterSpec in distributed-mode TensorFlow, programs written for standalone-mode TensorFlow and for other deep learning frameworks can run on Hbox directly.

Compilation & Deployment Instructions

1 Compilation Environment Requirements

2 Compilation Method

Run the following command in the root directory of the source code:

./mvnw package

After compilation, a distribution package named hbox-1.1-dist.tar.gz is generated under core/target in the root directory. Unpacking the distribution package produces the following subdirectories under its root directory:

To set up the configuration, set HBOX_CONF_DIR to a folder containing a valid hbox-site.xml, or link this folder to $HBOX_HOME/conf.
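
As a minimal sketch of the two configuration options above, the following shell helper locates the configuration directory and checks that it contains hbox-site.xml. The function name resolve_hbox_conf and the rule that HBOX_CONF_DIR takes precedence over $HBOX_HOME/conf are assumptions made for illustration; neither is part of the Hbox distribution.

```shell
# Hypothetical helper (not shipped with Hbox): print the directory that
# holds hbox-site.xml, or fail if no such file can be found.
# Assumption: HBOX_CONF_DIR, when set, wins over $HBOX_HOME/conf.
resolve_hbox_conf() {
    conf_dir="${HBOX_CONF_DIR:-${HBOX_HOME:-}/conf}"
    if [ -f "${conf_dir}/hbox-site.xml" ]; then
        printf '%s\n' "${conf_dir}"
    else
        echo "hbox-site.xml not found in ${conf_dir}" >&2
        return 1
    fi
}
```

Calling resolve_hbox_conf before submitting a job gives an early, readable failure when neither location has been configured.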

3 Deployment Environment Requirements

4 Hbox Client Deployment Guide

In the "conf" directory of the unpacked distribution package ($HBOX_HOME), configure the related files:

5 Start Method of Hbox History Service [Optional]

Quick Start

Use $HBOX_HOME/bin/hbox-submit on the Hbox client to submit an application to the cluster. The following example submits a TensorFlow application.

1 Upload data to HDFS

Upload the "data" directory from the root of the unpacked distribution package to HDFS:

cd $HBOX_HOME  
hadoop fs -put data /tmp/ 

2 Submit the application

cd $HBOX_HOME/examples/tensorflow
$HBOX_HOME/bin/hbox-submit \
   --app-type "tensorflow" \
   --app-name "tf-demo" \
   --input /tmp/data/tensorflow#data \
   --output /tmp/tensorflow_model#model \
   --files demo.py,dataDeal.py \
   --worker-memory 10G \
   --worker-num 2 \
   --worker-cores 3 \
   --ps-memory 1G \
   --ps-num 1 \
   --ps-cores 2 \
   --queue default \
   python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10

The parameters have the following meanings:

Property Name    Meaning
app-name         application name, here "tf-demo"
app-type         application type, here "tensorflow"
input            input files; the HDFS path "/tmp/data/tensorflow" is mapped to the local directory "./data"
output           output files; the HDFS path "/tmp/tensorflow_model" is mapped to the local directory "./model"
files            application program and required local files, here demo.py and dataDeal.py
worker-memory    amount of memory for each worker process, here 10 GB
worker-num       number of worker containers for the application, here 2
worker-cores     number of cores for each worker process, here 3
ps-memory        amount of memory for each ps process, here 1 GB
ps-num           number of ps containers for the application, here 1
ps-cores         number of cores for each ps process, here 2
queue            the queue that the application is submitted to

For more details, see the Submit Parameter section.
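
The --input and --output values in the example use a path#name notation: the part before the '#' is the HDFS path and the part after it names the local directory seen by the program. The sketch below only illustrates that notation with plain shell parameter expansion. The helper name split_mapping is hypothetical, and this is not a description of how Hbox parses the argument internally.

```shell
# Hypothetical helper for illustration: split "hdfs_path#local_name"
# into its two halves. Assumes the argument contains a '#'.
split_mapping() {
    spec="$1"
    remote_path="${spec%%#*}"   # everything before the first '#'
    local_name="${spec#*#}"     # everything after the first '#'
    printf '%s ./%s\n' "$remote_path" "$local_name"
}

split_mapping "/tmp/data/tensorflow#data"      # prints: /tmp/data/tensorflow ./data
split_mapping "/tmp/tensorflow_model#model"    # prints: /tmp/tensorflow_model ./model
```

This matches the example command, where demo.py reads from ./data and writes to ./model.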

FAQ

Hbox FAQ

Authors

Hbox is designed, authored, reviewed, and tested by the following team on GitHub:

@Yuance Li, @Wen OuYang, @Runying Jia, @YuHan Jia, @Lei Wang

Contact us

qq