
Caffe Con Troll v. 0.1

This is an Alpha release of CaffeConTroll. Feedback is welcome!

See our paper and slides presented at the 2015 SIGMOD Workshop on Data Analytics at Scale (DanaC).


Overview

Caffe con Troll (CcT) is a clone of the uber-popular Caffe framework for Deep Learning. CcT takes the same input files and produces the same outputs as Caffe, but has rebuilt internals. We're academics, which means that CcT is built for a research purpose: to explore the relative efficiency of GPUs and CPUs for Deep Learning.

Why Study CPU versus GPU? Well, there is an ongoing debate about this with lots of passion on both sides! GPUs are wildly popular with some companies that are rumored to be installing purpose-built infrastructures for deep learning; other companies have opted to use CPUs and claim they are cheaper and more efficient. For users outside the web companies, the situation is different: some cloud providers don't have GPUs, or their GPUs are not as rapidly updated as their CPUs. In the lab, GPUs can be expensive to obtain. In contrast, academic labs like ours have CPUs lying around for other purposes, so we were curious about how much throughput we could get from CPUs for Deep Learning. Our current results show that CcT's CPU code is an order of magnitude faster than Caffe's CPU code.

New Techniques In the initial version of CcT, CcT's algorithms are statistically identical to Caffe's. However, CcT uses batching, device scheduling, and other techniques to speed up end-to-end network execution time. In the near future, we plan to extend CcT in a few directions.

Of course, if you have feedback or challenge problems, let us know!

Getting Started VM

Probably the easiest way to try CcT is via a VM. These are publicly available on AWS and Azure.

EC2 g2.2xlarge: (CCT-0.1-1GPU) ami-00b5ae68

EC2 c4.4xlarge: (CCT-0.1-CPU) ami-58b1aa30

EC2 g2.8xlarge: (CCT-0.1-4GPU) ami-c75db8ac

Instructions for each AMI are listed in the file

/home/ubuntu/AMI_INSTRUCTIONS.txt

Azure D-Series: (See instructions below)

EC2 g2.8xlarge

For example, consider the g2.8xlarge AMI, which can be used to run AlexNet on 4 GPUs.

First, open the EC2 instance: (CCT-0.1-4GPU) ami-c75db8ac

Once the AMI is opened, look at AMI_INSTRUCTIONS.txt:

Follow these instructions to load the correct libraries and change to the CaffeConTroll root directory.
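The exact commands are listed in AMI_INSTRUCTIONS.txt on the instance. As a rough, hypothetical sketch only (the paths below are assumptions, not the contents of the file), the setup amounts to something like:

# hypothetical sketch -- use the actual commands from AMI_INSTRUCTIONS.txt
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH   # make the GPU libraries visible to the linker
cd ~/CaffeConTroll                                               # change to the CcT root directory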

Once that is done, run AlexNet on 1 GPU:

./caffe-ct train tests/imagenet_train/solver/alexnet_solver_1GPU.prototxt -o tests/model.bin

Argument description: train selects training mode, the .prototxt file is the solver describing the network and training parameters, and -o gives the path where the trained model binary is written.

Notice that a forward + backward iteration, including gradient updates, takes 2.75 seconds.

Next, run with 1 GPU as well as the CPU. The command is the same, except for a different prototxt file which specifies that the CPU should also be used:

./caffe-ct train tests/imagenet_train/solver/alexnet_solver_1GPU_CPU.prototxt -o tests/model.bin

Finally, run with 4 GPUs. Once again the command is the same, except for a different prototxt file which specifies that 4 GPUs should be used:

./caffe-ct train tests/imagenet_train/solver/alexnet_solver_4GPU.prototxt -o tests/model.bin

Notice a >3x speedup on the current AMI compared to 1 GPU. The full 4x speedup on this 4-GPU instance will be available once the model-update portion of the distributed CcT project is complete.

These results are summarized below:

EC2 c4.4xlarge

To run the c4.4xlarge AMI (or the larger c4.8xlarge EC2 instance):

First, open the EC2 instance: (CCT-0.1-CPU) ami-58b1aa30

Follow the instructions in

/home/ubuntu/AMI_INSTRUCTIONS.txt

to set the correct library paths.

Once this is done, run CcT on CaffeNet:

./caffe-ct train tests/imagenet_train/solver/caffenet_solver_1000.prototxt -o tests/model.bin

CcT finishes an iteration in 3.8 seconds.

You can also run Caffe on the same prototxt file used by the caffenet_solver_1000.prototxt solver:

~/caffe/build/tools/caffe time -model=tests/imagenet_train/train_val/caffenet_train_val_1000.prototxt --iterations 1

Note that Caffe takes 16.5 seconds per iteration. Note also that Caffe is being run in "time" mode, which does not include gradient updates in these 16.5 seconds (CcT's timing does).

CcT partitions each mini-batch into 16 partitions to process in parallel. The impact of batch size is shown below:

Note: When running this AMI on a different EC2 instance (e.g. running this c4.4xlarge AMI on c4.8xlarge), you may need to recompile OpenBLAS to avoid memory errors:

cd CaffeConTroll/externals/OpenBLAS-0.2.14/

make clean && make -j

Azure D-Series

To run on an Azure Standard D-Series VM (tested on Ubuntu 14.04), open a VM and then download and run the following script:

wget https://raw.githubusercontent.com/HazyResearch/CaffeConTroll/master/docs/VM_Instructions/azure_setup.bash

chmod 777 azure_setup.bash

./azure_setup.bash

This will install CcT and set the correct library paths for the session. When opening a new session, follow the instructions here.
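For new sessions, the essential step is making the OpenBLAS library visible again before running caffe-ct. A minimal sketch, assuming OpenBLAS sits under CaffeConTroll-master/externals/OpenBLAS-0.2.14 as in the recompilation note below (the setup script's actual instructions take precedence):

# sketch only -- adjust the path to wherever azure_setup.bash placed CaffeConTroll
export LD_LIBRARY_PATH=$HOME/CaffeConTroll-master/externals/OpenBLAS-0.2.14:$LD_LIBRARY_PATH
cd ~/CaffeConTroll-master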

Once this is done, run CcT on CaffeNet:

./caffe-ct train tests/imagenet_train/solver/caffenet_solver_1000_azure.prototxt -o tests/model.bin

Result on D4 instance:

Result on D14 instance:

Note: When switching instances for the same VM (e.g. from D4 to D14), you may need to recompile OpenBLAS to avoid memory errors:

cd CaffeConTroll-master/externals/OpenBLAS-0.2.14/

make clean && make -j

Installation from Source

git clone git@github.com:HazyResearch/CaffeConTroll.git

make clean && make -j all

make test && ./test

It's good on a laptop, on a server, or for a snack. It is unclear whether CcT can smell the blood of Christian men.

Partitioning Data for (Multiple) GPUs

CcT supports data and model parallelism across multiple GPUs. Data parallelism is recommended for all layers except fully-connected. For large fully-connected layers, model parallelism works better.

Data Parallelism For a given layer, specify the proportion of a batch to run on the GPU using the prototxt attributes:

  gpu_0_batch_proportion
  gpu_1_batch_proportion
  gpu_2_batch_proportion
  gpu_3_batch_proportion

Currently we have attributes for only the first 4 GPUs on the node (as this is the most common case for a single node), although CcT can support more than 4.

For example, to run the first convolutional layer of AlexNet on 1 GPU, we add one line to the layer description:

layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  ...
  convolution_param {
    ...
  }
  gpu_0_batch_proportion: 1.0                # New line added
}

To run with data parallelism on 4 GPUs, partitioning a mini-batch across all 4 GPUs equally,

layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  ...
  convolution_param {
    ...
  }
  gpu_0_batch_proportion: 0.25
  gpu_1_batch_proportion: 0.25
  gpu_2_batch_proportion: 0.25
  gpu_3_batch_proportion: 0.25
}

The partitions do not need to be equal. To run 40% on the CPU and 60% on GPU 2,

layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  ...
  convolution_param {
    ...
  }
  gpu_2_batch_proportion: 0.6
}

The default is to run on the CPU, i.e. no modification to the .prototxt file is needed to run the network on the CPU.

For more examples, see the prototxt files in tests/imagenet_train/train_val/

Model Parallelism This is similar to data parallelism above, but uses the following attributes:

  gpu_0_depth_proportion
  gpu_1_depth_proportion
  gpu_2_depth_proportion
  gpu_3_depth_proportion

Note that on a single GPU, model parallelism and data parallelism are equivalent, so if using one GPU, simply use

  gpu_0_batch_proportion: 1.0

as in the example above.
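For multiple GPUs, here is a hypothetical sketch of model parallelism on a large fully-connected layer, splitting the layer's output depth evenly across 2 GPUs. The layer names and values are illustrative (a standard Caffe INNER_PRODUCT layer); only the gpu_X_depth_proportion attributes are CcT-specific:

layers {
  # illustrative example only
  name: "fc6"
  type: INNER_PRODUCT
  bottom: "pool5"
  top: "fc6"
  ...
  inner_product_param {
    ...
  }
  gpu_0_depth_proportion: 0.5
  gpu_1_depth_proportion: 0.5
}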

Known Issues

Contact

Send flames to Chris; questions to the current team, Stefan Hadjis, Ce Zhang, and Chris; and praise to the past members who built CcT, Firas Abuzaid and Shubham Gupta.