Aayush-Ankit / puma-simulator

[ASPLOS 2019] PUMA-simulator provides a detailed simulation model of a dataflow architecture built with NVM (non-volatile memory), and runs ML models compiled using the puma compiler.
MIT License

vgg16 simulation #65

Closed. qzylalala closed this issue 1 year ago.

qzylalala commented 1 year ago

Hi, I've recently been doing some work based on PUMA. I followed "how_to_run.md" to run the tests. However, when I tested vgg16, I got the following error:

('instruction memory size requirement', 50437)
Traceback (most recent call last):
  File "dpe.py", line 231, in <module>
    DPE().run(net)
  File "dpe.py", line 127, in run
    node_dut.node_init(self.instrnpath, self.tracepath)
  File "/work_space/puma-simulator/src/node.py", line 53, in node_init
    self.tile_list[i].tile_init (temp_instrnpath, temp_tracepath)
  File "/work_space/puma-simulator/src/tile.py", line 80, in tile_init
    self.instrn_memory.load (dict_list)
  File "/work_space/puma-simulator/src/ima_modules.py", line 669, in load
    assert (len(dict_list) <= self.size), 'instructions exceed the instruction memory size'
AssertionError: instructions exceed the instruction memory size
negishubham commented 1 year ago

Hi,

This might be due to the layer configuration that you are running. My guess is that you will get this error for a bigger layer from the VGG network. So maybe try to increase the instruction memory size (the tile_instrnMem_size variable in include/config.py). Try changing it to 4096 and see if it works. Also, please use the test scripts to run the PUMA simulator so that you are not missing any steps.
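For reference, that knob lives in the Tile parameters section of include/config.py; a minimal sketch of the change (the surrounding defaults may differ by branch):

# include/config.py -- Tile parameters section (sketch; the default value may differ by branch)
tile_instrnMem_size = 4096  # tile instruction memory capacity, in entries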

Thanks, Shubham

qzylalala commented 1 year ago

Thanks for your reply. I set tile_instrnMem_size = 4096 and found that 4096 is the maximum value, but I still got this error. I also tried checking out the vgg16 branch, but it still does not work.
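For context, the failing check is the assertion shown in the traceback, and the log already printed the per-tile requirement, so even the 4096-entry maximum cannot hold this layer's instructions. A sketch of the arithmetic, reusing only numbers from the log and config in this thread:

# Numbers taken from the log and config above (not new measurements):
required_entries = 50437        # 'instruction memory size requirement' printed by dpe.py
tile_instrnMem_size = 4096      # the maximum value the config accepts

# The same condition that load() in src/ima_modules.py asserts on:
assert required_entries <= tile_instrnMem_size, \
    'instructions exceed the instruction memory size'   # fails here: 50437 > 4096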

negishubham commented 1 year ago

I can't see the error in the above reply. Is the error the same as before? What is the layer configuration that you are trying to run?

qzylalala commented 1 year ago

Yes, here is my configuration. I copied the config from example-configs/config-cnn.py.

Because I got 178 tiles after compilation, I set num_tile_compute = 176 and num_matrix = 6 (inference); a quick sanity check of that arithmetic follows the config below.

# This file contains the configurable parameters in DPE (all hierarchies - IMA, Tile, Node)
## All user specified parameters are provided by this file only

## Debug - 0 (1): dpe simulation will (won't) produce ima/tile traces while simulating
cycles_max = 5000000 # Set both of these to very large numbers (when the design is bug-free)!
debug = 1
xbar_record = 1
inference = 1
training = not(inference)
sparse_opt = 1 # Flag for sparsity optimisation (make it 0 for dense computations only)

## Variable to define the type of MVMU
# One of "Analog", "Digital_V1" or "Digital_V2" 
# Digital_V1 has compressed inputs (Data+Offset style)
# Digital_V2 has uncompressed inputs (Skips computations for 0 activation)
MVMU_ver = "Digital_V2"

## Operand precision (fixed point allowed only): num_bits = int_bits + frac_bits
num_bits = 16
int_bits = 4
frac_bits = num_bits - int_bits

## IMA configurable parameters (permissible values for each parameter provided here)
## Instruction generation - affected by xbar_bits, num_xbar, xbar_size.
# xbar_bits: 2, 4, 6
# num_xbar: positive integer
# xbar_size: 32, 64, 128, 256
# dac_res: positive integer <= num_bits
# adc_res: positive integer <= num_bits
# num_adc: positive integer <= num_xbar (doesn't allow more than one ADC per xbar)
# num_ALU: positive integer
# dataMem_size: (in Bytes) - 256, 512, 1024, 2048 (affects instrn width, hence capped)
# instrnMem_size: (in Bytes) - 512, 1024, 2048

# Fixed parameters
addr_width = 22 # Added to address larger address space for conv layers (#TODO: Compiler needs to fix shared memory reuse)
data_width = num_bits # (in bits)
xbdata_width = data_width # (in bits), equivalent to input_prec
instrn_width = 48 # (in bits)
# Input and Weight parameters
input_prec = 16
weight_width = 16
# Change here - Specify the IMA parameters here
xbar_bits = 2
num_matrix = 6 # each matrix is 1 fw logical xbar for inference, and 1 fw, 1 bw, and 1 delta logical xbar for training. Each logical xbar maps to 8 fw physical xbars for inference, and to 8 fw, 8 bw, and 16 delta physical xbars for training.
xbar_size = 128
dac_res = 1
# ADC configuration
adc_res = 8 # around 4 to 8. this value should be
num_adc_per_matrix = 2
num_adc = num_adc_per_matrix * num_matrix

# The idea is to have a different ADC resolution value for each ADC.
# The number of ADCs is defined by the num_adc property. Currently it is 2 * num_matrix(2) = 4
# NOTE: Only indexes 0 and 2 are taken into account; 1 and 3 are ignored, because ADCs 1 and 3 are assumed to be equal to 0 and 2.
adc_res_new = {
                'matrix_adc_0' : 8,
                'matrix_adc_1' : 4,
                'matrix_adc_2' : 8,
                'matrix_adc_3' : 4
              }

num_ALU = num_matrix*2
#dataMem_size = num_matrix*(6*xbar_size) # 4 for 4 input spaces within matrix (1 for f/b each, 2 for d)
dataMem_size = 4096 # 2048 is larger than num_matrix*(6*xbar_size)
instrnMem_size = 8192 #in entries

# This depends on above parameters
if (training):
    datamem_off = xbar_size * (num_matrix*6) # each matrix has 6 memory spaces (1 for f/b, 2 for d)

if (inference):
    datamem_off = xbar_size * (num_matrix*2) # each matrix has 2 memory spaces ( 1 input Xbar memory and 1 output Xbar memory) 

phy2log_ratio = num_bits / xbar_bits # ratio of physical to logical xbar # value is 8
lr = 0.25 # learning rate for updates to d-xbar

## Tile configurable parameters (permissible values for each parameter provided here)
## Instruction generation - affected by num_ima
# num_ima: positive integer
# edram buswidth: positive integer <= 16 (actual buswidth - this integer*data_width)
# edram_size: (in KiloBytes) - 64, 128, 256, 512
# receive_buffer_depth: 4, 8, 12, 16, 32 (number of edram buffer entries (each entry maps to a virtual tile)) \
#        puts a cap on the maximum number of tiles that can send data to a tile in the next layer
# receive_buffer_width: edram_buswidth/data_width (Fixed - in terms of number of neurons)
# tile_instrnMem_size: 256, 512, 1024 (in Bytes)

# Fixed parameters
instrn_width = 48 # bits (op-2, vtile_id-6, send/receive_width-8, target_addr/counter-16, vw-8, mem_addr-16)
edram_buswidth = 256 # in bits
#receive_buffer_depth = 16
receive_buffer_depth = 150 #set equal to num_tile_max
receive_buffer_width = edram_buswidth / num_bits # size of receive buffer entry (in terms of number of neurons)

# Change here - Specify the Tile parameters here
num_ima = 8
edram_size = 2048 # in Kilobytes (64 KB - same as issac)
tile_instrnMem_size = 4096 # in entries

## Node configurable parameters (permissible values for each parameter provided here)
## Instruction generation - affected by num_tile
# num_tile_compute =  positive integer
# inj_rate < 0.2 (depends on the mapping)
# num_port: 4, 8

# Fixed parameters
# NOC topology: cmesh (n=2, k=4, c=4) - can fit k*n*c tiles
cmesh_c = 4
num_bits_tileId = 32
flit_width = 32
packet_width = edram_buswidth/data_width # in multiples of flits (data considered only - booksim considers the address itself)
# (b bit of address = logN, N is the number of nodes)

# Change here - Specify the Node parameters here
num_tile_compute = 176 # number of tiles mapped by dnn (leaving input and output tiles)
num_tile_max = 168.0 # maximum number of tiles per node
num_inj_max = num_tile_max # [conservative] max number of packet injections that can occur in a cycle (each tile injects a packet into NOC each cycle)
noc_inj_rate = 0.005
noc_num_port = 4

## Node parameters - Our way of simulation just assumes all tile in one actual node
num_node = 1

# Do not change this - total number of tiles
num_tile = num_node * num_tile_compute + 2 # +2 for the input and output tiles (dummy), others - compute

# Security parameters - Used to verify whether the model used is encrypted or authenticated (set by dpe.py)
#Do not change
encrypted = False
authenticated = False
cypher_name = ''
cypher_hash = ''
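As a quick sanity check of the tile arithmetic in this config (a sketch using only values quoted above):

num_node = 1
num_tile_compute = 176
num_tile = num_node * num_tile_compute + 2   # +2 for the dummy input and output tiles
assert num_tile == 178   # matches the 178 tiles reported after compilation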

Looking forward to your reply! Thank you very much.

negishubham commented 1 year ago

No, I meant: what are the kernel size, input/output channels, stride, and input FM size?

qzylalala commented 1 year ago

Sorry about that. I just used the given vgg16.cpp, without any edits. I got 178 tiles after compilation.

/*
 *  Copyright (c) 2019 IMPACT Research Group, University of Illinois.
 *  All rights reserved.
 *
 *  This file is covered by the LICENSE.txt license file in the root directory.
 *
 */

#include <assert.h>
#include <string>
#include <vector>

#include "puma.h"
#include "conv-layer.h"
#include "fully-connected-layer.h"

void isolated_fully_connected_layer(Model model, std::string layerName, unsigned int in_size, unsigned int out_size) {

    // Input vector
    auto in = InputVector::create(model, "in", in_size);

    // Output vector
    auto out = OutputVector::create(model, "out", out_size);

    // Layer
    out = fully_connected_layer(model, layerName, in_size, out_size, in);

}

int main() {

    Model model = Model::create("vgg16");

    // Input
    unsigned int in_size_x = 224;
    unsigned int in_size_y = 224;
    unsigned int in_channels = 3;
    auto in_stream = InputImagePixelStream::create(model, "in_stream", in_size_x, in_size_y, in_channels);

    // Layer 1 (convolution) configurations
    unsigned int k_size_x1 = 3;
    unsigned int k_size_y1 = 3;
    unsigned int in_size_x1 = 224;
    unsigned int in_size_y1 = 224;
    unsigned int in_channels1 = 3;
    unsigned int out_channels1 = 64;

    // Layer 2 (convolution with max pool) configurations
    unsigned int k_size_x2 = 3;
    unsigned int k_size_y2 = 3;
    unsigned int in_size_x2 = 224;
    unsigned int in_size_y2 = 224;
    unsigned int in_channels2 = 64;
    unsigned int out_channels2 = 64;
    unsigned int max_pool_size_x2 = 2;
    unsigned int max_pool_size_y2 = 2;

    // Layer 3 (convolution) configurations
    unsigned int k_size_x3 = 3;
    unsigned int k_size_y3 = 3;
    unsigned int in_size_x3 = 112;
    unsigned int in_size_y3 = 112;
    unsigned int in_channels3 = 64;
    unsigned int out_channels3 = 128;

    // Layer 4 (convolution with max pool) configurations
    unsigned int k_size_x4 = 3;
    unsigned int k_size_y4 = 3;
    unsigned int in_size_x4 = 112;
    unsigned int in_size_y4 = 112;
    unsigned int in_channels4 = 128;
    unsigned int out_channels4 = 128;
    unsigned int max_pool_size_x4 = 2;
    unsigned int max_pool_size_y4 = 2;

    // Layer 5 (convolution) configurations
    unsigned int k_size_x5 = 3;
    unsigned int k_size_y5 = 3;
    unsigned int in_size_x5 = 56;
    unsigned int in_size_y5 = 56;
    unsigned int in_channels5 = 128;
    unsigned int out_channels5 = 256;

    // Layer 6 (convolution) configurations
    unsigned int k_size_x6 = 3;
    unsigned int k_size_y6 = 3;
    unsigned int in_size_x6 = 56;
    unsigned int in_size_y6 = 56;
    unsigned int in_channels6 = 256;
    unsigned int out_channels6 = 256;

    // Layer 7 (convolution with max pool) configurations
    unsigned int k_size_x7 = 3;
    unsigned int k_size_y7 = 3;
    unsigned int in_size_x7 = 56;
    unsigned int in_size_y7 = 56;
    unsigned int in_channels7 = 256;
    unsigned int out_channels7 = 256;
    unsigned int max_pool_size_x7 = 2;
    unsigned int max_pool_size_y7 = 2;

    // Layer 8 (convolution) configurations
    unsigned int k_size_x8 = 3;
    unsigned int k_size_y8 = 3;
    unsigned int in_size_x8 = 28;
    unsigned int in_size_y8 = 28;
    unsigned int in_channels8 = 256;
    unsigned int out_channels8 = 512;

    // Layer 9 (convolution) configurations
    unsigned int k_size_x9 = 3;
    unsigned int k_size_y9 = 3;
    unsigned int in_size_x9 = 28;
    unsigned int in_size_y9 = 28;
    unsigned int in_channels9 = 512;
    unsigned int out_channels9 = 512;

    // Layer 10 (convolution with max pool) configurations
    unsigned int k_size_x10 = 3;
    unsigned int k_size_y10 = 3;
    unsigned int in_size_x10 = 28;
    unsigned int in_size_y10 = 28;
    unsigned int in_channels10 = 512;
    unsigned int out_channels10 = 512;
    unsigned int max_pool_size_x10 = 2;
    unsigned int max_pool_size_y10 = 2;

    // Layer 11 (convolution) configurations
    unsigned int k_size_x11 = 3;
    unsigned int k_size_y11 = 3;
    unsigned int in_size_x11 = 14;
    unsigned int in_size_y11 = 14;
    unsigned int in_channels11 = 512;
    unsigned int out_channels11 = 512;

    // Layer 12 (convolution) configurations
    unsigned int k_size_x12 = 3;
    unsigned int k_size_y12 = 3;
    unsigned int in_size_x12 = 14;
    unsigned int in_size_y12 = 14;
    unsigned int in_channels12 = 512;
    unsigned int out_channels12 = 512;

    // Layer 13 (convolution with max pool) configurations
    unsigned int k_size_x13 = 3;
    unsigned int k_size_y13 = 3;
    unsigned int in_size_x13 = 14;
    unsigned int in_size_y13 = 14;
    unsigned int in_channels13 = 512;
    unsigned int out_channels13 = 512;
    unsigned int max_pool_size_x13 = 2;
    unsigned int max_pool_size_y13 = 2;

    // Output
    unsigned int out_size_x = 7;
    unsigned int out_size_y = 7;
    unsigned int out_channels = 512;
    auto out_stream = OutputImagePixelStream::create(model, "out_stream", out_size_x, out_size_y, out_channels);

    // Layer 14 (fully-connected) configurations
    unsigned int in_size14 = 25088;
    unsigned int out_size14 = 4096;

    // Layer 15 (fully-connected) configurations
    unsigned int in_size15 = 4096;
    unsigned int out_size15 = 4096;

    // Layer 16 (fully-connected) configurations
    unsigned int in_size16 = 4096;
    unsigned int out_size16 = 1000;

    // Define network
    auto out1 = conv_layer(model, "layer" + std::to_string(1), k_size_x1, k_size_y1, in_size_x1, in_size_y1, in_channels1, out_channels1, in_stream);
    auto out2 = convmax_layer(model, "layer" + std::to_string(2), k_size_x2, k_size_y2, in_size_x2, in_size_y2, in_channels2, out_channels2, max_pool_size_x2, max_pool_size_y2, out1);
    auto out3 = conv_layer(model, "layer" + std::to_string(3), k_size_x3, k_size_y3, in_size_x3, in_size_y3, in_channels3, out_channels3, out2);
    auto out4 = convmax_layer(model, "layer" + std::to_string(4), k_size_x4, k_size_y4, in_size_x4, in_size_y4, in_channels4, out_channels4, max_pool_size_x4, max_pool_size_y4, out3);
    auto out5 = conv_layer(model, "layer" + std::to_string(5), k_size_x5, k_size_y5, in_size_x5, in_size_y5, in_channels5, out_channels5, out4);
    auto out6 = conv_layer(model, "layer" + std::to_string(6), k_size_x6, k_size_y6, in_size_x6, in_size_y6, in_channels6, out_channels6, out5);
    auto out7 = convmax_layer(model, "layer" + std::to_string(7), k_size_x7, k_size_y7, in_size_x7, in_size_y7, in_channels7, out_channels7, max_pool_size_x7, max_pool_size_y7, out6);
    auto out8 = conv_layer(model, "layer" + std::to_string(8), k_size_x8, k_size_y8, in_size_x8, in_size_y8, in_channels8, out_channels8, out7);
    auto out9 = conv_layer(model, "layer" + std::to_string(9), k_size_x9, k_size_y9, in_size_x9, in_size_y9, in_channels9, out_channels9, out8);
    auto out10 = convmax_layer(model, "layer" + std::to_string(10), k_size_x10, k_size_y10, in_size_x10, in_size_y10, in_channels10, out_channels10, max_pool_size_x10, max_pool_size_y10, out9);
    auto out11 = conv_layer(model, "layer" + std::to_string(11), k_size_x11, k_size_y11, in_size_x11, in_size_y11, in_channels11, out_channels11, out10);
    auto out12 = conv_layer(model, "layer" + std::to_string(12), k_size_x12, k_size_y12, in_size_x12, in_size_y12, in_channels12, out_channels12, out11);
    auto out13 = convmax_layer(model, "layer" + std::to_string(13), k_size_x13, k_size_y13, in_size_x13, in_size_y13, in_channels13, out_channels13, max_pool_size_x13, max_pool_size_y13, out12);
    out_stream = out13;
    isolated_fully_connected_layer(model, "layer" + std::to_string(14), in_size14, out_size14);
    isolated_fully_connected_layer(model, "layer" + std::to_string(15), in_size15, out_size15);
    isolated_fully_connected_layer(model, "layer" + std::to_string(16), in_size16, out_size16);

    // Compile
    model.compile();

    // Destroy model
    model.destroy();

    return 0;

}
negishubham commented 1 year ago

So yeah, this is the problem. Don't use this file; currently, PUMA supports per-layer simulation. Please use https://github.com/purdue-nrl/puma-compiler/blob/master/test/conv-layer.cpp to specify your layer configuration. Also, use a smaller feature map size and then scale the final results.

Let's say you are using X,Y as the FM size for your simulations in PUMA; then you should scale your latency and energy (from hw_stats.py) by ((X_actual + 2p)/(X + 2p))^2, where X_actual is the actual FM size in the NN model and p is the padding.
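A minimal sketch of that scaling in Python (assuming p = 1, which matches VGG's 3x3 same-padded convolutions; the simulated layer size and latency below are hypothetical placeholders):

# Scale per-layer results from a reduced-FM simulation back to the full model,
# per the formula above: ((X_actual + 2p) / (X + 2p))**2
def scale_factor(x_actual, x_sim, p=1):
    # x_actual: FM size in the real NN; x_sim: FM size used in the PUMA run
    return ((x_actual + 2 * p) / (x_sim + 2 * p)) ** 2

latency_sim = 1.0                                     # latency reported by hw_stats.py (placeholder)
latency_full = latency_sim * scale_factor(224, 28)    # e.g. layer 1 simulated at 28x28 instead of 224x224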

Thanks, Shubham

qzylalala commented 1 year ago

Thank you very much, I will give it a try later. Best wishes!