apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

Multiple compute units for VTA? #3833

Closed · kloud1989 closed this issue 5 years ago

kloud1989 commented 5 years ago

The current VTA design has only one compute unit and one set of on-chip buffers, which constrains the overall throughput. Is it possible for VTA to support multiple compute units?

Ravenwater commented 5 years ago

Are you concerned about the throughput of a single problem, or do you want to build concurrency to increase capacity?

Of course it is possible to create N-wide VTAs. In regular execution, the compute block will receive tens of thousands of compute requests from the same network. One VTA carries a lot of overhead in terms of buffers and serialization, so two VTAs are by construction slower than one bigger VTA that consumes the same resources as the two combined: percentage-wise, the one big VTA will have many more compute resources than the two smaller VTAs. So going N-wide on the VTA cores will be less optimal than making the core bigger.
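For intuition, here is a back-of-the-envelope sketch of that resource argument; the budget and per-core overhead numbers are made-up assumptions, not measured VTA figures:

```python
# Illustrative only: how a fixed per-core overhead eats into the compute budget
# as the same FPGA resources are split across more cores.

TOTAL_RESOURCES = 100.0   # arbitrary resource budget (think % of the FPGA fabric)
PER_CORE_OVERHEAD = 15.0  # assumed fixed cost per core: buffers, decode, serialization

def compute_fraction(num_cores, total=TOTAL_RESOURCES, overhead=PER_CORE_OVERHEAD):
    """Fraction of the budget left for the actual GEMM datapath after overhead."""
    per_core = total / num_cores
    return max(per_core - overhead, 0.0) * num_cores / total

for n in (1, 2, 4):
    print(f"{n} core(s): {compute_fraction(n):.0%} of the budget goes to compute")
# 1 core(s): 85%   2 core(s): 70%   4 core(s): 40%
```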

Now, N-wide compute structures do have a benefit in concurrency models that increase capacity. So if you are building a VTA accelerator card for a data center appliance where the server is running many neural network models concurrently, then N-wide VTA cores would make sense. The complexity will be, first, in the software to drive the concurrency, as you now need a resource manager and possibly core affinity, and secondly in the efficiency of the command and memory streams. Since you are now cutting the command bandwidth to each core by N, you need to rebalance the ISA. And if you are using local memory, you will need to design a memory controller that can aggregate reads and writes from the cores, just like GPUs need to do for their processors.

kloud1989 commented 5 years ago

Thanks for your quick response @Ravenwater.

I think increasing the core size can't solve the problem.

The current VTA design is tensorized along the batch dimension (sized BATCH, default 1), the input channel dimension (sized BLOCK_IN, default 16) and the output channel dimension (sized BLOCK_OUT, default 16). To tensorize with TVM, the number of input channels must be relatively large; otherwise the compute resources are not used efficiently: either zero padding has to be done along the input channel dimension, which results in low efficiency, or the small convolution can't be offloaded to VTA at all. In the first convolution of most image classification networks, the input channel size is 3, so VTA currently can't offload this workload efficiently. If we keep increasing the core size, there will be more operators that can't be offloaded efficiently.
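For concreteness, a quick sketch of the padding-efficiency point, using VTA's default GEMM tile shape (BLOCK_IN = BLOCK_OUT = 16) and illustrative layer shapes:

```python
# Fraction of useful MACs when channel dimensions are zero-padded up to the tile size.

BLOCK_IN = 16    # VTA default input-channel tiling
BLOCK_OUT = 16   # VTA default output-channel tiling

def mac_utilization(c_in, c_out, block_in=BLOCK_IN, block_out=BLOCK_OUT):
    padded_in = -(-c_in // block_in) * block_in      # round up to a multiple of block_in
    padded_out = -(-c_out // block_out) * block_out  # round up to a multiple of block_out
    return (c_in * c_out) / (padded_in * padded_out)

# First conv of a typical image-classification network: only 3 input channels.
print(f"conv1 (3 -> 64):   {mac_utilization(3, 64):.0%} of MACs are useful")    # ~19%
# A later layer with wide channels fits the tiling perfectly.
print(f"conv  (64 -> 128): {mac_utilization(64, 128):.0%} of MACs are useful")  # 100%
```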

However, with more cores, we can block the computation more flexibly. If the input channel size is small but the output channel size is large, we can distribute the computation of different output channels to different cores. Of course, this would require both TVM/VTA software stack support and a new hardware design (core interconnect, memory partitioning, etc.). And this can be useful not only in a resource-managed server environment.
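A hypothetical sketch of the output-channel blocking idea; VTA has no multi-core runtime today, so the partitioning helper below is purely illustrative:

```python
# Assign contiguous output-channel ranges to cores as evenly as possible,
# so each core can still work on full BLOCK_OUT tiles even when C_in is small.

def split_output_channels(c_out, num_cores):
    base, extra = divmod(c_out, num_cores)
    ranges, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        ranges.append((core, start, start + size))
        start += size
    return ranges

# 64 output channels over 4 cores -> each core computes 16 channels (one BLOCK_OUT tile).
for core, lo, hi in split_output_channels(64, 4):
    print(f"core {core}: output channels [{lo}, {hi})")
```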

Ravenwater commented 5 years ago

The TPU designers settled on a 256x256 matrix multiply unit, and they are showing CNNs that are compute bound on that matmul. That size compute module would not fit on an FPGA and required an ASIC, but it is indicative of the compute density that the models at Google are exhibiting.

VTA will be bound to FPGAs for the development community, so its compute module will be at least an order of magnitude smaller. That means the VTA design space will be more memory bound than the TPU world, as the efficiency of the matmul will be lower, and in turn that multi-core approaches will be even less efficient, since there isn't enough concurrency to keep the cores active.
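A rough roofline-style sketch of that scaling argument, assuming an N x N systolic-style array that performs N^2 MACs per cycle while streaming about 2N input words per cycle; real VTA/TPU datapaths differ, so treat the numbers as an order-of-magnitude illustration only:

```python
# Arithmetic intensity (MACs per input word) of an N x N array under the
# simple streaming model above: N^2 MACs/cycle divided by 2N words/cycle.

def arithmetic_intensity(n):
    return (n * n) / (2 * n)   # = n / 2

for n in (16, 256):
    print(f"{n}x{n} array: ~{arithmetic_intensity(n):.0f} MACs per word streamed")
# 16x16 (FPGA-scale): ~8 MACs/word   vs   256x256 (TPU-scale): ~128 MACs/word,
# so the smaller array needs far more memory bandwidth per unit of compute.
```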

I would explore changing the compilation/scheduling strategy to leverage bigger compute cores before biting off the complexity of a multi-core pipeline. There is a very successful example in the TPU to emulate.

kloud1989 commented 5 years ago

Multi-core is indeed a direction full of challenges, and it would be a big project. I opened this issue just to see whether there has been any thinking on this topic.

With its 256x256 systolic array, the TPU also shows low efficiency on CNNs (about half of the 65536 MACs hold useful weights). As far as I can see, with TVM/VTA the tensorization will impose some constraints on the workload size. That's why I argued that increasing the core size is not good enough.

So yeah, your idea about changing the compilation/scheduling strategy is very interesting to me. Looking forward to your future work!

Ravenwater commented 5 years ago

@kloud1989 the issues of utilization and compute efficiency are going to be key to building and delivering a productive research platform. We should plan some instrumentation in both TSIM and the VTA hardware to measure utilization and efficiency during execution. Thirty years ago I invented a performance methodology called XUE, for throughput (X), utilization (U), and efficiency (E), using operational analysis notation, and we implemented that methodology in the Intel chipsets. A decade later, I redid this in the NVIDIA chipsets as well. The patents are now EOL, so we can implement it freely in VTA.

The basic idea is simple. Using operational analysis, you count all the instructions that flow through the pipeline during some period T; that gives you the throughput, X. At each stage, you also measure whether the stage is working on something or idle; that gives you the utilization, U. Assume your pipeline has a peak throughput of P. If your pipeline is running at 100% efficiency, you should get U·P = X. However, inefficiency in the data flow due to stalls and bubbles lowers that, which you can quantify with the ratio E = X / (U·P). That, in one paragraph, is the XUE method.
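A minimal sketch of that bookkeeping, with hypothetical counter names standing in for whatever the TSIM/VTA instrumentation would actually report:

```python
# XUE from three counters per pipeline stage plus the stage's peak throughput P.

def xue(instructions_retired, busy_cycles, total_cycles, peak_throughput):
    """Return throughput X, utilization U, and efficiency E = X / (U * P)."""
    X = instructions_retired / total_cycles   # observed throughput (instr/cycle)
    U = busy_cycles / total_cycles            # fraction of time the stage was busy
    E = X / (U * peak_throughput) if U > 0 else 0.0
    return X, U, E

# Example: 6000 instructions over 10000 cycles, stage busy for 8000 of them, peak 1 instr/cycle.
X, U, E = xue(6000, 8000, 10000, peak_throughput=1.0)
print(f"X = {X:.2f} instr/cycle, U = {U:.0%}, E = {E:.0%}")   # X = 0.60, U = 80%, E = 75%
```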

tqchen commented 5 years ago

Thanks for the great discussion. The community uses https://discuss.tvm.ai/ for general development discussions; please move this thread there :)