NVlabs / timeloop

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.
https://timeloop.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License

Parallelization possibilities supported by Timeloop #84

Closed · GuillaumeDEVIC closed this issue 3 years ago

GuillaumeDEVIC commented 3 years ago

Hello,

I am looking at your interesting work on Timeloop+Accelergy. For the moment, I would like to explore the different architectural topologies that Timeloop provides.

Based on exercise 4 of the Timeloop tutorial (https://github.com/Accelergy-Project/timeloop-accelergy-exercises), I can see that it is possible to parallelize PEs connected to the same GlobalBuffer (left image below). I would like to obtain the configuration on the right, which has two GlobalBuffers of different sizes (for example: BigBuffer_depth: 8192; LittleBuffer_depth: 1024), each connected to its own group of PEs. Is it possible to describe this architecture, and if so, what syntax do I have to respect for this type of parallelization?

[Image: github_question — left: one GlobalBuffer feeding a PE array; right: two GlobalBuffers of different sizes, each feeding its own PE group]

Thanks in advance.

angshuman-parashar commented 3 years ago

It's possible to configure the architecture on the right, but at the moment Timeloop natively models only equally-sized buffer instances. In the arch spec, set an instance count for the GlobalBuffer just as you do for the PEs, e.g., GlobalBuffer[0..1].
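To make that concrete, here is a minimal sketch of such an arch spec in the exercise-style v0.3 YAML format; the level names, depths, and widths are illustrative assumptions, not values from this thread:

```yaml
architecture:
  version: 0.3
  subtree:
    - name: System
      local:
        - name: MainMemory                 # assumed top-level DRAM, as in the exercises
          class: DRAM
          attributes: {width: 64, word-bits: 16}
      subtree:
        - name: Chip
          local:
            - name: GlobalBuffer[0..1]     # two GLB instances; Timeloop models them as equally sized
              class: SRAM
              attributes: {depth: 8192, width: 64, word-bits: 16}
          subtree:
            - name: PE[0..15]              # 16 PEs total => a fanout of 8 PEs per GLB instance
              local:
                - name: MACC
                  class: intmac
                  attributes: {datawidth: 16}
```

With such a spec, the mapper will spatially partition the compute space across the two GLB instances, which (as noted further down in this thread) is the behavior to keep in mind before using instance counts for other purposes.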

midiareshadi commented 3 years ago

> Is it possible to do this architecture? If yes, what would be the syntax I have to respect for this type of parallelization?

Hi, I defined the architecture in this arch file. The arch file includes three subtrees:

  1. The System subtree includes the Main Memory.
  2. The Cluster subtree includes two storage instances, BigBuffer and LittleBuffer.
  3. The PE subtree includes three PEs.

All of the simulation inputs are here, and you can find the simulation results in the output folder.
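For readers who cannot open the linked files, a rough reconstruction of how those three subtrees could be laid out follows; this is a hedged sketch, not midiareshadi's actual file, and every attribute value is a placeholder:

```yaml
architecture:
  version: 0.3
  subtree:
    - name: System                     # subtree 1: main memory
      local:
        - name: MainMemory
          class: DRAM
          attributes: {width: 64, word-bits: 16}
      subtree:
        - name: Cluster                # subtree 2: the two buffers
          local:
            - name: BigBuffer          # listing two storages under one local:
              class: SRAM              # makes them consecutive (stacked) levels,
              attributes: {depth: 8192, width: 64, word-bits: 16}
            - name: LittleBuffer       # with BigBuffer outside LittleBuffer
              class: SRAM
              attributes: {depth: 1024, width: 64, word-bits: 16}
          subtree:
            - name: PE[0..2]           # subtree 3: three PEs
              local:
                - name: MACC
                  class: intmac
                  attributes: {datawidth: 16}
```

Note that this stacks the two buffers one inside the other rather than placing them side by side with their own PE groups, consistent with the earlier point that Timeloop does not natively model differently-sized sibling instances.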
leehc257 commented 3 years ago

Hello, @angshuman-parashar

I am making good use of Timeloop with your help, and I am trying to do the same parallel processing as above. As you told me, I want to connect memory instances [0..2] to a 64x16 PE array. However, I get the following error. Can Timeloop handle this configuration or not?

[Image: target configuration]

error: [Image: Timeloop error message]

architecture & constraint: [Image: arch spec] [Image: constraint spec]

angshuman-parashar commented 3 years ago

Think about what you are doing. If you tell Timeloop there are 3 GLB instances, it will try to perform a spatial partitioning of the compute space across those instances. That's not your intent here -- you are simply trying to create a partitioned buffer to hold different tensors. The way you want to express them is as a stack -- a WeightGLB, an InputGLB and an OutputGLB. Set the bypasses for each of them and you should be all set.
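A hedged sketch of that stacked arrangement plus its bypass directives might look like the following; the dataspace names assume a conv-style problem spec (Weights/Inputs/Outputs), and the depths are placeholders:

```yaml
architecture:
  # ... version header and DRAM level above, elided ...
  subtree:
    - name: Chip
      local:                           # three consecutive GLB levels, one per tensor
        - name: WeightGLB
          class: SRAM
          attributes: {depth: 8192, width: 64, word-bits: 16}
        - name: InputGLB
          class: SRAM
          attributes: {depth: 4096, width: 64, word-bits: 16}
        - name: OutputGLB
          class: SRAM
          attributes: {depth: 4096, width: 64, word-bits: 16}
      subtree:
        - name: PE[0..1023]            # the 64x16 PE array
          local:
            - name: MACC
              class: intmac
              attributes: {datawidth: 16}

architecture_constraints:
  targets:                             # keep exactly one tensor per GLB, bypass the rest
    - target: WeightGLB
      type: bypass
      keep: [Weights]
      bypass: [Inputs, Outputs]
    - target: InputGLB
      type: bypass
      keep: [Inputs]
      bypass: [Weights, Outputs]
    - target: OutputGLB
      type: bypass
      keep: [Outputs]
      bypass: [Weights, Inputs]
```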

leehc257 commented 3 years ago

I tried building the architecture as you said. Timeloop is running without compilation errors. But I am wondering why the SRAMs other than o_f_sram_glb do not communicate with DRAM (e.g., DRAM <==> weight_sram_glb). I think it is because of the overlapping for-loops in the 7-layer conv problem. Is my prediction correct?

arch & constraint: [Image: arch spec] [Image: constraint spec]

report: [Image: Timeloop output stats]

angshuman-parashar commented 3 years ago

The way we've configured the system, Timeloop thinks DRAM to i_f_sram traffic is going through the o_f_sram network stop (i.e., over 2 consecutive network segments, DRAM->o_f_sram and o_f_sram->i_f_sram). However, inputs are bypassed at the o_f_sram so they don't consume any capacity, bandwidth or read/write energy.

leehc257 commented 3 years ago

Thank you for your quick and kind response

GuillaumeDEVIC commented 3 years ago

Hello,

I would like to ask you a second question, after which I will close this issue : ).

Thanks to @angshuman-parashar and @midiareshadi for your answers. Especially for the example which is very instructive (simple and efficient 👍).

Initially I asked whether it is possible to have two buffers of different sizes, each with its own PE cluster, and I understood that Timeloop does not support this today.

Continuing with the question of parallelization, is it possible to describe an adder tree within a PE (shown below)? And more specifically, is it possible to define a buffer with one 32-bit input and several 16-bit outputs?

[Image: github_question_4 — a PE-internal adder tree: a buffer with a 32-bit input fanning out to several 16-bit lanes whose outputs are reduced by an adder tree]

Thanks in advance.

angshuman-parashar commented 3 years ago

What I'm seeing here is a 4-way spatial fanout from the buffer into 4 lanes. Let's call that a new "Lane" level in Timeloop. You can set a buffer of size 0 at the Lane level and bypass it for all tensors.

We model adder trees as networks in Timeloop. In fact, there should be a reduction-tree network module already available. What you want to do here is instantiate a separate network for the read-fill data between the Buffer and the Lane level (this can be a classic XY NoC network) and another reduction-tree network for the update-drain data between the Buffer and the Lane level. Connect the networks up to the read, fill, update and drain ports at each buffer and you should be all set.

You can set the word_bits for each network independently.
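Putting those pieces together, a speculative sketch of the Lane level with two explicitly instantiated networks might look like this. Caveat: the network class names (XY_NoC, ReductionTree) and the port-wiring attribute names (network_read, network_fill, network_update, network_drain) are assumptions inferred from this description; check the network models shipped with your Timeloop version for the exact spelling:

```yaml
subtree:
  - name: PE
    local:
      - name: Buffer
        class: SRAM
        attributes:
          depth: 1024
          word-bits: 32
          network_read:   DistNoC      # assumed: read port sends read-fill data over the XY NoC
          network_update: ReduceTree   # assumed: update port receives update-drain data from the tree
      - name: DistNoC                  # classic XY NoC for Buffer -> Lane read-fill traffic
        class: XY_NoC
        attributes: {word-bits: 16}    # word-bits is set per network, independently
      - name: ReduceTree               # reduction-tree network for Lane -> Buffer update-drain traffic
        class: ReductionTree
        attributes: {word-bits: 32}
    subtree:
      - name: Lane[0..3]               # the 4-way spatial fanout into lanes
        local:
          - name: LaneBuf
            class: SRAM
            attributes:
              depth: 0                 # size-0 buffer, bypassed for all tensors
              word-bits: 16
              network_fill:  DistNoC     # assumed: fill port fed by the XY NoC
              network_drain: ReduceTree  # assumed: drain port feeds the reduction tree
          - name: MACC
            class: intmac
            attributes: {datawidth: 16}
```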