StanfordLegion / task-bench

A task benchmark
Apache License 2.0

I want to know the process that task-bench kernel executes on legion #98

Open · AngryBear2 opened 1 year ago

AngryBear2 commented 1 year ago

Hello, I have a basic understanding of task-bench at this point, but there are still several things that are not clear to me, so I would like to ask:

  1. It seems that the dependence type (the -type option) is what generates the dependencies of the task graph, and the kernel is the part that is actually executed. For example, if I use a stencil dependence type with compute_kernel, how is the generated Legion code partitioned? How are data dependencies handled across multiple nodes? Does each node split the compute_kernel into many parts, or does each node execute the same compute_kernel?

  2. I am not familiar with how memory_kernel executes on Legion, so I hope you can explain it to me.

elliottslaughter commented 1 year ago

Legion has two main sets of partitions, which are defined here:

https://github.com/StanfordLegion/task-bench/blob/bf4ad1982d4b748bf3fd54e8239e798114e6ade8/legion/main.cc#L477-L478

The primary partition is just an equal partition:

https://github.com/StanfordLegion/task-bench/blob/bf4ad1982d4b748bf3fd54e8239e798114e6ade8/legion/main.cc#L526
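
For intuition, here is a minimal, hypothetical sketch of what an equal partition looks like in the Legion API. It is not the actual main.cc code; the function and variable names are made up for illustration:

```c++
#include "legion.h"
using namespace Legion;

// Hypothetical sketch: split the points of the task graph evenly across
// the color space, one color per task, and derive the matching logical
// partition of the region that holds the (fake) task data.
void make_primary_partition(Context ctx, Runtime *runtime,
                            IndexSpace points, IndexSpace colors,
                            LogicalRegion region)
{
  // Each color gets an equal-sized chunk of the index space.
  IndexPartition ip = runtime->create_equal_partition(ctx, points, colors);
  LogicalPartition lp = runtime->get_logical_partition(ctx, region, ip);
  (void)lp; // index launches would name lp's subregions in their requirements
}
```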

The secondary partitions are more complicated and encode, essentially, the dependence patterns:

https://github.com/StanfordLegion/task-bench/blob/bf4ad1982d4b748bf3fd54e8239e798114e6ade8/legion/main.cc#L529-L571
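
As a rough illustration of what "encoding a dependence pattern as a partition" can mean in Legion, here is a hedged sketch (not the code at the link above; names and structure are hypothetical): build a pending partition whose subspace at each color is the union of the points that the task at that color depends on.

```c++
#include "legion.h"
#include <vector>
using namespace Legion;

// Hypothetical sketch: deps[c] lists the point subspaces that the task at
// color c reads from. The resulting partition lets an index launch at
// color c request exactly its dependencies' data.
IndexPartition make_dependence_partition(
    Context ctx, Runtime *runtime,
    IndexSpace points, IndexSpace colors,
    const std::vector<std::vector<IndexSpace> > &deps)
{
  IndexPartition ip =
      runtime->create_pending_partition(ctx, points, colors);
  for (size_t c = 0; c < deps.size(); ++c) {
    // The subspace for color c is the union of the points it depends on.
    runtime->create_index_space_union(ctx, ip, Point<1>((coord_t)c), deps[c]);
  }
  return ip;
}
```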

As a general rule, data in Task Bench is "fake" in that it does not encode "real" data, and the data is not consumed by any of the kernels (whether compute or memory or whatever). The data does contain information to encode where it's coming from so we can check it for correctness. But to a first approximation, you should completely separate in your mind the partitioning (which is related to the dependence pattern) and the kernels (which actually execute, but ignore all data).

Kernels are never "partitioned", they just do what they're told. So if you run a compute kernel and tell it to execute N iterations, it will always execute N iterations, regardless of the size and shape of the graph.
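
To make that concrete, here is a hedged sketch of what a compute-bound kernel amounts to (hypothetical code, not the actual kernel in the Task Bench core): the amount of work depends only on the requested iteration count, never on the size or shape of the task graph.

```c++
#include <vector>
#include <cstddef>

// Hypothetical compute-bound kernel: perform a fixed amount of
// floating-point work per iteration on a scratch buffer. The buffer's
// contents are never interpreted as "real" data by the benchmark.
void execute_compute_kernel(long iterations, std::vector<double> &scratch)
{
  for (long iter = 0; iter < iterations; ++iter) {
    for (std::size_t i = 0; i < scratch.size(); ++i) {
      scratch[i] = scratch[i] * 1.0000001 + 0.000001;
    }
  }
}
```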

Absolutely nothing in Task Bench is sensitive to the number of nodes. The graph is configured explicitly (via command line parameters) and it is up to each implementation to spread that graph out as it sees fit. But the Task Bench core (used by each implementation) is oblivious to how many nodes/cores there are or how things are parallelized. You could just as easily make a sequential version of Task Bench that executes the same thing.

Hope that helps.

AngryBear2 commented 1 year ago

Thank you. Based on your explanation, I understand that each node in the task graph executes its own copy of the same kernel, rather than a single kernel being split across all the nodes. How can I measure the data transfer time between dependent nodes?

elliottslaughter commented 1 year ago

There is a summary printed at the end which should include a bandwidth figure; but you may need to pass in the number of nodes (-nodes N) in order for it to accurately calculate the intra-node vs inter-node bandwidth.