accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Need for Speed: Make Accel-Sim FAST again #284

Open JRPan opened 5 months ago

JRPan commented 5 months ago
[Screenshot: timing profile from linux perf, 2024-02-21]

Timing profile from linux perf:

  1. cache_stats are called per-cycle by Accel-Wattch.
  2. xbar advance iterates over all input->output pairs, which is O(n^2), where n is the number of nodes. This is crazy.

These two functions add up to two-thirds of the simulation time: 28% + 21% + 17% = 66%.
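For reference, the xbar advance in point 2 boils down to a nested scan over every input/output pair. A minimal sketch of that shape, with hypothetical names rather than the actual GPGPU-Sim code:

```cpp
#include <cstddef>

// Hypothetical sketch of the quadratic advance loop: with n nodes, the
// crossbar performs n*n buffer checks every cycle, no matter how many
// packets are actually in flight.
void advance_crossbar(std::size_t total_nodes) {
  for (std::size_t out = 0; out < total_nodes; ++out) {  // output nodes
    for (std::size_t in = 0; in < total_nodes; ++in) {   // input nodes
      // Check whether input `in` has a packet destined for output `out`
      // and transfer it if the output buffer has space.
    }
  }
}
```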

Fix:

  1. Keep a global cache stat, so we don't need to check each SM every cycle. This is already done in the dev-stream-stats branch but needs some final polishing before being merged. For now, disable the per-cycle collection when Accel-Wattch is not in use (see the sketch after this list).
  2. Only check the nodes that actually have an input/output. In the worst case every node has a package and the function still needs to iterate through all nodes, but I doubt that happens often. The worst case is the same, but on average this should be much better.
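As a sketch of fix 1: the caches bump a shared accumulator as events happen, so a per-cycle consumer such as Accel-Wattch reads one struct instead of walking every SM's cache stats each cycle. All names below are hypothetical; the actual dev-stream-stats implementation may differ:

```cpp
#include <cstdint>

// Hypothetical global accumulator, updated at each cache event instead
// of being re-aggregated across all SMs once per cycle.
struct global_cache_stats {
  std::uint64_t accesses = 0;
  std::uint64_t misses = 0;
};

inline global_cache_stats g_cache_stats;

// Called from the cache model on each access.
inline void record_cache_access(bool miss) {
  ++g_cache_stats.accesses;
  if (miss) ++g_cache_stats.misses;
}
```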
JRPan commented 5 months ago

See https://github.com/accel-sim/gpgpu-sim_distribution/pull/67 for the fix.

rodhuega commented 5 months ago

Hi, I have been studying the fix. I have one suggestion and one question.

Let's start with the suggestion. Regarding the cache stats, the fix only works when the execution is not using Accel-Wattch. However, many of us need that component. I propose changing the checks you added from `if (m_config.g_power_simulation_enabled)` to `if (m_config.g_power_simulation_enabled && (((gpu_tot_sim_cycle + gpu_sim_cycle) + 1) % m_config.gpu_stat_sample_freq == 0))`. That way, the fix also works when Accel-Wattch is enabled. Notice the need for the +1, because the mcpat_cycle function is triggered after gpu_sim_cycle++. I have checked the energy report, and it provides the same outputs.
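For readability, here is the proposed guard as a block (identifiers as in GPGPU-Sim; the body is elided):

```cpp
// Collect per-SM cache stats only on the cycles Accel-Wattch samples
// them. The +1 accounts for mcpat_cycle() running after gpu_sim_cycle++.
if (m_config.g_power_simulation_enabled &&
    ((gpu_tot_sim_cycle + gpu_sim_cycle + 1) % m_config.gpu_stat_sample_freq == 0)) {
  // ... aggregate cache stats for Accel-Wattch here ...
}
```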

Moving to the question. I have tested the ICNT fix, and it gives me different results. For example, a test kernel I used to study the fix takes 2 more cycles to finish execution. I wonder if this is expected, or even if it is a better ICNT model. So, I would like to know whether this matters or not.

PS: Thank you for the fix, I really appreciate it.

JRPan commented 5 months ago

Thanks! Your solution regarding the cache stats is better. I'll include it and keep it when we add the global cache stats.

Regarding ICNT, I expected it to be the same. I only intended to reduce the loop count. Do you mean it takes 2 cycles more in total to finish the execution? Or does each inst take 2 cycles more? Anyway, I have to test it more thoroughly. That's why I marked the PR as a draft and want input from the community. Thanks again, I will test it more.

A little side talk: are you the author of this paper https://arxiv.org/pdf/2401.10082.pdf? We really like this work and it would be great if we could merge this. Is this something you may be interested in?

JRPan commented 5 months ago

Actually, I think I accidentally fixed a bug. lol. Someone correct me if I am wrong

in the original code:

```
for each output node:
    for each input node:
        if input_buffer[0] can be transferred from input node to output node:
            remove package from input buffer and insert into output buffer
```

But imagine an extreme case: input buffer 0 has 10 packages for output nodes 0 to 9. So: input_buffer[] = 0, 1, 2, ..., 9.

Then all 10 requests can be served within one cycle, which is wrong by design in iSLIP with a single iteration: each input should only be able to send 1 request each cycle.

Now in my design each output node can only accept one package per cycle.

Some background reading: http://nms.lcs.mit.edu/6829-papers/islip-ton.pdf


Again, I may be wrong. Input is highly welcome.

rodhuega commented 5 months ago

> Regarding ICNT, I expected it to be the same. I only intended to reduce the loop count. Do you mean it takes 2 cycles more in total to finish the execution? Or does each inst take 2 cycles more? Anyway, I have to test it more thoroughly. That's why I marked the PR as a draft and want input from the community. Thanks again, I will test it more.

It reports 2 more cycles in the gpu_sim_cycle stat of the first kernel of Backprop from Rodinia 2. As an effect of that, many other stats change too. This was in my simulator version, but I suppose it will be similar in the current dev version, as I have never changed the ICNT between SMs and memory partitions.

> Actually, I think I accidentally fixed a bug. lol. Someone correct me if I am wrong
>
> in the original code:
>
> ```
> for each output node:
>     for each input node:
>         if input_buffer[0] can be transferred from input node to output node:
>             remove package from input buffer and insert into output buffer
> ```
>
> But imagine an extreme case: input buffer 0 has 10 packages for output nodes 0 to 9. So: input_buffer[] = 0, 1, 2, ..., 9.
>
> Then all 10 requests can be served within one cycle, which is wrong by design in iSLIP with a single iteration: each input should only be able to send 1 request each cycle.
>
> Now in my design each output node can only accept one package per cycle.
>
> Some background reading: http://nms.lcs.mit.edu/6829-papers/islip-ton.pdf
>
> Again, I may be wrong. Input is highly welcome.

I'm no expert in ICNT, but I agree that the algorithm says each output should accept only one grant per cycle. However, I'm not sure I see where that is prevented compared to the original code. I suppose the prevention comes from changing the original body loop

```cpp
for (unsigned i = 0; i < total_nodes; ++i) {
  if (Has_Buffer_Out(i, 1)) {
    for (unsigned j = 0; j < total_nodes; ++j) {
      ....
      if (grant_cycles_count == 1)
        next_node[i] = (++node_id % total_nodes);
```

to your new loop that goes over the set, but I can't identify where. I guess it is something related to start_node, but I don't see how it is prevented that all the dests can have the same next_node[dest] in a given cycle.

> A little side talk: are you the author of this paper https://arxiv.org/pdf/2401.10082.pdf? We really like this work and it would be great if we could merge this. Is this something you may be interested in?

Yes, I'm one of the authors. I would be interested in merging that, but not right now. That arXiv PDF is a work in progress that we presented at CAMS 2023, and it is unfinished. We can discuss merging it when it is finished.

JRPan commented 5 months ago

It's prevented when building the set. https://github.com/accel-sim/gpgpu-sim_distribution/pull/67/files#diff-87ad483e1fe95a02fd73cf3251edacacbd39639eafa43b93150042ca68a0b511R190 In my code:

```
for each input_node:
    package = input_node.head()
    if package exists:
        input_set.insert(input_node)
        output_set.insert(package.output_node)
```

So I'm only checking the head of each input buffer, and both sets contain unique elements. Then, in the nested loop later, each input node and each output node can only be iterated once. This prevents the bug.

Of course it can still be improved; the temp_node is redundant. But functionally this is the same.
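To make the mechanism concrete, here is a condensed, hypothetical C++ sketch of the set-based advance. The packet and buffer types are invented for the example, and the real code in the PR also handles buffer capacity and round-robin priority:

```cpp
#include <cstddef>
#include <deque>
#include <set>
#include <vector>

struct packet {
  std::size_t dest;  // destination output node
};

// Only nodes whose input buffer has a head-of-line packet are
// considered. Because each input and each output appears in its set at
// most once, an input sends at most one packet and an output accepts at
// most one packet per cycle: the single-iteration iSLIP behavior
// discussed above.
void advance(std::vector<std::deque<packet>> &in_buffers,
             std::vector<std::deque<packet>> &out_buffers) {
  std::set<std::size_t> input_set, output_set;
  for (std::size_t in = 0; in < in_buffers.size(); ++in) {
    if (!in_buffers[in].empty()) {
      input_set.insert(in);
      output_set.insert(in_buffers[in].front().dest);
    }
  }
  for (std::size_t out : output_set) {
    for (std::size_t in : input_set) {
      if (!in_buffers[in].empty() && in_buffers[in].front().dest == out) {
        out_buffers[out].push_back(in_buffers[in].front());
        in_buffers[in].pop_front();
        input_set.erase(in);  // this input has sent its one packet
        break;                // this output has accepted its one packet
      }
    }
  }
}
```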

rodhuega commented 5 months ago

Ok, understood. I agree with you, now it is more correct.

By the way, I think these two modifications can also be applied to Vulkan-Sim, which I think is even slower than Accel-Sim. I suppose this will have a good impact on that simulator too.

JRPan commented 5 months ago

Good idea! Thanks.

rodhuega commented 4 months ago

Hi, I have tested your new version of the code, and it produces different results compared to the other two versions. I mean, the original has IPC A, the intermediate IPC B, and the current version IPC C. I don't know if you were aware of the IPC changes between the intermediate code that we were discussing some days ago and the current state.

JRPan commented 4 months ago

Thank you for pointing that out.

Yes, it should have some minor effects. The intermediate version was half-fixed. I would expect original IPC > intermediate IPC > current IPC. Is this correct?

In the intermediate version, an input node can still transfer multiple packages in some corner cases.

However, I did not have time to thoroughly test everything (correlating every benchmark in the original paper), and our cluster is scheduled for an upgrade. For now, I do not recommend using this for any results that go into a paper.

But again, thanks for the input. I'll provide a full correlation once this change is ready.

rodhuega commented 4 months ago

Ok, I understand.

Regarding the IPC performance between versions, I have the following comparison with Backprop of Rodinia 2 (gpu_tot_ipc after the first kernel, then at the end of the execution):

| Version | gpu_tot_ipc (first kernel) | gpu_tot_ipc (end) |
| --- | --- | --- |
| Initial | 361.4865 | 291.3595 |
| Intermediate | 361.7782 | 291.9215 |
| Final | 361.3072 | 291.6887 |

Until the first kernel: Intermediate > Initial > Final

In the end: Intermediate > Final > Initial

In my personal opinion, I don't care whether the IPC is greater or smaller; it will change depending on the kernel, and the change is very small. I care more about the fact that the code is now doing what is expected.

Thanks for fixing that.