bespoke-silicon-group / bsg_replicant

BSG Replicant: Cosimulation and Emulation Infrastructure for HammerBlade
BSD 3-Clause "New" or "Revised" License

Performance Problem warning when running regression tests in cosim #449

Open save-buffer opened 4 years ago

save-buffer commented 4 years ago

```
Offending '((out_credits_o === 'x) || (out_credits_o > 5'b0))'
## out of remote store credits(= 0) x,y=0,0 displaying only once (tb.card.fpga.CL.axil_to_mcl_inst.genblk4[0].mcl_endpoint_standard)
##   (this may be a performance problem; or normal behavior)
```
save-buffer commented 4 years ago

Happens at the startup of every regression test (as far as I can tell).

leonardxiang commented 4 years ago

Hi Sasha, I think this warning is not too serious. It indicates that the host is sending request packets to some node(s) at a rate faster than the node can send responses back, so the out_credits of the manycore_link drops to 0 and triggers this warning.

taylor-bsg commented 4 years ago

Hi, that is not exactly what the warning is saying. The parameter sets how many stores can be sent without hearing back a response. The further away the destination, the longer the latency to get there, so the more words you would expect to send before hearing back. The assert is saying the value is likely set too small. In cases of a congested network, increasing the value can result in more congestion since you are pushing more packets into the network. But I think generally with the host this is not a problem, since it is just one node on the network. What is the current setting of the value?
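As a concrete illustration of the credit scheme described here (a minimal Python sketch, not code from this repo): each in-flight store consumes a credit, each returning response restores one, and the sender must stall once credits hit zero, which is the condition the cosim assertion flags.

```python
# Illustrative sketch only -- not code from bsg_replicant.
class CreditLimitedSender:
    """Sender that may have at most `max_credits` stores in flight."""

    def __init__(self, max_credits=16):
        self.max_credits = max_credits
        self.credits = max_credits

    def can_send(self):
        # Zero credits means any further store must stall; this is the
        # situation the cosim assertion is warning about.
        return self.credits > 0

    def send_store(self):
        assert self.can_send(), "out of remote store credits"
        self.credits -= 1

    def receive_response(self):
        # A returning response frees a slot for another outstanding store.
        self.credits = min(self.credits + 1, self.max_credits)
```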


leonardxiang commented 4 years ago


Hi Michael, the out_credits is set to 16 now.

Yes, the longer the distance from the host to the node, the longer the request-response round-trip time, and this could cause the credits to run out.

Here is another theory: since the warning occurs at the beginning of the simulation, and we write the program tile by tile starting from (1, 0), I guess it's possible that one tile cannot respond as fast as the incoming store requests arrive?

I will post the details later...

taylor-bsg commented 4 years ago

See inline!



> Hi Michael, the out_credits is set to 16 now.

I think it’s definitely worth increasing the credits and seeing when the execution time stops improving. And also computing the bandwidth delay product; see below.

16 needs to cover the round trip delay, so it is saying that we think that the tile is less than 8 cycles away, which I think is probably not true. But you are right that if there is a bottleneck somewhere, the right number of credits could be lower.

> Yes, the longer the distance from the host to the node, the longer the request-response round-trip time, and this could cause the credits to run out.
>
> Here is another theory: since the warning occurs at the beginning of the simulation, and we write the program tile by tile starting from (1, 0), I guess it's possible that one tile cannot respond as fast as the incoming store requests arrive?

Interesting hypothesis! One way to bound the credits parameter is to look at the bandwidth × delay product of that path when the network is not loaded. The bandwidth (in words per cycle) is the rate of the narrowest bottleneck; for example, if the tile can only accept one word every other cycle, then it would be 1 word / 2 cycles. The other parameter is the round-trip delay, in cycles, of the packet going from the sender to the endpoint and the credit returning back to the sender (if you have multiple clocks, you can use ns instead of cycles). Multiply those together, and there is no benefit to setting the parameter any higher. Generally we want to avoid injecting packets into the network when they do not improve performance, because congestion will go up and performance will not improve; it may even be that we want to set the parameter lower than this number if we expect congestion in the network to be high. So for another example, if we have a bottleneck that allows one packet every hundred cycles, then 100 cycles of latency after that bottleneck, and a symmetric return path for the credit, then we would want 1/100 × [(100 + 100) × 2] = 4 credits.

In this case, even if there were a FIFO with 100 elements of space before the bottleneck, we would still see a bunch of out-of-credit warnings, and that would be totally okay: just because we can fill up that FIFO doesn't mean we should, since doing so only adds more congestion.
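A minimal sketch of this bandwidth × delay bound with the hypothetical numbers from the example above (illustrative Python, not code from this repo):

```python
# Illustrative sketch only -- not code from bsg_replicant.
def credit_bound(words_per_cycle, round_trip_cycles):
    # Credits beyond bandwidth x round-trip delay cannot improve throughput;
    # they only push more packets into an already-congested network.
    return words_per_cycle * round_trip_cycles

# Example above: bottleneck of 1 word per 100 cycles, 100 cycles of latency
# past the bottleneck, and a symmetric return path for the credit.
bandwidth  = 1 / 100             # words per cycle
round_trip = (100 + 100) * 2     # cycles, there and back
print(credit_bound(bandwidth, round_trip))   # -> 4.0 credits
```

The same arithmetic says 16 credits at one word per cycle only cover a 16-cycle round trip, i.e. roughly 8 cycles each way, which is the "less than 8 cycles away" point above.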

One interesting thing would be for @bornaehsani to make Icache write stalls take priority over synchronization stalls in the blood graph. Then we can see what the bandwidth is of that path by seeing the spacing between the writers during code loading time.


drichmond commented 4 years ago

Can we close this? I believe this warning is just an artifact of our clock-switching in cosimulation, or is there a concrete issue we want to track?

taylor-bsg commented 4 years ago

I think it’s a concrete issue, but it doesn’t need to be looked at until the next iteration of “how can we improve cosim performance?”


taylor-bsg commented 4 years ago

Maybe the next step is for Borna to make the suggested hack to the blood graphs?


drichmond commented 4 years ago

This might be solved in #503?

leonardxiang commented 4 years ago

No, the warning still exists. Maybe it will be solved after issue #510.

drichmond commented 4 years ago

Ah, yes. @leonardxiang, can you tell us how many manycore cycles it takes to inject a packet into the network?

leonardxiang commented 4 years ago

In the IO clock domain, it takes 30 cycles in total to send a packet to the network: 7.5 × 4 = 30, i.e. 7~8 cycles for each host AXIL write (4 AXIL writes make one manycore network packet). Note that the new MCL only needs 2 cycles to finish an AXIL write; the extra ~5.5 cycles per write are introduced by the host function call.

In the manycore network clock domain, one packet is injected into the network per cycle, because we boost the IO interface by 50X.
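A rough sketch of how these figures relate to the clock ratio (illustrative Python, not code from this repo): with about 30 IO cycles per packet, any IO-to-network clock ratio above 30:1 means the host produces more than one packet per network cycle, so the credits drain.

```python
# Illustrative sketch only -- not code from bsg_replicant.
# ~7.5 IO cycles per AXIL write x 4 writes per manycore packet (figures above).
IO_CYCLES_PER_PACKET = 7.5 * 4   # = 30

def packets_per_network_cycle(clock_ratio):
    """clock_ratio = IO clock frequency / network clock frequency."""
    # In one network cycle the host gets `clock_ratio` IO cycles to work with.
    return clock_ratio / IO_CYCLES_PER_PACKET

for ratio in (25, 30, 50):
    rate = packets_per_network_cycle(ratio)
    verdict = "ok" if rate <= 1.0 else "faster than the network accepts -> credits drain"
    print(f"{ratio}:1 -> {rate:.2f} packets per network cycle ({verdict})")
```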

drichmond commented 4 years ago

Sorry, I'm confused.

Are you reporting 30 cycles for the older interface?

leonardxiang commented 4 years ago

It is the new one. The older interface takes ~116 cycles.

drichmond commented 4 years ago

I'm confused about the difference between 2 cycles for a write request and 30 cycles to send a packet.

drichmond commented 4 years ago

Oh, you're saying each write request is 2 cycles? And there are 4 write requests to make a packet?

drichmond commented 4 years ago

So with a 50:1 clock ratio (what I believe is currently configured), are we injecting a packet onto the network on every network cycle?

leonardxiang commented 4 years ago

Yeah, 4 AXIL writes make a manycore network packet.

And in the network clock domain, 1 manycore packet is injected per cycle.

drichmond commented 4 years ago

With the 50:1 ratio? Can you try 30:1 and 25:1? Just curious where the break-even point is

leonardxiang commented 4 years ago

It is 30:1; any ratio greater than 30 will cause the performance-problem warning above.

drichmond commented 4 years ago

What makes you say that? I see a 50:1 ratio in cl_manycore.sv

leonardxiang commented 4 years ago

Well, I changed that and ran some IO tests, like test_manycore_credits and test_manycore_eva_read_write, and that is the observation I got.

If the throughput of the MCL is not larger than that of the network, could we expect that the IO will not run out of credits?

drichmond commented 4 years ago

Ahhh. Gotcha. Is there a difference in runtime between 30:1 and 1:1?

leonardxiang commented 4 years ago

The runtime of test_manycore_eva_read_write shows about a 1.7X slowdown if the ratio is changed from 30:1 to 1:1.

But I still think the runtime slowdown can be compensated for by removing the CDC module.

I can put together a more extensive summary like what is nicely reported in issue #510 ;)

drichmond commented 4 years ago

No, that's sufficient for now. Thanks!