Hi @jjl1075337132, there are several things to consider here.
(1) Running `printf` on every core in parallel causes contention. The way `printf` works is that each core writes to one part of the memory. If all cores try to write something at the same time, not every `printf` goes through, which is why the output from cores 0 and 1 is missing.
I suggest using only one core to do the printing.
(2) The accelerator cycle counts can differ because of memory bank contention. The ALU stalls whenever there is a bank contention, meaning that more than one accelerator accesses the same bank. For example, accesses to address 0 and address 256 go to the same bank. When this happens, one accelerator gets its data first and the other one gets it next.
Make sure to avoid bank contentions!
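To make the address 0 / address 256 example concrete, here is a minimal sketch of how an address could map to a bank. The bank count (32) and bank width (8 bytes) are assumptions chosen so that the example above works out; they are not read from the actual SNAX configuration, and `bank_of` and `same_bank` are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed memory layout (not taken from the real configuration):
 * 32 banks, each 8 bytes wide, interleaved word by word. */
#define NUM_BANKS 32u
#define BANK_WIDTH_BYTES 8u

/* Bank that a byte address falls into under the assumed interleaving. */
static uint32_t bank_of(uint32_t addr) {
    return (addr / BANK_WIDTH_BYTES) % NUM_BANKS;
}

/* True when two addresses would compete for the same bank. */
static bool same_bank(uint32_t a, uint32_t b) {
    return bank_of(a) == bank_of(b);
}
```

Under these assumptions the bank index wraps around every 32 * 8 = 256 bytes, so address 0 and address 256 both land in bank 0. Placing each core's data at offsets that fall into different banks is one way to keep the accelerators from stalling each other.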
Thank you for your answer. I understand what you mean, but it seems impossible to use a single core to print whether the other cores executed correctly and how much time their accelerators spent. Do you have any ideas?
Hi @jjl1075337132
Technically speaking, you wouldn't want to do printing in the first place. We print only to debug our systems.
The way to know whether the accelerators processed the correct data is to check that the `err` variable was not incremented.
If you look at our examples, the `err` variable is meant as an error signal. If an error occurs (e.g., an incorrect value or an incorrect sequence), we increment `err`.
Then we return `err`, which is not equal to 0 in that case. This way we can check whether an accelerator is correct or not.
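As a rough illustration of the `err` pattern described above, the sketch below compares an output buffer against expected values and counts mismatches. The function name `count_errors`, the buffer names, and the 64-bit element type are assumptions for illustration; the actual examples may structure this differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compare the accelerator output against the
 * expected (golden) values and count every mismatch. */
static int count_errors(const uint64_t *out, const uint64_t *golden, size_t n) {
    int err = 0;
    for (size_t i = 0; i < n; i++) {
        if (out[i] != golden[i]) {
            err++; /* one increment per incorrect value */
        }
    }
    return err; /* 0 means every element matched */
}
```

A program that returns this value from `main` reports success with 0 and failure with any non-zero value, so correctness can be checked without any `printf` at all.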
The problem with `printf`, really, is that it was only meant for debugging purposes. It's in fact not a good idea to put `printf` into our code when we measure performance.
If you follow the traces later (e.g., look at waveforms) you will see that the `printf` takes sooooo many cycles. So it's not meant to be there in the first place.
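One possible way to combine the points above is to have every core store its error count and cycle count in its own slot of a shared array, and let a single core print everything once all cores are done. The sketch below shows only that bookkeeping in plain C. The names `core_result_t`, `record_result`, and `report_all` are hypothetical, `NUM_CORES` is assumed to match the 4-core setup from this issue, and on the real cluster a barrier between recording and reporting would be needed, which is not shown here.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 4 /* assumed: the 4-core configuration from this issue */

/* Hypothetical per-core record; on the cluster this would live in
 * memory that the reporting core can read. */
typedef struct {
    int err;
    uint32_t cycles;
} core_result_t;

static core_result_t results[NUM_CORES];

/* Each core writes only its own slot, so no printf runs in parallel
 * and no printf sits inside the measured region. */
static void record_result(uint32_t core_idx, int err, uint32_t cycles) {
    results[core_idx].err = err;
    results[core_idx].cycles = cycles;
}

/* Called by one core only, after all cores have finished.
 * Prints one line per core and returns the total error count. */
static int report_all(void) {
    int total_err = 0;
    for (uint32_t i = 0; i < NUM_CORES; i++) {
        printf("core %u: cycles=%u errors=%d\n",
               (unsigned)i, (unsigned)results[i].cycles, results[i].err);
        total_err += results[i].err;
    }
    return total_err;
}
```

In use, each core would call `record_result(snrt_cluster_core_idx(), err, cycles)` after its accelerator finishes, and only one core would call `report_all()` afterwards. The printing then happens once, outside the timed section.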
I hope you understand what I mean.
I'm grateful for your help.
I will close this issue now 😄 feel free to ask more questions in another issue! Thanks for your interest! @jjl1075337132
1 (2).txt
I configured 4 cores + snax-alu and quadrupled the size of the original data, then I used one core + snax-alu to process one copy of the data. I found that

    if(snrt_cluster_core_idx() == 2 ){}else if(snrt_cluster_core_idx() == 3)

vs.

    if(snrt_cluster_core_idx() == 2 ){} if(snrt_cluster_core_idx() == 3)

give different output. The former only prints:

    core 0 Accelerator Done!
    Accelerator Cycles: 26
    Accelerator Cycles: 26
    Accelerator Cycles: 26
    Number of errors: 0
    Number of errors: 0
    Accelerator Cycles: 26
    Accelerator Cycles: 26
    Accelerator Cycles: 26
    Accelerator Cycles: 26

The latter will output
missing two printf outputs:

    core 0 Accelerator Done!
    core 1 Accelerator Done!