Open zhb9103 opened 1 month ago
Can you instead try the hello_world_token.c test that I think should be released with metro-mpi? There's a software bottleneck in the test itself which that test should help with in place of _many.c
Ok, I will try it, thank you very much!
I tried hello_world_token.c instead of hello_world_many.c, but I still can't see any output in fake_uart.log. I also found that all of the tracehart*.log files are empty, which suggests that no communication is happening in the test. Is there any data width requirement above 192 cores?
After a while, I can see the following information in trace_hart_5.log:
Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 37715500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 39707500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 43025500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 46561500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 49670500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 52765500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 55861500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 58970500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 62065500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 65161500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 68267500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 71365500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 74460500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 77556500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 80665500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 83760500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 86856500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 89965500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 93060500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 96156500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 99265500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 102360500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 105455500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 108551500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 111660500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 114755500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 117851500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 120960500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 124055500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 127151500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 130260500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 133355500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 136450500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 139557500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 142655500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 145750500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 148846500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 151955500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 155050500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 158146500, PC: 000000fff1010040, Cause: Ille
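For anyone triaging a log like the one above, a small parser makes it easier to see whether every exception hits the same PC. The following is only a sketch: it assumes the log file holds one `Exception @ ...` record per line in exactly the format pasted here, and the names `trace_exception_t`, `parse_trace_exception` and `count_exceptions_at_pc` are hypothetical helpers, not part of OpenPiton.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed "Exception @ ..." record from a trace_hart_*.log file.
   The record layout is assumed from the lines quoted in this thread. */
typedef struct {
    unsigned long long time;  /* simulation timestamp */
    unsigned long long pc;    /* faulting program counter */
    unsigned long long tval;  /* trap value */
    char cause[64];           /* e.g. "Illegal Instruction" */
} trace_exception_t;

/* Parse one log line; returns 1 on success, 0 if the line does not match
   (for example, a truncated record). */
int parse_trace_exception(const char *line, trace_exception_t *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line,
                   "Exception @ %llu, PC: %llx, Cause: %63[^,], tval: %llx",
                   &out->time, &out->pc, out->cause, &out->tval);
    return n == 4;
}

/* Count the records in the file at `path` whose PC equals `pc`.
   Returns -1 if the file cannot be opened. */
long count_exceptions_at_pc(const char *path, unsigned long long pc)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char line[256];
    long count = 0;
    trace_exception_t ex;
    while (fgets(line, sizeof line, f) != NULL) {
        if (parse_trace_exception(line, &ex) && ex.pc == pc)
            count++;
    }
    fclose(f);
    return count;
}
```

Calling `count_exceptions_at_pc("trace_hart_5.log", 0xfff1010040ULL)` would then report how many records repeat at that one address, which helps distinguish a hart stuck in a trap loop from scattered faults.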
I am quite intrigued by the results you are getting. I have experienced similar problems in the past. You are using the metro_mpi commit but you are not simulating with metro_mpi, right? Can you point me to which hello_world.c/hello_world_many.c you are using?
Hi @guillemlp
My replies are below:
You are using the commit of metro_mpi but you are not simulating with metro_mpi, right? ---> I am not actually using the metro_mpi project. I only changed the *_LSID width and the related values in the openpiton_dev project, using metro_mpi as a reference. With that done, I can get more than 64 cores, but I have now hit a bottleneck at 192 cores.
Can you point me to which hello_world.c/hello_world_many.c you are using? ---> I am using hello_world_many.c for the test now. I used hello_world_token.c before, but nothing was printed in fake_uart.log.
Thanks!
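As a quick sanity check on the width question, the number of bits needed to give every tile a distinct ID can be computed directly. This is arithmetic only and a sketch: `bits_for_ids` is a hypothetical helper, and nothing here says which OpenPiton parameters (the *_LSID ones or others) actually need a given width.

```c
#include <assert.h>

/* Smallest number of bits that can represent the IDs 0 .. count-1. */
unsigned bits_for_ids(unsigned count)
{
    assert(count > 0);
    unsigned bits = 0;
    /* Grow the width until 2^bits covers every ID. */
    while (bits < 32 && (1ULL << bits) < count)
        bits++;
    return bits;
}
```

By this count, 64 tiles fit in 6 bits while 65 or more need at least 7, and 192, 208 and 256 tiles all need 8. So a pure ID-width limit would not by itself separate 192 from 208.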
I retried the test with hello_world_token.c, and the behavior is the same as with hello_world_many.c.
Can you verify whether the argv variable in main is char or int? (It should be int if you are using more than 64 cores.) Have you managed to run the hello world token test correctly with 128 cores?
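One reason the element type matters: an 8-bit element silently drops the upper bits of a hart ID or hart count. The sketch below uses `uint8_t` as a stand-in for an 8-bit char so the wrap-around is well defined (plain `char` may be signed, in which case the usable range is smaller still). `as_8bit` and `fits_in_8bit` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Store a hart ID or hart count the way an 8-bit argv element would:
   only the low 8 bits survive (conversion to uint8_t wraps modulo 256). */
uint8_t as_8bit(int value)
{
    assert(value >= 0);  /* hart IDs and counts are non-negative */
    return (uint8_t)value;
}

/* Returns 1 if `value` is unchanged after a round trip through 8-bit storage. */
int fits_in_8bit(int value)
{
    return (int)as_8bit(value) == value;
}
```

For example, a hart count of 256 comes back as 0 and a hart ID of 300 comes back as 44, so any loop that compares against those values would wait on the wrong number.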
Hi @guillemlp:
Can you verify whether the argv variable in main is char or int? (It should be int if you are using more than 64 cores.) ---> I have done that. The related code is below:
syscalls.c

    int __attribute__((weak)) main(int argc, int** argv)
    {
      // single-threaded programs override this function.
      printstr("Implement main(), foo!\n");
      return -1;
    }
    ...
    // always init all threads
    void _init(int cid, int nc)
    {
      volatile static uint32_t finish_sync0 = 0;
      volatile static uint32_t finish_sync1 = 0;

      //char num[2] = {cid, nc};
      //char* argv[1] = {num};
      int num[2] = {cid, nc};
      int* argv[1] = {num};
      int ret = main(2, argv);

      ATOMIC_OP(finish_sync0, 1, add, w);
      //asm volatile ( " amoadd.w zero, %1, %0" : "+A" (finish_sync0) : "r" (1) : "memory");
      while(finish_sync0 != nc);

      // synchronize for debug output below
      while(finish_sync1 != cid);

      char buf[NUM_COUNTERS * 32] __attribute__((aligned(64)));
      char* pbuf = buf;
      for (int i = 0; i < NUM_COUNTERS; i++)
        if (counters[i])
          pbuf += sprintf(pbuf, "core %d: %s = %d\n", cid, counter_names[i], counters[i]);
      if (pbuf != buf)
        printstr(buf);

      ATOMIC_OP(finish_sync1, 1, add, w);
      //asm volatile ( " amoadd.w zero, %1, %0" : "+A" (finish_sync1) : "r" (1) : "memory");

      exit(ret);
    ...

hello_world_many.c

    int main(int argc, int** argv) {

      // synchronization variable
      volatile static uint32_t amo_cnt = 0;

      // synchronize with other cores and wait until it is this core's turn
      while(argv[0][0] != amo_cnt);

      // assemble number and print
      printf("Hello world, this is hart %d of %d harts!\n", argv[0][0], argv[0][1]);

      // increment atomic counter
      ATOMIC_OP(amo_cnt, 1, add, w);

      return 0;
    }
    ...

These changes work fine below 192 cores, but they do not work above 192 cores.
Have you managed to run the hello world token test correctly with 128 cores? ---> Yes, I have tried it, and it works well. The output in fake_uart.log is as below: 0 10 1 10 2 10 3 10 ...
Furthermore, I tested 192 cores with hello_world_token.c, and it works well too. I also tried 208 cores, and it does not work; it looks like the fetched instructions are incorrect.
Thanks!
Which NoC sizes are you playing with? I have only tried 128/256/512/1024 cores. Can you try 256, e.g. a 16x16 NoC?
Hi @guillemlp:
I have tried the 16*16 core configuration, and it doesn't work. My steps are below:
SOC building command sims -sys=manycore -x_tiles=16 -y_tiles=16 -vcs_build -ariane -config_rtl=MINIMAL_MONITORING
SOC simulation command sims -sys=manycore -vcs_run -x_tiles=16 -y_tiles=16 hello_world_token.c -ariane -config_rtl=MINIMAL_MONITORING -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000
After a while, I can see the following information in the tracehart*.log files:
Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 37715500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 39707500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 43025500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
...
It is a little difficult to find the root cause.
I will git clone the MPI project and run the test there (256 cores or above). I think I might have missed something.
Thanks!
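Typing a long -finish_mask by hand is error-prone, so it may help to generate it. The sketch below assumes the mask carries one hex digit `1` per tile, which is the pattern the masks quoted in this thread appear to follow; please check that against the OpenPiton documentation before relying on it. `build_finish_mask` is a hypothetical helper, not an OpenPiton tool.

```c
#include <assert.h>
#include <stddef.h>

/* Write "0x" followed by one hex digit '1' per tile into buf.
   ASSUMPTION: one digit per tile, as the masks in this thread suggest.
   Returns the length of the string, or 0 if buf is too small. */
size_t build_finish_mask(unsigned x_tiles, unsigned y_tiles,
                         char *buf, size_t buf_size)
{
    assert(buf != NULL);
    size_t tiles = (size_t)x_tiles * y_tiles;
    size_t needed = 2 + tiles + 1;  /* "0x" + digits + terminating NUL */
    if (buf_size < needed)
        return 0;

    buf[0] = '0';
    buf[1] = 'x';
    for (size_t i = 0; i < tiles; i++)
        buf[2 + i] = '1';
    buf[2 + tiles] = '\0';
    return 2 + tiles;
}
```

Under that assumption a 2x2 configuration gives `0x1111`, and a 16x16 configuration gives a 258-character string that can be pasted after -finish_mask.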
Hi @guillemlp:
I followed the steps below on the MPI project:
But I see the following information in the tracehart*.log files:
Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 37715500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 39707500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
Exception @ 43025500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
...
I don't know what I did wrong. Could you help me check?
Thanks!
Hi experts:
I checked out the openpiton_dev branch and changed the code with reference to the second-to-last Metro-MPI commit (https://github.com/metro-mpi/metro-mpi/commits/metro-mpi/ commit https://github.com/PrincetonUniversity/openpiton/commit/264b3659a9495ad2d52db7d74b28df962eec3f22).
I use "sims -sys=manycore -x_tiles=16 -y_tiles=12 -msm_build -ariane" generated 192 cores(or below 192 cores xy-tiles configuration), use "sims -sys=manycore -msm_run -x_tiles=4 -y_tiles=4 hello_world_many.c -ariane -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000" simulated and I can see Hello world, this is hart 0 of 16 harts! Hello world, this is hart 1 of 16 harts! Hello world, this is hart 2 of 16 harts! Hello world, this is hart 3 of 16 harts! Hello world, this is hart 4 of 16 harts! Hello world, this is hart 5 of 16 harts! Hello world, this is hart 6 of 16 harts! Hello world, this is hart 7 of 16 harts! Hello world, this is hart 8 of 16 harts! Hello world, this is hart 9 of 16 harts! Hello world, this is hart 10 of 16 harts! Hello world, this is hart 11 of 16 harts! Hello world, this is hart 12 of 16 harts! Hello world, this is hart 13 of 16 harts! Hello world, this is hart 14 of 16 harts! Hello world, this is hart 15 of 16 harts! information in the fake_uart.log
I use "sims -sys=manycore -x_tiles=16 -y_tiles=13 -msm_build -ariane" generated 208 cores(or above 192 cores xy-tiles configuration), use "sims -sys=manycore -msm_run -x_tiles=4 -y_tiles=4 hello_world_many.c -ariane -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000" simulated and waited a long time(above 12 hours), but I can't see any print in the fake_uart.log
Is there any other limitation above 192 cores?
Thanks!