PrincetonUniversity / openpiton

The OpenPiton Platform
http://www.openpiton.org

Multi-core simulation encountered 192cores bottleneck #157

Open zhb9103 opened 1 month ago

zhb9103 commented 1 month ago

Hi experts:

I checked out the openpiton_dev branch and modified the code with reference to the second-to-last Metro-MPI commit (https://github.com/metro-mpi/metro-mpi/commits/metro-mpi/, commit https://github.com/PrincetonUniversity/openpiton/commit/264b3659a9495ad2d52db7d74b28df962eec3f22).

I built a 192-core design (the same applies to x/y tile configurations below 192 cores) with:

    sims -sys=manycore -x_tiles=16 -y_tiles=12 -msm_build -ariane

and ran the simulation with:

    sims -sys=manycore -msm_run -x_tiles=4 -y_tiles=4 hello_world_many.c -ariane -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000

In fake_uart.log I can see:

    Hello world, this is hart 0 of 16 harts!
    Hello world, this is hart 1 of 16 harts!
    Hello world, this is hart 2 of 16 harts!
    Hello world, this is hart 3 of 16 harts!
    Hello world, this is hart 4 of 16 harts!
    Hello world, this is hart 5 of 16 harts!
    Hello world, this is hart 6 of 16 harts!
    Hello world, this is hart 7 of 16 harts!
    Hello world, this is hart 8 of 16 harts!
    Hello world, this is hart 9 of 16 harts!
    Hello world, this is hart 10 of 16 harts!
    Hello world, this is hart 11 of 16 harts!
    Hello world, this is hart 12 of 16 harts!
    Hello world, this is hart 13 of 16 harts!
    Hello world, this is hart 14 of 16 harts!
    Hello world, this is hart 15 of 16 harts!

I built a 208-core design (the same applies to x/y tile configurations above 192 cores) with:

    sims -sys=manycore -x_tiles=16 -y_tiles=13 -msm_build -ariane

and ran the simulation with:

    sims -sys=manycore -msm_run -x_tiles=4 -y_tiles=4 hello_world_many.c -ariane -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000

I waited a long time (more than 12 hours), but nothing was printed to fake_uart.log.
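The -finish_mask values in these commands are long runs of the hex digit 1, which are easy to mistype. Below is a minimal sketch of generating such a string instead of typing it by hand. It assumes, from the 0x1111... pattern alone, that the mask carries one `1` digit per tile; the thread does not confirm the exact semantics. `build_finish_mask` is a hypothetical helper and not part of sims.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of sims): writes "0x" followed by one '1'
 * hex digit per tile into buf. ASSUMPTION: one hex digit per tile, inferred
 * only from the 0x1111... pattern used in this thread.
 * Returns 0 on success, -1 if buf is NULL or too small. */
int build_finish_mask(unsigned x_tiles, unsigned y_tiles, char *buf, size_t buf_len)
{
    size_t tiles = (size_t)x_tiles * (size_t)y_tiles;
    size_t needed = 2 + tiles + 1;      /* "0x" + digits + terminating NUL */

    if (buf == NULL || buf_len < needed)
        return -1;

    buf[0] = '0';
    buf[1] = 'x';
    memset(buf + 2, '1', tiles);        /* one '1' digit per tile */
    buf[2 + tiles] = '\0';
    return 0;
}
```

Called with `x_tiles=16, y_tiles=13`, the helper produces a string with 208 digits after the `0x` prefix, which can then be compared against the mask actually passed on the command line.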

Is there some other limitation above 192 cores?

Thanks!

Jbalkind commented 1 month ago

Can you instead try the hello_world_token.c test, which I think should be released with metro-mpi? There's a software bottleneck in the hello_world_many.c test itself, and the token test should help with that.

zhb9103 commented 1 month ago

Ok, I will try it, thank you very much!

zhb9103 commented 1 month ago

I tried hello_world_token.c instead of hello_world_many.c, and there is still nothing printed to fake_uart.log. I also found that all of the trace_hart_*.log files are empty, which suggests there is no communication at all in the test. Is there any data width requirement above 192 cores?

After a while, I can see the following in trace_hart_5.log:

    Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 37715500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 39707500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 43025500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 46561500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 49670500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 52765500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 55861500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 58970500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 62065500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 65161500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 68267500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 71365500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 74460500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 77556500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 80665500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 83760500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 86856500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 89965500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 93060500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 96156500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 99265500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 102360500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 105455500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 108551500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 111660500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 114755500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 117851500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 120960500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 124055500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 127151500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 130260500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 133355500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 136450500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 139557500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 142655500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 145750500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 148846500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 151955500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 155050500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 158146500, PC: 000000fff1010040, Cause: Ille
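With this many near-identical trace lines, it can help to extract the timestamp and PC programmatically, for example to confirm that every exception after the first hits the same PC. The sketch below assumes each line follows exactly the `Exception @ <time>, PC: <hex>, Cause: <text>, tval: <hex>` layout shown in this log. `trace_exception_t` and `parse_exception_line` are hypothetical names, not part of OpenPiton or Ariane.

```c
#include <stdio.h>

/* Hypothetical record for one line of a trace_hart_*.log file. */
typedef struct {
    unsigned long long time;   /* simulation timestamp after '@' */
    unsigned long long pc;     /* program counter, parsed as hex */
    unsigned long long tval;   /* trap value, parsed as hex */
    char cause[64];            /* cause text, e.g. "Illegal Instruction" */
} trace_exception_t;

/* Parses one log line in the format quoted in this thread.
 * Returns 1 if all four fields were found, 0 otherwise
 * (for example on a truncated line). */
int parse_exception_line(const char *line, trace_exception_t *out)
{
    int fields = sscanf(line,
                        "Exception @ %llu, PC: %llx, Cause: %63[^,], tval: %llx",
                        &out->time, &out->pc, out->cause, &out->tval);
    return fields == 4;
}
```

Feeding each line through `parse_exception_line` and comparing consecutive `time` and `pc` values gives a quick view of whether the hart is repeatedly trapping at one address.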

guillemlp commented 1 month ago

I am quite intrigued by the results you are getting. I have experienced similar problems in the past. You are using the metro_mpi commit but you are not simulating with metro_mpi, right? Can you point me to which hello_world.c/hello_world_many.c you are using?

zhb9103 commented 1 month ago

Hi @guillemlp

My replies are below:

"You are using the commit of metro_mpi but you are not simulating with metro_mpi, right?" ---> I am not actually using the metro_mpi project. I only changed the *_LSID width and the related values in the openpiton_dev project, with reference to metro_mpi. With that done, I can get more than 64 cores, but now I have hit a bottleneck at 192 cores.

"Can you point me to which hello_world.c/hello_world_many.c you are using?" ---> I am using hello_world_many.c for the test now. I used hello_world_token.c before, but nothing was printed to fake_uart.log.

Thanks!

zhb9103 commented 1 month ago

I retried the test with hello_world_token.c, and the behavior is the same as with hello_world_many.c.

guillemlp commented 1 month ago

Can you verify whether the argv variable in main is char or int? (It should be int if you are using more than 64 cores.) Have you tried running the hello world token test correctly with 128 cores?
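A minimal sketch of the kind of truncation this question probes: if the core ID and core count are passed to main through byte-wide slots, large values silently wrap. The sketch uses `uint8_t` rather than plain `char` because the signedness of `char` is implementation-defined. Note that the wrap point here is 256; the 64-core threshold mentioned above is not explained in this thread, and this sketch does not model it. `pack_ids_u8` and `pack_ids_int` are hypothetical names for illustration only.

```c
#include <stdint.h>

/* Hypothetical illustration: store (cid, nc) in byte-wide slots.
 * Conversion to uint8_t is well-defined and reduces the value modulo 256. */
void pack_ids_u8(int cid, int nc, uint8_t out[2])
{
    out[0] = (uint8_t)cid;
    out[1] = (uint8_t)nc;
}

/* Same values stored in int-wide slots: nothing is lost. */
void pack_ids_int(int cid, int nc, int out[2])
{
    out[0] = cid;
    out[1] = nc;
}
```

With byte-wide slots a core count of 256 reads back as 0, so a turn-taking loop that compares against it would never see the expected value, whereas the int-wide version preserves both numbers.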

zhb9103 commented 1 month ago

Hi @guillemlp:

"can you verify if argv variable in main is char or int? (should be int if you are using more than 64 cores)" ---> I have done that. The related code is below:

  1. syscalls.c

    int __attribute__((weak)) main(int argc, int** argv)
    {
      // single-threaded programs override this function.
      printstr("Implement main(), foo!\n");
      return -1;
    }
    ...
    // always init all threads
    void _init(int cid, int nc)
    {
      volatile static uint32_t finish_sync0 = 0;
      volatile static uint32_t finish_sync1 = 0;

      //char num[2] = {cid, nc};
      //char* argv[1] = {num};
      int num[2] = {cid, nc};
      int* argv[1] = {num};
      int ret = main(2, argv);

      ATOMIC_OP(finish_sync0, 1, add, w);
      //asm volatile ( " amoadd.w zero, %1, %0" : "+A" (finish_sync0) : "r" (1) : "memory");
      while(finish_sync0 != nc);

      // synchronize for debug output below
      while(finish_sync1 != cid);

      char buf[NUM_COUNTERS * 32] __attribute__((aligned(64)));
      char* pbuf = buf;
      for (int i = 0; i < NUM_COUNTERS; i++)
        if (counters[i])
          pbuf += sprintf(pbuf, "core %d: %s = %d\n", cid, counter_names[i], counters[i]);
      if (pbuf != buf)
        printstr(buf);

      ATOMIC_OP(finish_sync1, 1, add, w);
      //asm volatile ( " amoadd.w zero, %1, %0" : "+A" (finish_sync1) : "r" (1) : "memory");

      exit(ret);
    ...

  2. hello_world_many.c

    int main(int argc, int** argv) {

      // synchronization variable
      volatile static uint32_t amo_cnt = 0;

      // synchronize with other cores and wait until it is this core's turn
      while(argv[0][0] != amo_cnt);

      // assemble number and print
      printf("Hello world, this is hart %d of %d harts!\n", argv[0][0], argv[0][1]);

      // increment atomic counter
      ATOMIC_OP(amo_cnt, 1, add, w);

      return 0;
    }
    ...

These changes work for configurations below 192 cores, but they do not work above 192 cores.
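The turn-taking loop in hello_world_many.c makes every hart spin on one shared counter until its own ID comes up. That may be the software bottleneck mentioned earlier in the thread, although nobody here spells it out. The sketch below is a simplified single-threaded model, not the real test: it assumes that in each round every waiting hart reads the counter once and only the matching hart proceeds. `count_polls` is a hypothetical name.

```c
#include <stdint.h>

/* Simplified model (an ASSUMPTION, not the real test): in each round every
 * hart that is still waiting reads the shared counter once; only the hart
 * whose ID equals the counter prints and increments it.
 * Returns the total number of counter reads until all harts have printed. */
uint64_t count_polls(uint32_t num_harts)
{
    uint64_t polls = 0;
    uint32_t amo_cnt = 0;                 /* models the shared atomic counter */

    while (amo_cnt < num_harts) {
        /* harts amo_cnt .. num_harts-1 are still spinning; each reads once */
        polls += (uint64_t)(num_harts - amo_cnt);
        amo_cnt++;                        /* the hart whose turn it is increments */
    }
    return polls;
}
```

Under this model the read count is N(N+1)/2, so it grows quadratically with the number of harts (136 reads for 16 harts, 18528 for 192). That would lengthen simulation time, but on its own it does not explain illegal-instruction exceptions.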

"have you tried 128 cores doing the hello world token correctly?" ---> Yes, I have tried it and it works well. The output in fake_uart.log is: 0 10 1 10 2 10 3 10 ...

Furthermore, I tried 192 cores with hello_world_token.c, and that works well too. I also tried 208 cores, which does not work; it looks like the fetched instruction is incorrect.

Thanks!

guillemlp commented 1 month ago

Which NoC sizes are you playing with? I have only tried 128/256/512/1024 cores. Can you try 256, e.g. a 16x16 NoC?

zhb9103 commented 1 month ago

Hi @guillemlp:

I have tried the 16x16 configuration, and it doesn't work. My steps are below:

  1. SoC build command: sims -sys=manycore -x_tiles=16 -y_tiles=16 -vcs_build -ariane -config_rtl=MINIMAL_MONITORING

  2. SoC simulation command: sims -sys=manycore -vcs_run -x_tiles=16 -y_tiles=16 hello_world_token.c -ariane -config_rtl=MINIMAL_MONITORING -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000

After a while, I can see the following in the trace_hart_*.log files:

    Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 37715500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 39707500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 43025500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    ...

It is a little difficult to find the root cause.

I will clone the MPI project and run the test there (256 cores or above). I think I might have missed something.

Thanks!

zhb9103 commented 1 month ago

Hi @guillemlp:

I ran the following steps on the MPI project:

  1. sims -sys=manycore -x_tiles=16 -y_tiles=16 -vcs_build -ariane -config_rtl=MINIMAL_MONITORING
  2. sims -sys=manycore -vcs_run -x_tiles=16 -y_tiles=16 hello_world_token.c -ariane -config_rtl=MINIMAL_MONITORING -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000

But I still see the following in the trace_hart_*.log files:

    Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 37715500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 39707500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    Exception @ 43025500, PC: 000000fff1010040, Cause: Illegal Instruction, tval: 0000000000000000
    ...

I don't know what I did wrong. Could you help me check?

Thanks!