avadhpatel / marss

PTLsim and QEMU based Computer Architecture Research Simulator
http://www.marss86.org

Job balance among cores #51

Open sunez opened 7 years ago

sunez commented 7 years ago

Thank you for your efforts on this simulator; it is very useful. However, I found that jobs are not distributed evenly among cores. For example, when I run 4 threads on 4 cores, 1 or 2 cores are often not used to run my program. If I am lucky, all 4 cores are used, but that does not happen often. Could you tell me what I missed, why this happens, and how I can fix it?

One example follows. As you can see, ooo_0 didn't run the user program at all.

user.base_machine.ooo_0_0.thread0.commit.ipc = 0
user.base_machine.ooo_1_1.thread0.commit.ipc = 0.634201
user.base_machine.ooo_2_2.thread0.commit.ipc = 0.648642
user.base_machine.ooo_3_3.thread0.commit.ipc = 0.639732

kernel.base_machine.ooo_0_0.thread0.commit.ipc = 0.0147105
kernel.base_machine.ooo_1_1.thread0.commit.ipc = 0.173759
kernel.base_machine.ooo_2_2.thread0.commit.ipc = 0.565157
kernel.base_machine.ooo_3_3.thread0.commit.ipc = 0.173546

total.base_machine.ooo_0_0.thread0.commit.ipc = 0.0147105
total.base_machine.ooo_1_1.thread0.commit.ipc = 0.502229
total.base_machine.ooo_2_2.thread0.commit.ipc = 0.641751
total.base_machine.ooo_3_3.thread0.commit.ipc = 0.503832
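For anyone comparing several runs, a small script can flag the cores that never committed user instructions. This is only a sketch: it assumes the statistics appear as `<mode>.base_machine.<core>.thread0.commit.ipc = <value>` text exactly as quoted above, and the helper names `parse_ipc` and `idle_user_cores` are made up for illustration, not part of MARSS.

```python
import re
from collections import defaultdict

# Matches entries such as
# "user.base_machine.ooo_1_1.thread0.commit.ipc = 0.634201".
# The pattern is an assumption based on the output quoted in this thread.
STAT_RE = re.compile(
    r"(?P<mode>user|kernel|total)\.base_machine\.(?P<core>ooo_\d+_\d+)"
    r"\.thread0\.commit\.ipc\s*=\s*(?P<ipc>[0-9]*\.?[0-9]+)"
)

def parse_ipc(text):
    """Return {core: {mode: ipc}} for every commit.ipc entry found in text."""
    stats = defaultdict(dict)
    for match in STAT_RE.finditer(text):
        stats[match.group("core")][match.group("mode")] = float(match.group("ipc"))
    return dict(stats)

def idle_user_cores(stats, threshold=0.0):
    """Return the cores whose user-mode IPC is at or below the threshold."""
    return sorted(
        core for core, modes in stats.items()
        if modes.get("user", 0.0) <= threshold
    )

# The values reported in this issue.
SAMPLE = """
user.base_machine.ooo_0_0.thread0.commit.ipc = 0
user.base_machine.ooo_1_1.thread0.commit.ipc = 0.634201
user.base_machine.ooo_2_2.thread0.commit.ipc = 0.648642
user.base_machine.ooo_3_3.thread0.commit.ipc = 0.639732
kernel.base_machine.ooo_0_0.thread0.commit.ipc = 0.0147105
kernel.base_machine.ooo_1_1.thread0.commit.ipc = 0.173759
kernel.base_machine.ooo_2_2.thread0.commit.ipc = 0.565157
kernel.base_machine.ooo_3_3.thread0.commit.ipc = 0.173546
total.base_machine.ooo_0_0.thread0.commit.ipc = 0.0147105
total.base_machine.ooo_1_1.thread0.commit.ipc = 0.502229
total.base_machine.ooo_2_2.thread0.commit.ipc = 0.641751
total.base_machine.ooo_3_3.thread0.commit.ipc = 0.503832
"""

if __name__ == "__main__":
    print(idle_user_cores(parse_ipc(SAMPLE)))
```

On the numbers above, only `ooo_0_0` is reported as idle in user mode.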

sunez commented 7 years ago

By the way, if this is an internal MarssX86 bug, it is a really critical one for a simulator. No work done with MarssX86 could be trusted if it used a multi-core configuration (and what is not multi-core nowadays?). What I experienced is that the behavior is very unpredictable: sometimes 2 cores run, sometimes 3. In the worst case, it seems that none of the cores executes the user program, so the simulation just stalls (it is not a deadlock) and never finishes.

fitzfitsahero commented 7 years ago

I'm going to need way more information if you're looking for help.

What benchmark did you run? What configuration? For how long did you run the benchmark?

It is very possible that you never got out of the initialization phase.

sunez commented 7 years ago

Thank you for your response. The benchmark is one I wrote myself: a tree data structure with inserts and deletes. I built the simulator with 4 OOO cores (c=4) and DRAMSim2. L1 and L2 are private, and L3 is a shared cache with a split bus. The program creates a checkpoint at its start. From the checkpoint, the program runs for about 30 minutes (including 1~2 minutes of warm-up before MarssX86 starts). Please let me know if you need more information.

By the way, how long does it take to get out of the initialization phase?

fitzfitsahero commented 7 years ago

Does your benchmark use pthreads or openmp or some other threading library?

sunez commented 7 years ago

Yes, it uses pthreads, and a mutex protects every access to shared data. Could that be the problem? If so, could you recommend an alternative?

fitzfitsahero commented 7 years ago

We have images with PARSEC and SPLASH; I would suggest you try them first.

I have run multicore simulations that effectively use all of the cores and get the expected IPCs.

sunez commented 7 years ago

Then my question is: how many threads do you run in your tests? As far as I know, PARSEC creates quite a number of threads. Do you know the details?

fitzfitsahero commented 7 years ago

I run as many threads as I have cores.

sunez commented 7 years ago

I just checked the posted PARSEC image, and it seems to use thread affinity. In MarssX86, if the threads use thread affinity, only one simulation runs successfully and all the others do not make progress (although they look like they do). I previously tried thread affinity in my own benchmark as well and already ran into this issue; I used the thread-affinity facilities provided by pthreads, by the way. Running a single simulation instance is also not a workable configuration, because I have to run multiple simulations; otherwise they would take days to complete. Could you look into this issue and suggest how to solve it?
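For reference, this is roughly what "one thread pinned per core" means. The benchmark discussed here is a pthreads program, so the sketch below is not its code; it is a Python illustration of the same Linux affinity mechanism, assuming a Linux guest where `os.sched_setaffinity` is available. The names `pinned_worker` and `run_pinned` are hypothetical.

```python
import os
import threading

def pinned_worker(cpu, results, index):
    """Pin the calling thread to one CPU and record the resulting affinity."""
    try:
        # On Linux, pid 0 refers to the calling thread.
        os.sched_setaffinity(0, {cpu})
        results[index] = os.sched_getaffinity(0)
    except (AttributeError, OSError):
        # Platform without affinity support, or the request was refused.
        results[index] = None

def run_pinned(num_threads):
    """Start num_threads workers, spreading them over the allowed CPUs."""
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
    else:
        allowed = [0]  # assumption: fall back to CPU 0 on other platforms
    results = [None] * num_threads
    threads = []
    for i in range(num_threads):
        cpu = allowed[i % len(allowed)]  # round-robin over available CPUs
        t = threading.Thread(target=pinned_worker, args=(cpu, results, i))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    for i, affinity in enumerate(run_pinned(4)):
        print(f"thread {i}: affinity = {affinity}")
```

Printing each thread's affinity from inside the guest is one way to confirm that the pinning request was accepted before looking at the per-core IPC statistics.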