CMUAbstract / POPT-CacheSim-HPCA21


Pin caused signal 11 #1

Closed · Irr-free closed this issue 3 years ago

Irr-free commented 3 years ago

Sorry to disturb you, but when I was using Pin-2.14 (the version offered in download_pin.py) to record the results, I hit the error "C: Tool (or Pin) caused signal 11 at PC 0x5764b15037". It only occurred when I used the policies "popt-8b" and "opt-ideal" to run the graph "hugebubbles-00020"; the results for the "lru" and "drrip" policies were correct. I then looked over the two .dat files and found that the "[LLC-STAT] Total Misses" value in out_pr_hugebubbles-00020_popt-8b_popt.dat is 0, while the value in out_pr_hugebubbles-00020_opt-ideal_opt-ideal.dat simply equals the one I got with the other two policies, "lru" and "drrip". I don't know whether this is a problem with the graph "hugebubbles-00020" or with the Pin tool.

bvignesh commented 3 years ago

Hmmm, this is interesting -- were you able to test the pintools on a different input graph? I remember running these simulators with the hugebubbles-00020 graph because it was one of the inputs we used in the paper.

Can you please try to run the simulators under gdb and report what you find? In case you haven't debugged a pintool before, here are the steps I would take:

  1. Compile the pintool without any optimizations. If I remember correctly, setting the DEBUG environment variable to 1 and running make should compile without optimizations (see the sketch after this list).
  2. Then follow the steps in the Pin manual over here: pin-2.14-debugging
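
A minimal sketch of step 1, assuming the Makefile honors a DEBUG=1 environment variable as described above (the variable name and the clean target are taken from the comment, not verified against the repo):

```python
# Rebuild the pintool without optimizations by exporting DEBUG=1 and
# re-running make -- a sketch, not the repo's official workflow.
import os
import subprocess

env = dict(os.environ, DEBUG="1")  # assumption: the Makefile checks DEBUG
subprocess.run(["make", "clean"], check=True, env=env)  # assumes a clean target
subprocess.run(["make"], check=True, env=env)  # unoptimized build for gdb
```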
Irr-free commented 3 years ago

Thanks very much, professor. I'm reproducing the experiments in your paper. Hah, I found another interesting result. When I use two other graphs, "Long_Coup_dt6" (https://suitesparse-collection-website.herokuapp.com/Janna/Long_Coup_dt6) and "in-2004" (https://suitesparse-collection-website.herokuapp.com/LAW/in-2004), then no matter which policy is used, the ratio between that policy's final result and the result obtained with the lru policy is the same. Here are my results: https://imgur.com/a/xrsCtlw

So, is the data I produced wrong, or is this the expected result? Woo... I think some mistake of mine caused this result (face with tears of joy). I will run the simulators under gdb and report what I find next.

bvignesh commented 3 years ago

Hi, there is an explanation for the results you are seeing with the two new graphs -- the structure and/or ordering of these graphs already provides good cache locality. For the Long_Coup graph, if you look at the adjacency matrix you can see that all the non-zero elements lie along the diagonal (this ensures great spatial and temporal reuse, since the neighborhoods of consecutive vertices have high overlap). The in-2004 input graph comes from the LAW dataset, and the LAW authors use a fairly sophisticated reordering mechanism (I think it is called layered label propagation) which ensures that their graphs have high cache locality. So in a nutshell, if the graph's structure/vertex-ordering already provides great cache locality, then there is little headroom for improvement through better cache replacement.

If you try to randomize both these graphs (if you look at the download_and_build_graphs.py script you will find the randomizer app we use), you should be able to see cache miss reductions with P-OPT and Optimal replacement.
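
For intuition (this is not the repo's randomizer app, just the general technique): relabeling vertices with a random permutation destroys whatever locality the original ordering provided. A sketch over a hypothetical plain edge-list file:

```python
# Relabel every vertex with a random permutation so that consecutive vertex
# IDs no longer have overlapping neighborhoods -- the general idea behind
# randomizing a graph's ordering. Assumes a whitespace-separated edge list.
import random

def randomize_edge_list(in_path, out_path, num_vertices):
    perm = list(range(num_vertices))
    random.shuffle(perm)  # random bijection: old vertex ID -> new vertex ID
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            u, v = map(int, line.split()[:2])  # ignore any weight column
            fout.write(f"{perm[u]} {perm[v]}\n")
```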

I also tried to run the hugebubbles-00020 graph with the popt simulator and I was able to get valid results on my system (Debian stretch with kernel 3.16). Note that the .dat files have two sets of entries for [LLC-STAT] Total Misses = ... (the first entry is the actual value from the simulation and the second will always be 0). This is because of an experimental feature, and I will clean the results up soon to report just one set of LLC stats.
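
Until that cleanup lands, a small helper that keeps only the first entry when post-processing results (the exact line format is an assumption based on the description above):

```python
# Return the first "[LLC-STAT] Total Misses" value in a .dat file; the
# second occurrence is always 0 per the comment above, so we ignore it.
def total_misses(dat_path):
    with open(dat_path) as f:
        for line in f:
            if "[LLC-STAT]" in line and "total misses" in line.lower():
                return int(line.split("=")[-1].strip())
    return None  # stat line not found
```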

bvignesh commented 3 years ago

Please let me know what you find with gdb -- this could be a system-specific bug but I would like to resolve it if possible

Irr-free commented 3 years ago

Hi professor. I have tried running the simulators under gdb and here is what I found. When I attached gdb to Pin and then ran run_cache_sims.py with the policy "p-opt" and the graph "hugebubbles-00020", I got the error:

Program received signal SIGSEGV, Segmentation fault.
LLC::reportEvictionReasons (this=0x7f0c2efd8e98 <cache+216>) at llc.cpp:408
408    evictCtr += m_eviction_reason[tid][dEvictee][dEvicter];

Then I downgraded my Debian kernel to 3.16 and tried the application "cc_sv", and I got the same error. I don't know if this helps you.

During the experiment, I also ran into another situation that left me dumbfounded: my graph hugebubbles-00020 has 0 edges!!! God... Jesus... Shouldn't it have 63.58M? But I did use the randomizer app to produce hugebubbles-00020.sg. Excuse me, professor, did I get the correct graph? And if not, why would I get such a result? 🤔

But anyway, I learned how to debug a pintool this time. Thanks a lot again!

bvignesh commented 3 years ago

Looking at the gdb error message, it appears that the error might be specific to the cc_sv application. Can you please verify whether the simulations work with the PageRank (pr) application? (In my previous message, when I mentioned that the simulations were successful, I was referring particularly to the pr application.) I will take a look at the cc_sv application in the next couple of days and try to resolve the bug. If you could share the dEvictee and dEvicter values, that would also be helpful. The segfault is probably because one of these values is outside the legal range (0 to m_numMainDataTypes).
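
One way to capture those values without rebuilding anything is a tiny gdb Python script (a sketch, assuming tid, dEvictee, and dEvicter are locals visible at llc.cpp:408, as the backtrace above suggests):

```python
# Run inside gdb (e.g. "source trace_eviction.py" after loading the pintool).
# Logs the indices every time llc.cpp:408 is reached, so the out-of-range
# value is visible when the SIGSEGV finally fires.
import gdb

class EvictionBP(gdb.Breakpoint):
    def stop(self):
        tid = gdb.parse_and_eval("tid")
        evictee = gdb.parse_and_eval("dEvictee")
        evicter = gdb.parse_and_eval("dEvicter")
        print(f"tid={tid} dEvictee={evictee} dEvicter={evicter}")
        return False  # don't pause here; gdb still stops on the SIGSEGV itself

EvictionBP("llc.cpp:408")
```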

Sometimes, when available memory is low, the simulation can end with 0 edges (but the correct number of vertices) reported. This happens when a memory allocation during CSR construction fails, so you should check whether you have sufficient memory to run large graphs. If, on the other hand, both the number of edges and the number of vertices are 0, then you probably have a broken input graph.
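
A quick back-of-the-envelope check can tell you whether an allocation failure is plausible (the element sizes here are assumptions, not the simulator's actual types: 8-byte offsets, 4-byte neighbor IDs):

```python
# Rough lower bound on the memory a CSR build needs: an offsets array of
# (V + 1) entries plus one neighbor entry per directed edge.
def csr_bytes(num_vertices, num_edges_directed):
    return (num_vertices + 1) * 8 + num_edges_directed * 4

# Plug in the graph's vertex count and the 63.58M edges mentioned above,
# and compare the result against free memory before running the simulation.
```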

bvignesh commented 3 years ago

Closing this issue due to inactivity... Feel free to reopen if there are any unresolved issues.