accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Issues with Trace Generation on V100: Persistent Zero Hits and Failures #263

Closed (jntm closed this issue 9 months ago)

jntm commented 9 months ago

I've been running into problems after generating traces on a V100. Specifically, the 'Total page' count consistently shows as 1, and both the L2D hit and miss rates are zero, among other issues. Interestingly, these issues do not arise when I use the officially provided traces.

Suspecting that the trace generation process was at fault, I followed the tutorial steps again and also switched to a different GPU. However, the problem persists. My system environment is Ubuntu 20.04, CUDA-11.6, and GCC-9.4.0.

I'm not sure whether the issue is related to CUDA version compatibility, driver problems, or compiler compatibility.

JRPan commented 9 months ago

Hi,

What exactly is "Total page"? I don't think we have such stats. And what is the application you are running? Could it be the case that your application has no locality at all? Which branch and commit are you on? We had a bug in the tracer that caused all LD addresses to be 0x0. Could you please check your traces to ensure all LD instructions are correct?

Thanks

jntm commented 9 months ago

Thank you for your response.

Yes, "Total page" is a new stat that we added ourselves. I am using rodinia2.0-ft for the BFS and backprop applications. This code project was given to me by my senior, and it works well in his environment. Could you guide me on how to check the traces to ensure that all LD instructions are correct? I am a beginner and not yet very clear about this process.

I'm sorry for bothering you again.

JRPan commented 9 months ago

My trace is under:

hw_run/traces/device-0/11.0/backprop-rodinia-2.0-ft/4096___data_result_4096_txt/traces/kernel-1.traceg

Your directory might be a little bit different. grep the traceg file, for example: grep LD kernel-1.traceg. It would also be helpful to post the first several lines of the traceg file.

Make sure all generic loads and global loads, such as LD and LDG, have valid addresses. It's okay if shared memory loads (LDS) have null addresses.
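The check above can also be scripted. Below is a minimal Python sketch, not part of Accel-Sim. It assumes each instruction line is whitespace-separated and shaped like `0090 00010001 1 R11 LDG.E.SYS 1 R8 4 2 0x0 0`, with addresses appearing as `0x`-prefixed tokens after the opcode. The function names are hypothetical, and other tracer versions may lay out the fields differently.

```python
from typing import Iterable, List, Tuple

# Opcodes treated as generic/global loads. LDS (shared memory) is skipped
# on purpose, since null addresses are acceptable there.
CHECKED_LOADS = ("LD", "LDG")


def is_checked_load(token: str) -> bool:
    """True for LD / LDG opcodes such as 'LDG.E.SYS'; False for LDS, registers, etc."""
    return token.split(".")[0] in CHECKED_LOADS


def find_zero_address_loads(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) for LD/LDG lines whose addresses are all zero."""
    suspicious = []
    for lineno, line in enumerate(lines, start=1):
        tokens = line.split()
        # Locate the opcode token; skip lines without an LD/LDG opcode.
        idx = next((i for i, t in enumerate(tokens) if is_checked_load(t)), None)
        if idx is None:
            continue
        # Assumption: addresses are the '0x'-prefixed tokens after the opcode.
        addrs = [t for t in tokens[idx + 1:] if t.lower().startswith("0x")]
        if addrs and all(int(a, 16) == 0 for a in addrs):
            suspicious.append((lineno, line.strip()))
    return suspicious
```

To run it on a real trace, open the file and pass the file object, e.g. `find_zero_address_loads(open("kernel-1.traceg"))`. If nearly every LD/LDG line is reported, the trace likely has the all-zero address problem described in this thread.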

If you are sure you only see zero-address loads in the traces you generated yourself, but not in the traces we provided, then something is probably wrong with the trace generation process. You can also diff your traces against the ones we provided.
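For the diff suggestion, comparing only the load lines keeps the output readable. The following is a rough Python sketch using the standard `difflib` module. The helper names are hypothetical, and it assumes the same whitespace-separated line layout as the sample line `0090 00010001 1 R11 LDG.E.SYS 1 R8 4 2 0x0 0`. Addresses can legitimately differ between machines and runs, so the point is to spot a pattern (real addresses on one side, 0x0 on the other), not to expect an empty diff.

```python
import difflib
from typing import Iterable, List


def load_lines(lines: Iterable[str]) -> List[str]:
    """Keep only lines that contain an LD or LDG opcode token."""
    kept = []
    for line in lines:
        tokens = line.split()
        if any(t.split(".")[0] in ("LD", "LDG") for t in tokens):
            kept.append(line.strip())
    return kept


def diff_loads(reference: Iterable[str], mine: Iterable[str], limit: int = 20) -> List[str]:
    """Unified diff of the LD/LDG lines of two traces, truncated to `limit` lines."""
    diff = difflib.unified_diff(
        load_lines(reference),
        load_lines(mine),
        fromfile="reference",
        tofile="mine",
        lineterm="",
        n=0,  # no context lines, only the changed load lines
    )
    return list(diff)[:limit]
```

Pass the two open trace files (the provided one and your own) and print the returned lines one per row.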

And here is the issue https://github.com/accel-sim/accel-sim-framework/issues/127 that I was talking about. Take a look and see if this is your problem. If yes, pull the latest dev and you should be fine. Or you can just fix it manually if you prefer.

Thanks

jntm commented 9 months ago

0090 00010001 1 R11 LDG.E.SYS 1 R8 4 2 0x0 0
0170 ffffffff 1 R12 LDG.E.SYS 1 R2 4 1 0x0 0

Thanks. I only see zero-address loads in my traces, but not in yours. I will take a look at issue 127. Thanks again for your help.