accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Segfault when evaluating a PyTorch file using GPGPU-Sim #256

Open Wen-Tian-Pineapple opened 11 months ago

Wen-Tian-Pineapple commented 11 months ago

[image] [image] When evaluating a Python program I ran into the error above; the last line of the GPGPU-Sim output is just the thread block. I used PyInstaller to compile the Python file and ran the compiled file from the "dist" directory. Could that be causing the problem? I'm using CUDA 11.01 and gcc 7.3.1 on CentOS 7. Does anybody have an idea what the issue might be?

JRPan commented 11 months ago

There is a segfault. You can run this directly or in gdb to see which line caused the issue.

Thanks

Wen-Tian-Pineapple commented 11 months ago

Thanks, I'm currently trying to debug it with gdb. I also wanted to mention that I did the same thing as #238: I was running everything in the Docker image provided in this repo and hit a similar segmentation fault when processing kernel-17.traceg. [image] [image]

JRPan commented 11 months ago

Yeah, that's fine. If you can tell me the exact line that segfaults, I can provide some hints and help you narrow down the issue.

Wen-Tian-Pineapple commented 11 months ago

Thanks for the reply. The problem is shown below. [image]

Wen-Tian-Pineapple commented 11 months ago

The segfault happens on the DEPBAR instruction, specifically at "Check for the case that the LDGSTSs monitored have finished when encountering the DEPBAR instruction". Maybe the newly added LDGSTS support is buggy?

JRPan commented 11 months ago

@Connie120

tyhiwzm commented 11 months ago

I encountered the same error on the dev version while using the parboil-sad workload.

JRPan commented 11 months ago

Thank you for the info.

Unfortunately, we have a major conference deadline approaching and won't be able to work on this soon.

You are welcome to look into it; we are happy to accept any fix.

I suggest you check out a commit right before the LDGSTS merge and use that for now.

Thanks!

JRPan commented 11 months ago

@Wen-Tian-Pineapple What workload is this?

You are using the TITANV config, which does not have the DEPBAR feature. So possibly we did not disable this correctly on old configs that lack the feature.

Please use https://github.com/accel-sim/gpgpu-sim_distribution/tree/53e99da4d21eacbf103ba55bcc9cb6e05219cb91 and https://github.com/accel-sim/accel-sim-framework/tree/241762826c193e6589ea9959bd074d94c826bc15 instead

@tyhiwzm Which config are you using?

Thanks!

Wen-Tian-Pineapple commented 11 months ago

Thanks for the reference; it's working fine now. BTW, I'm using the titanX/titanV config.

tyhiwzm commented 11 months ago

Thank you for your reply. I'm using the A100 config, just like https://github.com/accel-sim/accel-sim-framework/issues/138. The machine I am actually using is also an A100. I figured out what's causing the problem: the DEPBAR instruction shows up in the trace I captured, but there is no LDGSTS instruction. As a result, evaluating m_warp[warp_id]->m_ldgdepbarbuf[i].size() in shader_core_ctx::issue_warp fails, since m_ldgdepbarbuf is empty.
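
The failure mode described above (indexing into an empty buffer) can be illustrated with a minimal, standalone C++ sketch. This is not the GPGPU-Sim code and not a confirmed fix: `WarpState` and `ldgsts_group_done` are hypothetical names, the element type of `m_ldgdepbarbuf` is assumed to be a vector of vectors, and treating a missing group as "already finished" is an assumption about the desired behavior.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-warp state; NOT the real GPGPU-Sim class.
struct WarpState {
    // Assumed layout: one inner vector per group of outstanding LDGSTS ops.
    std::vector<std::vector<int>> m_ldgdepbarbuf;
};

// Returns true when group `i` has nothing pending.
// The bounds check comes first, so a DEPBAR seen without any prior LDGSTS
// (empty outer buffer) never evaluates m_ldgdepbarbuf[i] out of range.
bool ldgsts_group_done(const WarpState& warp, std::size_t i) {
    if (i >= warp.m_ldgdepbarbuf.size()) {
        return true;  // assumption: no tracked group means nothing to wait for
    }
    return warp.m_ldgdepbarbuf[i].empty();
}
```

With an empty `m_ldgdepbarbuf`, `ldgsts_group_done(warp, 0)` returns true instead of reading past the end of the vector, which is undefined behavior in C++ and a plausible source of the segfault reported here.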

Wen-Tian-Pineapple commented 10 months ago

BTW, it would be great to know when this issue is fixed so I can download the newest version of the dev branch.