sinharudraneel closed this 3 months ago
Can you share the trace file in some way?
I have seen this before in some reduce kernels (yours is a reduce kernel as well, so this may be the same issue).
In the last several lines of the trace, there is probably an EXIT with mask FFFFFFFF, which means that all threads within the warp have exited. However, there will be some trace lines after that with a mask of 00000001, which means that one thread is still active.
This is an NVBit issue, which we have also reported here: https://github.com/NVlabs/NVBit/issues/122. For now, you can either remove the assert or manually delete the trace lines after the exit.
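To find the lines to delete, a small script along these lines might help. It is only a sketch, not part of Accel-Sim or NVBit: the function name `find_post_exit_lines` is made up, the sample lines other than the EXIT line are invented for illustration, and it assumes each instruction line is whitespace-separated with the PC first, the hex active mask second, and the opcode somewhere after the third field. It expects the instruction lines of a single warp only, and it does not adjust any per-warp instruction count the trace file may carry.

```python
def find_post_exit_lines(inst_lines):
    """Return indices of instruction lines that follow the EXIT after which
    no thread of the warp should still be active.

    Assumption: each line looks like "<pc> <hex mask> <num dests> ... <opcode> ...".
    """
    seen = 0    # union of every active mask observed so far
    exited = 0  # threads that have executed EXIT
    for i, line in enumerate(inst_lines):
        fields = line.split()
        if len(fields) < 4:
            continue  # too short to be an instruction line
        mask = int(fields[1], 16)
        seen |= mask
        if any(tok == "EXIT" for tok in fields[3:]):
            exited |= mask
            # Every thread seen so far has exited: anything later is suspect.
            if seen != 0 and exited == seen:
                return list(range(i + 1, len(inst_lines)))
    return []


# Invented sample in the same shape as a trace line; only the opcode and mask matter here.
sample = [
    "ba10 000000ff 0 NOP 0 0",
    "ba20 000000ff 0 EXIT 0 0",
    "ba30 00000001 0 NOP 0 0",
]
print(find_post_exit_lines(sample))
```

An EXIT with mask 00000000 is ignored by this check, since it retires no threads.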
Thanks
Sure! The traces were generated in this folder. There are 131 trace files for the program, and I have not been able to go through all of them yet. The ones I did go through did not seem to contain any trace lines after an EXIT with the mask you referred to, but I did find two EXIT statements with a mask of 00000000 (lines 146 and 152) in this trace file. I will keep looking, though; what you mentioned is probably the case.
Thank you!
It is kernel-7. https://github.com/sinharudraneel/dp-performance-accel-sim/blob/week3-traces/simpletorch3-traces/traces/kernel-7.traceg#L149C1-L149C25
The block size is 8, so only eight threads are active and the mask is 000000ff. At:
ba20 000000ff 0 EXIT 0 0
all threads have exited, but one thread is still active after ba20.
Currently we are unable to fix this. This is an NVBit problem. Ignoring it for now won't harm too much.
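As a quick check of the mask arithmetic above, here is a minimal sketch. The helper names `full_mask` and `active_threads` are made up for illustration, and the sketch assumes the block fits in a single 32-thread warp with threads occupying the low bits of the mask.

```python
def full_mask(num_threads):
    """Hex mask with the lowest num_threads bits set (num_threads <= 32)."""
    return format((1 << num_threads) - 1, "08x")


def active_threads(mask_hex):
    """Number of threads marked active in a hex mask."""
    return bin(int(mask_hex, 16)).count("1")


print(full_mask(8))                # 000000ff
print(full_mask(32))               # ffffffff
print(active_threads("00000001"))  # 1
```

So an EXIT with mask 000000ff retires all eight threads of the block, and a later line with mask 00000001 claims that one of them is still running.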
Ah I see, thank you!
I have been trying to run a simple PyTorch neural network with one linear layer, which trains on random data, on the SASS-driven version of Accel-Sim as a test for a larger project. I am able to generate the traces properly from this program, but when I try to run the traces through the SASS-driven mode of Accel-Sim, I get this error. Essentially, it is an assert failure for `active.any() == false` in shader.cc, and I do not completely understand the root of the problem. Is this an error because of unsupported operations running on Accel-Sim, have I written my program wrong (although it runs just fine as it is on the Tesla V100 that I have been using), or is it a bug in the code? Just out of curiosity, I commented out the assert statement and ran the traces through the simulator again, and it passed the tests. I understand that I am probably tampering with an important assertion check that should not be commented out, but could someone let me know what the actual cause of the error might be?