accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

SASS driven mode failing for PyTorch traces #306

Closed sinharudraneel closed 3 months ago

sinharudraneel commented 3 months ago

I have been trying to run a simple PyTorch network (a single linear layer trained on random data) in the SASS-driven mode of Accel-Sim as a test for a larger project. Trace generation works, but when I run the traces through the SASS-driven mode I get the error below: an assertion failure for `active.any() == false' in shader.cc. I do not fully understand the root cause. Is it caused by operations that Accel-Sim does not support, is my program wrong (it runs fine on the Tesla V100 I have been using), or is it a bug in the simulator? Out of curiosity I commented out the assert statement and ran the traces through the simulator again, and the run passed. I realize I am probably disabling an important check that should stay in place, so could someone tell me what the underlying error might be?

Sleeping for 30s                                                                                                                                                                                                      
Calling job_status.py                                                                                                                                                                                                 
Using logfiles ['/root/accel-sim-framework/util/job_launching/../job_launching/logfiles/sim_log.simpletorch-test.24.06.07-Friday.txt']                                                                                
procman.id      Node                            App                     AppArgs                 Version                 Config          RunningTime     Mem     JobStatus                       Basic GPGPU-Sim Stats 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------                                                                                                                                                                                                               
11              156211a02f7e                    simpletorch             NO_ARGS                 simpletorch.accelsim    QV100-SASS                    0 0       ABORTED                         SIMRATE_IPS=62 K     S
IM_TIME=21 sec (21 sec) TOT_IPC=38      TOT_INSN=1 M    TOT_CYCLE=34 K                                                                                                                                                
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------                                                                                                                                                                                                               
failed job log written to /root/accel-sim-framework/util/job_launching/../job_launching/logfiles/failed_job_log_sim_log.simpletorch-test.24.06.07-Friday.txt                                                          
Passed:0/1, No error:0/1, Failed/Error:1/1, Running:0/1, Waiting:0/1                                                                                                                                                  
Contents /root/accel-sim-framework/util/job_launching/../job_launching/logfiles/failed_job_log_sim_log.simpletorch-test.24.06.07-Friday.txt:                                                                          
11              156211a02f7e                    simpletorch             NO_ARGS                 simpletorch.accelsim    QV100-SASS                    0 0       ABORTED                         SIMRATE_IPS=62 K     S
IM_TIME=21 sec (21 sec) TOT_IPC=38      TOT_INSN=1 M    TOT_CYCLE=34 K                                                                                                                                                

**********************************************************                                                                                                                                                            
simpletorch-NO_ARGS--QV100-SASS. Status=ABORTED                                                                                                                                                                       
Last 10 line of /root/accel-sim-framework/util/job_launching/../../sim_run_11.7/simpletorch/NO_ARGS/QV100-SASS/simpletorch-NO_ARGS.accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0.o11          
------------------                                                                                                                                                                                                    
thread block = 0,0,0                                                                                                                                                                                                  
GPGPU-Sim: Reconfigure L1 cache to 120KB                                                                                                                                                                              
GPGPU-Sim uArch: Shader 32 bind to kernel 7 '_ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_7MeanOpsIffffEEjfLi4EEEEEvT1_'                                                                                 
launching kernel name: _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_7MeanOpsIffffEEjfLi4EEEEEvT1_ uid: 7                                                                                                 
Header info loaded for kernel command : ./traces/kernel-7.traceg                                                                                                                                                      
-accelsim tracer version = 3                                                                                                                                                                                          
-nvbit version = 1.5.3                                                                                                                                                                                                
-local mem base_addr = 0x00007f8a46000000                                                                                                                                                                             
-shmem base_addr = 0x00007f8a48000000                                                                                                                                                                                 
-cuda stream id = 0                                                                                                                                                                                                   
------------------                                                                                                                                                                                                    

Contents of /root/accel-sim-framework/util/job_launching/../../sim_run_11.7/simpletorch/NO_ARGS/QV100-SASS/simpletorch-NO_ARGS.accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0.e11              
------------------                                                                                                                                                                                                    
accel-sim.out: shader.cc:3782: void barrier_set_t::deallocate_barrier(unsigned int): Assertion `active.any() == false' failed.                                                                                        
/root/accel-sim-framework/util/job_launching/../../sim_run_11.7/simpletorch/NO_ARGS/QV100-SASS/slurm.sim: line 52: 16360 Aborted                 (core dumped) /root/accel-sim-framework/util/job_launching/../../sim_
run_11.7/gpgpu-sim-builds/accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0/accel-sim.out -config ./gpgpusim.config -trace ./traces/kernelslist.g                                                 

------------------                                                                                                                                                                                                    
**********************************************************                                                                                                                                                            

All 1 Tests Done.                                                                                                                                                                                                     
Something did not pass.
JRPan commented 3 months ago

Can you share the trace file in some way?

I saw this before in some reduce kernels (actually, yours is a reduce kernel as well. Maybe this is the same issue).

In the last few lines of the trace there is probably an EXIT with mask FFFFFFFF, which means all threads in the warp have exited. However, some trace lines after it carry a mask of 00000001, which means one thread is still active.

This is an NVBit issue, which we have also reported here: https://github.com/NVlabs/NVBit/issues/122. For now, you can either remove the assert or manually delete the trace lines that follow the EXIT.

Thanks

sinharudraneel commented 3 months ago

Sure! The traces were generated in this folder. There are 131 trace files for the program, and I have not been able to go through all of them yet. The ones I did go through did not seem to contain a post-EXIT trace line with the mask you described, but I did find two EXIT statements with a mask of 00000000 (lines 146 and 152) in this trace file. I will keep looking, though; what you mentioned is probably the case.

Thank you!

JRPan commented 3 months ago

It is kernel-7. https://github.com/sinharudraneel/dp-performance-accel-sim/blob/week3-traces/simpletorch3-traces/traces/kernel-7.traceg#L149C1-L149C25

The block size is 8, so only eight threads are active. The mask is 000000ff.

at: ba20 000000ff 0 EXIT 0 0

At this point all threads have exited, but the trace still shows one thread active after ba20.

We are currently unable to fix this because it is an NVBit problem. Ignoring it for now should not do much harm.
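To locate such lines without reading every trace file by hand, the check described above can be sketched in Python. This is only a rough sketch under assumptions: it assumes each instruction line starts with a hex PC followed by an 8-digit hex active mask, with the opcode appearing as a later token (as in the `ba20 000000ff 0 EXIT 0 0` line quoted above), and that a line beginning with `warp =` or `thread block =` starts a new warp or block. The function name `find_post_exit_lines` is hypothetical and not part of Accel-Sim or NVBit.

```python
import re

# Assumed instruction layout: "<hex PC> <8-digit hex mask> <remaining fields>"
INSTR = re.compile(r"^([0-9a-fA-F]+)\s+([0-9a-fA-F]{8})\s+(.*)$")


def find_post_exit_lines(lines):
    """Return (line_number, text) for instructions whose active mask
    includes lanes that already executed EXIT in the same warp."""
    exited = 0          # bitmask of lanes that have exited in the current warp
    suspicious = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Assumption: these markers begin a new warp or thread block.
        if line.startswith("warp =") or line.startswith("thread block ="):
            exited = 0
            continue
        match = INSTR.match(line)
        if not match:
            continue    # header or other non-instruction line
        mask = int(match.group(2), 16)
        if mask & exited:
            suspicious.append((number, line))
        tokens = match.group(3).split()
        if any(t == "EXIT" or t.startswith("EXIT.") for t in tokens):
            exited |= mask
    return suspicious


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as trace:
        for number, text in find_post_exit_lines(trace):
            print(f"{number}: {text}")
```

Run it with the path to one trace file, for example `./traces/kernel-7.traceg`. It only reports candidate lines and does not modify the trace. If the trace also records a per-warp instruction count, that count would presumably need to stay consistent after any manual deletion.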

sinharudraneel commented 3 months ago

Ah I see, thank you!