accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

run_simulations.py fails for applications with a large number of arguments #232

Closed barnes88 closed 2 months ago

barnes88 commented 1 year ago

The trace files are stored in a directory whose name is derived from the arguments passed to the application. If the argument string is long, simulation jobs can fail without reporting any reason. We should limit the length of the argument string so that we don't end up with excessively long directory names.
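One way to bound the directory name is sketched here. This is only an illustration, not the code in run_simulations.py: the function name `args_to_dir_name`, the 64-character limit, and the `NO_ARGS` placeholder are all assumptions made for the example. The idea is to keep a readable prefix of the argument string and append a short hash of the full string, so that two different long argument lists still map to different directories.

```python
import hashlib
import re

# Assumed limit for this sketch; the issue does not specify a value.
MAX_DIR_NAME_LEN = 64


def args_to_dir_name(args, max_len=MAX_DIR_NAME_LEN, empty_name="NO_ARGS"):
    """Map an application's argument string to a bounded directory name.

    Hypothetical helper: `empty_name` is an assumed placeholder used when
    the application takes no arguments.
    """
    if not args:
        return empty_name
    # Replace anything other than letters, digits, '.', '_' and '-' with '_'.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", args)
    if len(safe) <= max_len:
        return safe
    # Too long: keep a readable prefix and append a short hash of the
    # original (unsanitized) argument string to keep names distinct.
    digest = hashlib.sha1(args.encode("utf-8")).hexdigest()[:8]
    prefix = safe[: max_len - len(digest) - 1]
    return prefix + "_" + digest
```

Short argument strings pass through with only unsafe characters replaced, so existing short directory names would stay readable; only strings over the limit are truncated and hashed. Any trace directories generated under the old naming scheme would need to be renamed to match.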

Below is an example of a job that fails because of a long argument string. It can be fixed by shortening the name of the trace-file directory and removing the arguments from define-all-apps.yml, but ideally this workaround wouldn't be needed.



Using logfiles ['/scratch/tgrogers-disk01/a/barnes88/private-accel-sim/util/job_launching/../job_launching/logfiles/sim_log.1ggnn.23.06.14-Wednesday.txt']
squeue.id       Node                            App                     AppArgs                 Version                 Config          RunningTime     Mem     JobStatus                       Basic GPGPU-Sim Stats
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
146046          UNKNOWN                         sift10k_multi           __base_filename_data    sift10k_multi.accels    QV100-SASS      UNKNOWN         UNKNOWN NOT_RUNNING_NO_OUTPUT                                                                                             
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Passed:0/1, No error:0/1, Failed/Error:1/1, Running:0/1, Waiting:0/1                                                                                                                                                                                                              
Contents :                                                                                                                                    
All 1 Tests Done.                                                                                                                                                                                                                                                                 
Something did not pass.