Questions about the run_sim.py implementation

YounghunGo commented 1 year ago

Hello. I have a question about the simulator code.

In the SRTF code (def shortest_first_sim_jobs(...)) in run_sim.py, the code that updates remaining_iteration for each job in the runnable_jobs list subtracts an overhead from the elapsed execution time before converting it into iterations.

As I understand it, this overhead is the time between launching a training job and the start of its first iteration. It should therefore be subtracted only once, at the beginning, but this code appears to subtract it repeatedly.

Thanks for answering the question :)

Rivendile commented 1 year ago

Thanks for your question. In SRTF, the jobs are scheduled and restarted at every scheduling interval. Therefore, the overhead is charged at every scheduling interval.

YounghunGo commented 1 year ago

When a job is resumed under the shortest-first policy at a scheduling interval, I agree that adding the overhead is correct. However, in the current code the overhead is added repeatedly even while a job keeps running without being preempted. What I'd like to confirm is this: is the overhead, as I understand it, the time needed to bring up the workload before its iterations start? Thank you.

for rjob in JOBS.runnable_jobs:
    if 'RUNNING' == rjob['status']:
        if rjob['model_name'] in overhead_dict[rjob['num_gpu']]:
            print('add overhead job:%s' % rjob['job_idx'])  # I added this line
            tmp_oh = overhead_dict[rjob['num_gpu']][rjob['model_name']]
        else:
            tmp_oh = 10
        # tmp_oh = 0
        # Useful time in this interval: elapsed time minus the overhead, floored at 0.
        tmp = max(event_time - rjob['last_check_time'] - tmp_oh, 0)
        rjob['total_executed_time'] = rjob['total_executed_time'] + event_time - rjob['last_check_time']
        rjob['remaining_iteration'] -= tmp / rjob['iteration_time']

This is the log I got after adding the print line above.

---- job[0] is added
---- job[1] is added
add overhead job:0
---- job[2] is added
add overhead job:0
add overhead job:1
---- job[3] is added
add overhead job:0
add overhead job:1
add overhead job:2
---- job[4] is added
add overhead job:0
add overhead job:3
add overhead job:1
add overhead job:2
---- job[5] is added
add overhead job:0
add overhead job:4
add overhead job:3
add overhead job:1
add overhead job:2
---- job[6] is added
add overhead job:0
add overhead job:4
add overhead job:3
add overhead job:5
add overhead job:1
add overhead job:2
...
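
To make the effect concrete, here is a minimal sketch of the arithmetic, not the simulator's actual code. The helper name iterations_done, the flag charge_every_interval, and all numbers (three 30-second intervals, a 10-second overhead, 1 second per iteration) are made up for illustration. It applies the same max(elapsed - overhead, 0) / iteration_time rule as the snippet above, once with the overhead charged in every interval and once with it charged only in the first.

```python
def iterations_done(intervals, overhead, iteration_time, charge_every_interval=True):
    """Return the iterations completed over consecutive intervals (in seconds).

    Hypothetical helper: mirrors max(elapsed - overhead, 0) / iteration_time
    from the snippet above for a single job that is never preempted.
    """
    done = 0.0
    for i, elapsed in enumerate(intervals):
        # Charge the overhead in every interval, or only in the first one.
        charge = overhead if (charge_every_interval or i == 0) else 0.0
        useful = max(elapsed - charge, 0.0)
        done += useful / iteration_time
    return done


# Illustrative values only (not taken from the simulator or its traces).
intervals = [30.0, 30.0, 30.0]
per_event = iterations_done(intervals, overhead=10.0, iteration_time=1.0)
one_time = iterations_done(intervals, overhead=10.0, iteration_time=1.0,
                           charge_every_interval=False)
print(per_event, one_time)  # 60.0 80.0
```

With these numbers the job loses 30 seconds to overhead when it is charged at every event, versus 10 seconds when it is charged once, so the gap grows with the number of events.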

Rivendile commented 1 year ago

Sorry for the misleading term "schedule interval". In our comparison, the SRTF scheduler stops, re-schedules, and restarts all jobs whenever an event occurs (e.g., a job starts or ends), i.e., the scheduler re-schedules all jobs on every iteration of the while loop. In the simulator code, however, we do not stop and restart the jobs explicitly, but use the overhead to represent preemption.

Minimizing the reschedule overhead is an optimization for these schedulers, which is out of the scope of our paper and this implementation.

Besides, we calculate the execution information for an interval at the beginning of the next schedule interval. So we add an overhead for all running jobs at the beginning of every schedule interval.
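
A rough sketch of this event loop, under stated assumptions, might look like the following. It is not the code in run_sim.py: the function names account_interval and reschedule, the 'PENDING' status, the update of last_check_time, the use of dict.get for the overhead lookup, and all job values are assumptions made for illustration. At each event it first settles the interval that just ended for every running job (charging the overhead as a stand-in for a restart), then re-schedules by shortest remaining time.

```python
def account_interval(jobs, event_time, overhead_dict, default_overhead=10):
    """Settle the interval that just ended for every RUNNING job."""
    for job in jobs:
        if job['status'] != 'RUNNING':
            continue
        # The overhead stands in for the stop/restart at the previous event.
        overhead = overhead_dict.get(job['num_gpu'], {}).get(
            job['model_name'], default_overhead)
        elapsed = event_time - job['last_check_time']
        useful = max(elapsed - overhead, 0)
        job['total_executed_time'] += elapsed
        job['remaining_iteration'] -= useful / job['iteration_time']
        job['last_check_time'] = event_time  # assumed bookkeeping


def reschedule(jobs, free_gpus):
    """Shortest-remaining-time-first: grant GPUs to the shortest jobs first."""
    by_remaining = sorted(
        jobs, key=lambda j: j['remaining_iteration'] * j['iteration_time'])
    for job in by_remaining:
        if job['num_gpu'] <= free_gpus:
            job['status'] = 'RUNNING'
            free_gpus -= job['num_gpu']
        else:
            job['status'] = 'PENDING'  # hypothetical status name


# Illustrative job and overhead table (values are made up).
jobs = [{'job_idx': 0, 'model_name': 'resnet', 'num_gpu': 1,
         'iteration_time': 0.5, 'remaining_iteration': 200.0,
         'total_executed_time': 0.0, 'last_check_time': 0.0,
         'status': 'PENDING'}]
overhead_dict = {1: {'resnet': 4}}

for event_time in (0.0, 20.0, 30.0):
    account_interval(jobs, event_time, overhead_dict)  # settle previous interval
    reschedule(jobs, free_gpus=1)                      # then re-schedule all jobs

print(jobs[0]['remaining_iteration'], jobs[0]['total_executed_time'])  # 156.0 30.0
```

In this sketch the single job is never displaced, yet it is still charged the 4-second overhead at both t=20 and t=30, which matches the behavior described above.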

Rivendile / Muri

Questions about the run_sim.py implementation #2