JF-D / Proteus


Task allocated to wrong executor #2

Closed: tareqmahmood closed this issue 1 month ago

tareqmahmood commented 1 month ago

I get a KeyError when running the following command:

python megatron_gpt.py -model gpt-1 -bs 4 -cluster clusters/dgx1_v100_2ib/n1_g8.json -ps pp -pp-deg 2 -mp-deg 2 --profile-iters 10

The error:

Traceback (most recent call last):
  File "/users/tareq/Proteus/examples/megatron_gpt.py", line 494, in <module>
    cost = sim.run('log/trace')
  File "/users/tareq/Proteus/proteus/simulator/simulator.py", line 1077, in run
    cur_stage = dev_group.execute()
  File "/users/tareq/Proteus/proteus/simulator/simulator.py", line 781, in execute
    self.dev_execute(cur_stage)
  File "/users/tareq/Proteus/proteus/simulator/simulator.py", line 811, in dev_execute
    ntsks = self.executors[tsk.rank].alloc(tsk)
KeyError: 0

Digging further, I found that the simulator is trying to allocate the task tsk = f7806[x_2_148/0:gpu:0], whose rank is 0, while self.executors holds only ranks 4 to 7:

self.executors = {
    6: <proteus.simulator.simulator.DevExecutor object at 0x7f80dd2f3220>, 
    7: <proteus.simulator.simulator.DevExecutor object at 0x7f80dd2f32b0>,
    5: <proteus.simulator.simulator.DevExecutor object at 0x7f80dd2f3190>, 
    4: <proteus.simulator.simulator.DevExecutor object at 0x7f80dd2f3100>
}

The task's rank (0) is not a key in the executor dictionary, so the lookup raises a KeyError. Am I doing this wrong?
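For anyone hitting the same error, the failure can be reproduced outside the simulator with a small stand-in. This is only a sketch: `Task`, the simplified `DevExecutor`, and `checked_alloc` are hypothetical names written for illustration and are not Proteus code. It shows the same dictionary lookup, wrapped in a guard that reports which ranks the group actually holds.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical stand-in for the simulator's task object.
    name: str
    rank: int


class DevExecutor:
    # Hypothetical, simplified stand-in for the simulator's executor class.
    def __init__(self, rank):
        self.rank = rank
        self.queue = []

    def alloc(self, tsk):
        # Queue the task and return the number of queued tasks.
        self.queue.append(tsk)
        return len(self.queue)


def checked_alloc(executors, tsk):
    """Allocate tsk to the executor for its rank, with a descriptive error."""
    if tsk.rank not in executors:
        raise KeyError(
            f"task {tsk.name} targets rank {tsk.rank}, "
            f"but this device group only has ranks {sorted(executors)}"
        )
    return executors[tsk.rank].alloc(tsk)


# Same keys as the dictionary reported above.
executors = {rank: DevExecutor(rank) for rank in (6, 7, 5, 4)}
bad_task = Task(name="x_2_148/0", rank=0)

try:
    checked_alloc(executors, bad_task)
except KeyError as err:
    message = err.args[0]
    print(message)
```

A guard like this does not fix the underlying mapping; it only makes the mismatch between the task's rank and the group's ranks visible at the point of failure.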

JF-D commented 1 month ago

Fixed. This was caused by an incorrect parallelization mapping of the input x_2.
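As a rough illustration of why rank 0 does not belong in that executor group, the arithmetic for the reported command can be sketched as follows. The assumptions are mine, not confirmed by the thread: the n1_g8 cluster file describes 8 GPUs, and pipeline stages take contiguous blocks of ranks. `stage_ranks` and `data_parallel_degree` are hypothetical helpers, not Proteus functions.

```python
def stage_ranks(world_size, pp_deg):
    """Split ranks 0..world_size-1 into pp_deg contiguous pipeline stages.

    Contiguous assignment is an assumption made for illustration.
    """
    if world_size % pp_deg != 0:
        raise ValueError("world_size must be divisible by pp_deg")
    per_stage = world_size // pp_deg
    return [
        list(range(s * per_stage, (s + 1) * per_stage))
        for s in range(pp_deg)
    ]


def data_parallel_degree(world_size, pp_deg, mp_deg):
    """Data-parallel degree left over after pipeline and model parallelism."""
    if world_size % (pp_deg * mp_deg) != 0:
        raise ValueError("world_size must be divisible by pp_deg * mp_deg")
    return world_size // (pp_deg * mp_deg)


# Values from the command in this issue: -pp-deg 2 -mp-deg 2, assumed 8 GPUs.
stages = stage_ranks(world_size=8, pp_deg=2)
dp_deg = data_parallel_degree(world_size=8, pp_deg=2, mp_deg=2)

print(stages)
print(dp_deg)
```

Under these assumptions the second stage holds ranks 4 to 7, which matches the executor dictionary in the report, so a task placed on gpu:0 would belong to the other stage's group.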