alibaba / clusterdata

cluster data collected from production clusters in Alibaba for cluster management research

Question about State machine of batch task and instance #22

Open LernaeanHydra opened 6 years ago

LernaeanHydra commented 6 years ago

According to trace_201708.md, both tasks and instances have a "Waiting" status, described as follows:

task -> Waiting: A task is not initialized yet

instance -> Waiting: The instance can't run because some of its dependencies have not finished

IIUC, if an instance's status is "Waiting", we can be sure that some dependency among tasks has not been satisfied, and once a task's status is no longer "Waiting", its instances' status can change. So do "Waiting" tasks mean that they are waiting for other tasks to finish?

In addition, I found that some instances reboot after reaching the "Failed" status, but others do not. Is there any mechanism for deciding whether an instance should reboot?

I would appreciate it if someone could help.

HaiyangDING commented 6 years ago

I would try to clarify this by answering your questions:

So do "Waiting" tasks mean that they are waiting for other tasks to finish?

No. A task is "Waiting" because its relevant data is not initialized yet, e.g. determining the properties of task instances based on the plan issued by the jobmaster, obtaining information about data locality, etc.

In addition, I found that some instances reboot after reaching the "Failed" status, but others do not. Is there any mechanism for deciding whether an instance should reboot?

There is a limit on the number of times an instance can reboot after failure. An instance will not try to reboot again if either of the following conditions is met: 1) it has reached the maximum number of reboots, OR 2) an explicit "unretry" signal is given.
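For anyone replaying the trace, the rule above can be sketched roughly as follows. This is only an illustration of the two conditions as described in this comment, not the scheduler's actual code: the names `FailedInstance`, `reboots_so_far`, `unretry_signaled`, and `max_reboots` are hypothetical, and the real maximum is not stated in this thread.

```python
from dataclasses import dataclass


@dataclass
class FailedInstance:
    """Hypothetical view of an instance that has just reached "Failed"."""
    reboots_so_far: int             # how many times it has already rebooted
    unretry_signaled: bool = False  # an explicit "unretry" signal was given


def should_reboot(instance: FailedInstance, max_reboots: int) -> bool:
    """Return True if the failed instance may reboot once more.

    Mirrors the two stop conditions described above:
    1) the maximum number of reboots has been reached, OR
    2) an explicit "unretry" signal was given.
    """
    if instance.unretry_signaled:
        return False
    if instance.reboots_so_far >= max_reboots:
        return False
    return True


# max_reboots=3 is an arbitrary value chosen for the example.
print(should_reboot(FailedInstance(reboots_so_far=1), max_reboots=3))
print(should_reboot(FailedInstance(reboots_so_far=3), max_reboots=3))
print(should_reboot(FailedInstance(reboots_so_far=0, unretry_signaled=True), max_reboots=3))
```

Under this reading, an instance that stays in "Failed" without rebooting has hit one of the two stop conditions, which would explain why only some failed instances reboot in the trace.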

LernaeanHydra commented 6 years ago

@HaiyangDING Thank you for your answer! I now understand what the task "Waiting" status means. Then what is the state of a task when its dependencies are not satisfied?

ChenJing036-hub commented 5 years ago

Hi. I would like to know what the canceled state means. And what are the 12 types of tasks? I would appreciate it if someone could help.