Azure-Samples / azure-batch-samples

Azure Batch and HPC Code Samples

`TaskStateMonitor.WhenAll()` Blocked When Waiting for 'Running' State of 'Completed' Task #238

Closed vejuhust closed 6 years ago

vejuhust commented 6 years ago

According to the docs, `TaskStateMonitor.WhenAll()` waits until each of its members has reached the desired state at least once.

Here is my code (I know it's dirty):

// 'state' must be selected too, otherwise task.State is null in the check below
var tasksDetail = new ODATADetailLevel(selectClause: "id,state,executionInfo");
var tasks = await batchClient.JobOperations.ListTasks(jobId, tasksDetail).ToListAsync();
Parallel.ForEach(tasks, (task) =>
{
    if (task == null)
    {
        Logger.Warn("Invalid Task info");
        return;
    }

    if (task.State == TaskState.Completed)
    {
        Logger.Info($"Task '{task.Id}' is already completed, skipping monitoring");
        return;
    }

    // Wait for 'Running' state
    batchClient.Utilities.CreateTaskStateMonitor().WhenAll(new[] {task}, TaskState.Running, timeout).Wait();
    task.RefreshAsync(new ODATADetailLevel(selectClause: "id,nodeInfo")).Wait();
    Logger.Debug($"Task '{task.Id}' is running on Node '{task.ComputeNodeInformation?.ComputeNodeId}' now");

    // Wait for 'Completed' state
    batchClient.Utilities.CreateTaskStateMonitor().WhenAll(new[] {task}, TaskState.Completed, timeout).Wait();
    task.RefreshAsync(new ODATADetailLevel(selectClause: "id,executionInfo")).Wait();
    Logger.Info($"Task '{task.Id}' is completed now, duration: {task.ExecutionInformation.EndTime - task.ExecutionInformation.StartTime:c}");
});

This code snippet works well. But when I swapped the 'Running' and 'Completed' sections (as below), it blocked while waiting for the 'Running' state. IMHO, this behavior differs from what the documentation describes: the task first became 'Running' and then 'Completed', so once it reached the 'Completed' state it must already have reached 'Running', and it should not block at batchClient.Utilities.CreateTaskStateMonitor().WhenAll(new[] {task}, TaskState.Running, timeout).Wait();.

    // Wait for 'Completed' state
    batchClient.Utilities.CreateTaskStateMonitor().WhenAll(new[] {task}, TaskState.Completed, timeout).Wait();
    task.RefreshAsync(new ODATADetailLevel(selectClause: "id,executionInfo")).Wait();
    Logger.Info($"Task '{task.Id}' is completed now, duration: {task.ExecutionInformation.EndTime - task.ExecutionInformation.StartTime:c}");

    // Wait for 'Running' state
    batchClient.Utilities.CreateTaskStateMonitor().WhenAll(new[] {task}, TaskState.Running, timeout).Wait();
    task.RefreshAsync(new ODATADetailLevel(selectClause: "id,nodeInfo")).Wait();
    Logger.Debug($"Task '{task.Id}' is running on Node '{task.ComputeNodeInformation?.ComputeNodeId}' now");

matthchr commented 6 years ago

The monitor looks for tasks in the specific state you requested. It doesn't know that "Completed" means the task was actually "Running" at some previous point (actually, there are ways to get to completed without ever going to "Running" anyway).

This means the monitor actually has to observe the task while it is in the "Running" state. Depending on how long the tasks run and how many you have, it may be hard for the monitor to catch every task while it is running. For example, with 200 tasks that each run for 10s, it's entirely possible (even likely) that the monitor doesn't see all 200 tasks every 10s (since it tries to optimize how it queries to be efficient), which means it could miss the "Running" state for one or more tasks. If it misses "Running" for even one task, it will time out waiting for that task to reach the "Running" state.
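
To make the race concrete, here is a minimal, self-contained sketch of a poller that samples a task's state at fixed intervals. It does not call the Batch service and is not how TaskStateMonitor is implemented; the names `SimTaskState`, `PollingDemo`, `StateAt`, and `ObservedStates`, as well as the timings, are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real TaskState enum; not part of the Batch SDK.
public enum SimTaskState { Active, Running, Completed }

public static class PollingDemo
{
    // Simulated timeline: Active before startSec, Running until endSec, Completed afterwards.
    public static SimTaskState StateAt(int second, int startSec, int endSec)
    {
        if (second < startSec) return SimTaskState.Active;
        return second < endSec ? SimTaskState.Running : SimTaskState.Completed;
    }

    // Returns the set of states a poller observes when it samples every intervalSec seconds.
    public static HashSet<SimTaskState> ObservedStates(int startSec, int endSec, int intervalSec, int horizonSec)
    {
        var seen = new HashSet<SimTaskState>();
        for (int t = 0; t <= horizonSec; t += intervalSec)
        {
            seen.Add(StateAt(t, startSec, endSec));
        }
        return seen;
    }

    public static void Main()
    {
        // The task runs from t=2s to t=12s; the poller samples at t=0s, 15s, and 30s.
        var seen = ObservedStates(startSec: 2, endSec: 12, intervalSec: 15, horizonSec: 30);
        Console.WriteLine($"Saw Running: {seen.Contains(SimTaskState.Running)}");     // False
        Console.WriteLine($"Saw Completed: {seen.Contains(SimTaskState.Completed)}"); // True
    }
}
```

With these numbers the poller sees "Active" at t=0s and "Completed" at t=15s, so a wait for "Running" would never be satisfied even though the task did run. A wait for "Completed" succeeds because that state does not go away once reached.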

Is there a particular reason you need to wait for "Running"? Generally speaking it's bad practice to do that, as "Running" is a transient state. A task could go to Running and then to Active and then back to Running and then to Completed. Usually it's best to wait for a steady state such as Completed to avoid races and/or your information being out of date.

vejuhust commented 6 years ago

@matthchr Thanks! I monitor "Running" because I need to know whether all my tasks are up and running, and to figure out the status of the pool indirectly. Do you have a better/correct solution?

matthchr commented 6 years ago

If you want to know the status of the pool, could you just list the VMs in the pool and check their state?
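
As a rough sketch of that suggestion, something like the following could count the nodes of a pool by state. It assumes the same `batchClient` as in your snippet and the Batch .NET `PoolOperations.ListComputeNodes` call; the class name `PoolHealth`, the method name `CountNodesByState`, and the `poolId` parameter are hypothetical, and I have not run this against a live account.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Common;

public static class PoolHealth
{
    // Counts the compute nodes of a pool by their current state (Idle, Running, Unusable, ...).
    public static Dictionary<ComputeNodeState, int> CountNodesByState(BatchClient batchClient, string poolId)
    {
        // Only fetch the properties needed here, to keep the listing cheap.
        var detail = new ODATADetailLevel(selectClause: "id,state");

        return batchClient.PoolOperations
            .ListComputeNodes(poolId, detail)
            .Where(node => node.State.HasValue)   // State is nullable on the client object
            .GroupBy(node => node.State.Value)
            .ToDictionary(group => group.Key, group => group.Count());
    }
}
```

Checking the result for `ComputeNodeState.Unusable` or for a low `ComputeNodeState.Idle`/`ComputeNodeState.Running` count gives a direct view of pool health without inferring it from task states.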

I've made a note that we should improve the behavior of the monitor for short-lived task states such as Running, but for now your best bet is to avoid using TaskStateMonitor to monitor for short-lived task states (e.g. Running).

You could also look into using the new GetTaskCounts API, which gives you the count of tasks in a particular state but not their names.

vejuhust commented 6 years ago

@matthchr Yeah, monitoring the pool directly may be a good solution. I used to watch the pool myself.

Once in a while, I noticed that some nodes in the pool became 'Unusable' while resizing, and it seems that such nodes stay in that state forever: they neither reboot nor reimage. Automatic scaling couldn't resolve it, and I had to add extra healthy nodes manually. Do you have any solution for this situation?

matthchr commented 6 years ago

There are a variety of reasons why a pool node might go to unusable. If you have application packages, failing to download them (due to storage errors, etc.) will eventually move the node to unusable. There can also be other things, such as infrastructure blips, that move a node to unusable. Lastly, nodes can go to unusable if you have a pool in a VNet or are running a custom image and there are issues with your VNet or custom image (e.g. the VNet/custom image is blocking ports that Batch needs to communicate). We have autorecovery mechanisms in place to recover nodes that go to unusable, but they can sometimes take some time (15-45m or possibly more), and in some cases, such as the custom image/VNet cases, we can't change the configuration of your VNet and so the nodes would be stuck forever. Generally those sorts of "configuration" problems will happen to all the nodes in the pool though, not just some.

Generally speaking, nodes going to unusable is bad/unexpected. If it's happening often for your workload/pools, it might not be a bad idea to go through the Azure Portal and raise a ticket for your Batch account with the details of the pool it's happening on. That way somebody can investigate what is going wrong and fix it.

vejuhust commented 6 years ago

@matthchr Thanks for your explanation!

In my case, it wasn't a VNet issue or a custom image issue: I used Cloud Service batch nodes (OS Family: 5, Size: Medium). The node spent a long time in 'Creating' before it was declared 'Unusable'.

I encountered a similar issue with an internal Cosmos node before, and I did think about raising a ticket for the batch node, but 1) my free 'Visual Studio Enterprise' subscription includes no technical support, and 2) keeping the spot costs money...so many computing hours 😢

matthchr commented 6 years ago

I am closing this because I think all of the above questions have been answered.