Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License

Dependent tasks are not triggered after parent tasks complete #247

Closed hieuhc closed 5 years ago

hieuhc commented 5 years ago

Problem Description

I submitted a job containing a set of parent tasks and a set of dependent tasks. After a while, all parent tasks completed with exit code 0, but the dependent tasks remained in Active status and were never triggered.

Batch Shipyard Version

Pool was created with 3.5.2-cli, jobs were submitted by both 3.5.2-cli and 3.6.1-cli. I also tried force_enable_task_dependencies: true.

Steps to Reproduce

In my case, I reduced it to a simple job file with 2 tasks, where one task depends on the other.

Expected Results

Dependent tasks should be triggered after their parent tasks enter Completed status with exit code 0.
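
The expected behavior above can be sketched in a few lines of Python. This is only an illustrative model of the rule stated in this issue (a task becomes runnable once every parent has completed with exit code 0); it is not how Azure Batch or Batch Shipyard implements scheduling. The function name `ready_tasks` and the `results` mapping are hypothetical.

```python
def ready_tasks(tasks, results):
    """Return ids of not-yet-started tasks whose parents all completed with exit code 0.

    tasks:   list of dicts with "id" and an optional "depends_on" list of ids
    results: dict mapping task id -> (state, exit_code) for tasks already started
             (hypothetical structure, used only for this sketch)
    """
    ready = []
    for task in tasks:
        if task["id"] in results:
            continue  # already running or finished, nothing to trigger
        parents = task.get("depends_on", [])
        # A task with no parents is ready immediately (all() of an empty list is True).
        if all(results.get(parent) == ("completed", 0) for parent in parents):
            ready.append(task["id"])
    return ready


# Same task ids as in the sample job file.
tasks = [
    {"id": "task-parent"},
    {"id": "task-dependent", "depends_on": ["task-parent"]},
]
```

Under this model, once `task-parent` is recorded as `("completed", 0)`, `task-dependent` is the only task returned, which is the transition that did not happen in the reported case.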

Sample job file

job_specifications:
-   allow_run_on_missing_image: true
    environment_variables:
        INPUT_MODE: env
    id: batch-test-depend-4
    force_enable_task_dependencies: true
    tasks:
    -   command: bash -c "cd /workspace && python -m service.main.processor"
        docker_image: parent-task:latest
        environment_variables:
            COMPUTE_CONTEXT: GPU
            CONFIG_FILE_RUNNER: service/config/job_processor.config.dev.yml
            document_id_list: '["5ea0d841-a590-41bf-a81c-0206c787a411","1fcec487-edca-4229-ab6f-868e6f2c13b0"]'
        gpu: true
        id: task-parent
        remove_container_after_exit: true
    -   command: bash -c "cd /workspace && python -m service.main.processor"
        depends_on:
        - task-parent
        docker_image: dependent-task:latest
        environment_variables:
            COMPUTE_CONTEXT: GPU
            CONFIG_FILE_RUNNER: service/config/job_processor.config.dev.yml
            document_id_list: '["5ea0d841-a590-41bf-a81c-0206c787a411","1fcec487-edca-4229-ab6f-868e6f2c13b0"]'
        gpu: true
        id: task-dependent
        remove_container_after_exit: true
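
One thing that can be ruled out locally before suspecting the service is a mistyped id in `depends_on`. The sketch below assumes the job specification has already been loaded into a plain Python dict (for example with a YAML parser) and reports any dependency that does not match a task id in the same job. The helper name `undefined_dependencies` is hypothetical and not part of Batch Shipyard.

```python
def undefined_dependencies(job):
    """Map each task id to the depends_on entries that match no task in the job."""
    task_ids = {task["id"] for task in job["tasks"]}
    missing = {}
    for task in job["tasks"]:
        unknown = [dep for dep in task.get("depends_on", []) if dep not in task_ids]
        if unknown:
            missing[task["id"]] = unknown
    return missing


# Reduced form of the sample job file: only the fields relevant to dependencies.
job = {
    "id": "batch-test-depend-4",
    "tasks": [
        {"id": "task-parent"},
        {"id": "task-dependent", "depends_on": ["task-parent"]},
    ],
}
print(undefined_dependencies(job))
```

For the sample job file the result is empty, so the dependency reference itself is consistent.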

alfpark commented 5 years ago

Are you positive that task-parent exited with exit code 0?

I just tried a sample recipe with task dependencies, and everything appears to be working correctly. Would you be able to try that recipe and see if you can repro task dependencies getting stuck?

alfpark commented 5 years ago

Also, which Azure region are you running your jobs in? Could you try a different region temporarily?

hieuhc commented 5 years ago

Yes, task-parent exited with code 0, or at least that is what the Azure Portal shows; I haven't tried to SSH into that node to check. In task-parent I use sys.exit(0) in Python 2.7 to terminate the application.
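
A quick local check of what exit code a process reports after calling `sys.exit(0)` can be sketched as follows. This is independent of Azure Batch and of the actual `service.main.processor` module; it uses `subprocess.run`, which requires Python 3.5 or later rather than the Python 2.7 mentioned above, and the helper name `exit_code_of` is hypothetical.

```python
import subprocess
import sys


def exit_code_of(args):
    """Run a command and return the exit code reported by the operating system."""
    return subprocess.run(args).returncode


# Stand-in for the real task: a child interpreter that terminates via sys.exit(0).
code = exit_code_of([sys.executable, "-c", "import sys; sys.exit(0)"])
print("exit code:", code)
```

Replacing the stand-in command with the real task command (run inside the same container image) would show whether the wrapper shell, and not just the Python process, also exits with 0.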

The jobs have been running in Azure region South Central US. I will try to create an Azure Batch account in West Europe to see how it behaves.

alfpark commented 5 years ago

Thanks, it looks like it may be an Azure Batch regional issue. We're tracking this internally. In the meantime, if you can confirm that it does not repro in a different region, that would be great.

hieuhc commented 5 years ago

I haven't had a chance to create another Azure Batch account, but the problem now seems to have gone away: dependent tasks are triggered as expected. Maybe that regional issue was fixed. Thank you for your support.

alfpark commented 5 years ago

There was a delay in task dependency processing in South Central US and the issue has since been mitigated. Thanks for your patience.