ari-apc-lab / croupier

Cloudify plugin for HPCs and batch applications
https://hub.docker.com/repository/docker/marangiop/cloudify-croupier-ari-apc-lab
Apache License 2.0
6 stars 4 forks source link

CANCELLED BY 13431 warning after jobs are queued #11

Open marangiop opened 3 years ago

marangiop commented 3 years ago

Cloudify Version 20.02.23~community (Community)

Croupier Version Commit 2abb8b6 of branch permedcoe

image

marangiop commented 3 years ago

Deployment met_00_1806_share_8last from blueprint met-00-test-last8 inside http://cloudify.grapevine-project.eu/

marangiop commented 3 years ago

We noticed a recurrent bug when we try to execute a workflow manually from the Cloudify GUI. After we have started the run_jobs workflow on a given deployment, the jobs are queued and immediately after all of the jobs have been queued, the warning message CANCELLED BY 13431 comes up all the time. The jobs of the blueprint are actually submitted and executed correctly (as showsn by squeue) on CESGA HPCM but due to this warning it is no longer possible to monitor the status of the jobs. Eventually we need to kill the execution of run_jobs after we are sure that indeed the jobs have all been executed on HPC.

This becomes quite problematic for blueprints that have dependencies, since the orchestrator fails to detect the the jobs status is COMPLETED and as a result, it does not submit any jobs that depend on the current jobs.