ari-apc-lab / croupier

Cloudify plugin for HPCs and batch applications
https://hub.docker.com/repository/docker/marangiop/cloudify-croupier-ari-apc-lab
Apache License 2.0

Warning messages u'job_id' repeated all the time and job not executed #7

Open marangiop opened 3 years ago

marangiop commented 3 years ago

Cloudify Version 20.02.23~community (Community)

Croupier Version Commit 1eb2f32 of branch grapevine, after merging from permedcoe branch at commit 46239ec

Describe the bug As mentioned in the title, a warning message is repeated all the time and the job is not executed. This is a rare bug: I normally run this blueprint every day without problems, but on some occasions this error arises and stops the execution of the entire blueprint at a specific job. The error consists of a Warning message u'job_id' (where job_id is the ID that Croupier assigns to a specific job of a blueprint) that is repeated in a never-ending loop, approximately every 15 seconds, after the state of that job has changed to RUNNING. To be precise, the same warning is also always shown after the state of every job changes to PENDING, but that case is not important because it does not prevent the job from being executed.


Buried among these warnings there is an additional Warning message: filedescriptor out of range in select(). It is issued only once or twice, after which the u'job_id' warning continues to be repeated.
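For context on that second message: on POSIX builds of CPython, `select.select()` raises `ValueError: filedescriptor out of range in select()` when it is handed a descriptor number at or above `FD_SETSIZE` (typically 1024). The sketch below only illustrates what the message means. It does not show that Croupier or Cloudify leak descriptors, and the helper `select_accepts` and the number `100000` are hypothetical names and values chosen for the illustration.

```python
import os
import select

# Assumption: a POSIX build of CPython, where FD_SETSIZE is typically 1024.
# Hypothetical descriptor number, far above that limit.
FD_BEYOND_LIMIT = 100000


def select_accepts(fd):
    """Return True if select() accepts this descriptor number,
    False if it is rejected as out of range."""
    try:
        # Timeout 0: poll once and return immediately.
        select.select([fd], [], [], 0)
    except ValueError as exc:
        # CPython rejects the number before the OS-level select() is called.
        print("select() refused fd %d: %s" % (fd, exc))
        return False
    return True


if __name__ == "__main__":
    read_end, write_end = os.pipe()  # a real descriptor with a small number
    try:
        print(select_accepts(read_end))
        print(select_accepts(FD_BEYOND_LIMIT))
    finally:
        os.close(read_end)
        os.close(write_end)
```

A long-running process usually reaches such high descriptor numbers only when it holds many files or sockets open at once, so counting the open descriptors of the affected process when the warning appears may be a useful first check. `select.poll()` is not bound by `FD_SETSIZE`, but whether that is relevant here depends on which library issues the call, which the logs above do not show.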


If I check the logs of that specific job in the CESGA users portal, I see that the job has a COMPLETED status. This means that Croupier correctly detected that the job started executing on the HPC, but failed to detect that it finished.

To check the logs of the specific deployment yourself, just enter the Cloudify instance deployed at http://cloudify.grapevine-project.eu/, then search for the deployment called cycle_12_part2_greece_17_05.

To Reproduce Steps to reproduce the behavior:

  1. It's quite hard to reproduce this specific behaviour with a local tox test because, as I said, this is a rare bug, and the deployment depends on data that is downloaded by another deployment specified in another blueprint.

Expected behavior I don't mind if the warning message is shown. I certainly don't expect a deployment to be stopped because of it.

marangiop commented 3 years ago

This has happened again today, for the deployment

cycle_12_part2_greece_03_06

marangiop commented 3 years ago

This happened again yesterday, for the following deployments (they were run at the same time during the night, using the same reservation):

cycle_12_part2_greece_05_06 cycle_12_part2_spain_05_06