stevenstetzler closed this issue 2 years ago
Are you running with parsl monitoring enabled? In the `parsl_config` section of the bps yaml, that would be enabled with:

```yaml
parsl_config:
  monitoring: True
```
The part of the code where that ValueError occurs is only executed when parsl monitoring is disabled and the state of each job is instead ascertained from the log files. The ctrl_bps code was changed some months ago to use UUIDs for the qgraphNodeId instead of ints. For large workflows it's inefficient to use the logs for job state, so enabling monitoring is preferred. Nevertheless, we can put in a fix for this.
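To illustrate the failure mode described above: if the log-parsing path still calls `int()` on a qgraphNodeId that is now a UUID string, that call raises ValueError. A minimal sketch of a tolerant parser (function name and structure are hypothetical, not the plugin's actual code):

```python
import uuid

def parse_node_id(raw: str):
    """Accept both the legacy integer qgraphNodeId and the newer UUID form.

    Hypothetical sketch: the old code path effectively did `int(raw)`,
    which raises ValueError now that ctrl_bps emits UUIDs.
    """
    try:
        return int(raw)          # legacy integer node IDs
    except ValueError:
        return uuid.UUID(raw)    # current UUID-based node IDs

print(parse_node_id("42"))
print(parse_node_id("12345678-1234-5678-1234-567812345678"))
```

With a fallback like this, both old- and new-style IDs parse cleanly instead of crashing the submission.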
I'm running with monitoring off. Any idea why this has only shown up in one context? I haven't had this issue before with any runs using the same version of the stack.
I'm choosing to run with monitoring off because I wanted to run multiple `bps submit` commands at once from the same node, but the WorkQueue and MonitoringHub ports were conflicting. I made a patch so that the WorkQueue port can be defined through YAML and/or environment variables, but the change for the MonitoringHub was more in-depth, so I worked around it by turning monitoring off.
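The port-conflict workaround above can be sketched as follows. This is a hypothetical illustration, not the actual patch: the environment variable name `BPS_WORKQUEUE_PORT` and the default port are invented for the example.

```python
import os

# Illustrative default; the real plugin's default port may differ.
DEFAULT_WORKQUEUE_PORT = 9123

def workqueue_port() -> int:
    """Resolve the WorkQueue port, letting an environment variable override
    the default so concurrent `bps submit` runs on one node can each use a
    distinct port instead of colliding."""
    return int(os.environ.get("BPS_WORKQUEUE_PORT", DEFAULT_WORKQUEUE_PORT))

os.environ["BPS_WORKQUEUE_PORT"] = "9500"
print(workqueue_port())  # prints 9500
```

Each submission shell can export its own port value, avoiding the conflict without touching the monitoring configuration.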
> Any idea why this has only shown up in one context?
Other than explicitly trying to get the workflow status via the Python API, I think it would most likely arise from a retry of a job that failed on its first attempt.
Closed by #53
I've been using this Parsl BPS plugin for a while to run LSST pipelines processing with Slurm. I saw an error from my last run that I've never seen before:
This error caused the BPS submission to shut down and stop execution. I'm not sure what caused it, and I've never hit it in previous executions, but I'm opening this issue in case it's something that should be fixed.