LSSTDESC / gen3_workflow

Development code for a Gen3-based DRP pipeline implementation
BSD 3-Clause "New" or "Revised" License

Bad conversion of qgraph node id causes bps submit to fail during execution #52

Closed stevenstetzler closed 2 years ago

stevenstetzler commented 2 years ago

I've been using this Parsl BPS plugin for a while to run LSST pipeline processing with Slurm. In my latest run I saw an error that I've never seen before:

lsst.daf.butler.cli.utils ERROR: Caught an exception, details are in traceback:
Traceback (most recent call last):
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_bps/gda874bdb8a+8617df967c/python/lsst/ctrl/bps/cli/cmd/commands.py", line 89, in submit
    submit_driver(*args, **kwargs)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_bps/gda874bdb8a+8617df967c/python/lsst/ctrl/bps/drivers.py", line 311, in submit_driver
    submit(wms_workflow_config, wms_workflow)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_bps/gda874bdb8a+8617df967c/python/lsst/ctrl/bps/submit.py", line 66, in submit
    workflow = wms_service.submit(wms_workflow)
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 728, in submit
    workflow.parsl_graph.run(block=True)
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 653, in run
    futures = [job.get_future() for job in self.values()
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 653, in <listcomp>
    futures = [job.get_future() for job in self.values()
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in get_future
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in <listcomp>
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in get_future
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in <listcomp>
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in get_future
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in <listcomp>
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in get_future
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in <listcomp>
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in get_future
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in <listcomp>
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in get_future
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in <listcomp>
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in get_future
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 362, in <listcomp>
    inputs = [_.get_future() for _ in self.prereqs]
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 353, in get_future
    if self.done:
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 304, in done
    elif self.status == _SUCCEEDED:
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 330, in status
    if self.have_outputs():
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 377, in have_outputs
    for node in self.qgraph_nodes:
  File "/gscratch/dirac/stevengs/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 392, in qgraph_nodes
    for _ in [(int(self.gwf_job.cmdvals['qgraphNodeId']),
ValueError: invalid literal for int() with base 10: '5819587f-f6b4-4c47-aa3a-c6e65b43054d'

This error caused the BPS submission to shut down and stop execution. I'm not sure what caused it and I've never seen it before in previous executions, but I'm opening this in case it's something that should be fixed.

jchiang87 commented 2 years ago

Are you running with parsl monitoring enabled? In the parsl_config section of the bps yaml, that would be enabled with

parsl_config:
   monitoring: True

The part of the code where that ValueError occurs is only executed if parsl monitoring is disabled and the state of each job is ascertained from the log files. The ctrl_bps code was changed some months ago to use UUIDs for the qgraphNodeId instead of ints. For large workflows, it's inefficient to use the logs for job state, so enabling monitoring is preferred. Nevertheless, we can put in a fix for this.
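To illustrate the mismatch described above, here is a minimal sketch of a tolerant conversion that accepts both the older integer node IDs and the newer UUID strings. The helper name `parse_node_id` is hypothetical and is not taken from gen3_workflow or ctrl_bps; this is not necessarily how the actual fix was implemented.

```python
import uuid
from typing import Union


def parse_node_id(raw: str) -> Union[int, uuid.UUID]:
    """Hypothetical helper: convert a qgraphNodeId string to int or UUID.

    Older quantum graphs used integer node IDs, newer ones use UUIDs,
    so try the integer form first and fall back to a UUID.
    """
    try:
        return int(raw)
    except ValueError:
        # uuid.UUID raises ValueError itself if the string is not a valid UUID.
        return uuid.UUID(raw)


# The integer form still works, and the UUID from the traceback now parses.
print(parse_node_id("42"))
print(parse_node_id("5819587f-f6b4-4c47-aa3a-c6e65b43054d"))
```

Whether the rest of the code can use a `uuid.UUID` wherever it previously expected an `int` depends on how the node ID is looked up in the quantum graph, which this sketch does not address.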

stevenstetzler commented 2 years ago

I'm running with monitoring off. Any idea why this has only shown up in one context? I haven't had this issue before in any runs using the same version of the stack.

I'm running with monitoring off because I wanted to run multiple bps submit commands at once from the same node, but the WorkQueue and MonitoringHub ports conflicted. I made a patch so that the WorkQueue port can be defined through YAML and/or environment variables, but the corresponding change for the MonitoringHub was more involved, so I circumvented it by turning monitoring off.
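For anyone hitting the same port conflict, the idea can be sketched roughly as follows. This is not the actual patch: the function name `resolve_port`, the config key, the environment variable name, and the rule that the environment variable takes precedence over the YAML value are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_port(
    config: Mapping[str, object],
    key: str,
    env_var: str,
    default: int,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Hypothetical helper: pick a port from the environment, then YAML, then a default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var)
    if raw is None:
        # Fall back to the value parsed from the bps YAML, if present.
        raw = config.get(key, default)
    port = int(raw)
    if not 0 < port < 65536:
        raise ValueError(f"port {port} is outside the valid range 1-65535")
    return port


# Two submissions on the same node can then be given different ports.
yaml_config = {"port": 9200}  # stands in for the parsed parsl_config section
print(resolve_port(yaml_config, "port", "GEN3_WQ_PORT", 9123, environ={}))
print(resolve_port(yaml_config, "port", "GEN3_WQ_PORT", 9123,
                   environ={"GEN3_WQ_PORT": "9300"}))
```

Letting the environment variable win means each shell can override a shared YAML file without editing it, which suits launching several bps submit commands side by side.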

jchiang87 commented 2 years ago

Any idea why this has only shown up in one context?

Other than explicitly trying to get the workflow status via the Python API, I think it would probably arise from a retry of a job that failed on its first attempt.

jchiang87 commented 2 years ago

closed by #53