botify-labs / simpleflow

Python library for dataflow programming.
https://botify-labs.github.com/simpleflow/
MIT License
68 stars 24 forks source link

Activities orphan subprocesses? #354

Closed mookerji closed 5 years ago

mookerji commented 5 years ago

Hello!

I've run into this issue with using simpleflow activities to supervise an external subprocess (i.e., a long-running bash or Python executable). I see that you have some of this behavior captured in simpleflow.execute: do you use these in actual workflows? Separately, would make sense to modify the the ActivityPoller's spawn code to reap the entire process tree at the worker PID.

Here's a modified version of examples/basic.py that reproduces the issue.

from __future__ import print_function

from simpleflow import (
    activity,
    Workflow,
    futures,
)

import subprocess

@activity.with_attributes(task_list='quickstart',
                          version='example',
                          heartbeat_timeout=60,
                          retry=2,
                          start_to_close_timeout=40)
class Delay(object):
    def __init__(self, t):
        self.t = t

    def execute(self):
        print("DEBUG: activity context: {}".format(self.context))
        ret = subprocess.run(["while true; do echo $$ && echo \"foo\"; sleep 2; done"] ,
                             shell=True,
                             executable="/bin/bash")
        ret.check_returncode()
        return self.t

class BasicWorkflow(Workflow):
    name = 'basic'
    version = 'example'
    task_list = 'example'
    tag_list = ['a=1', 'b=foo']

    def run(self, t):
        execution = self.get_run_context()
        print('DEBUG: execution context: {}'.format(execution))
        x = self.submit(Delay, t)
        futures.wait(x)
        return x.result

And some commands to start:

simpleflow worker.start --domain helloworld --task-list quickstart --nb-processes 1 --heartbeat 15
simpleflow decider.start --domain helloworld  basic.BasicWorkflow --nb-processes 1

You'll see that that on retry, the forked subprocess from first activity execution is still printing its PID once the retried execution starts:

2018-12-16T13:25:18 INFO [process=MainProcess, pid=45677]: starting <bound method Poller.start of ActivityPoller(domain=helloworld, task_list=quickstart)>
2018-12-16T13:25:18 INFO [process=Process-1, pid=45680]: starting ActivityPoller(task_list=quickstart) on domain helloworld
DEBUG: activity context: {'name': 'basic.Delay', 'version': 'example', 'workflow_id': '2db6cf7f-d267-4888-89cd-0ea920d066bd', 'run_id': '23AUMFaOFuuNg00NuqPXbuwoPG8ZCiD+vEQjZlGAlm6o4=', 'activity_id': 'activity-basic.Delay-1', 'input': '{"args":[60],"kwargs":{}}', 'domain_name': 'helloworld'}
45698
foo
...
45698
foo
2018-12-16T13:26:31 WARNING [process=Process-1, pid=45680]: heartbeat failed: Unable to send heartbeat with token=AAAAKgAAAAIAAAAAAAAAA4YWM15iky7csy3cBAfQDerKLcNeRdwZSnT9H97Z/M4gSi6IfRDRF3/iujZxQYouXtUmDjaonEaOb3jXUXFRW8RulY9joxuex9fv9wYg2oF3ykOVuoeqD/aYiU0QXh35vPH3uPjXqcBU/CXp+awZfXo2rHkufESOirlozOOU5bR60jjceD/u7438vS+XH6uVVWBOuCbJN+yuIhms9I0qgmhkZvxxa3FzSJO1vnOcFr1qY/COwQuBLlbQ94FeKWmFDFl9+UohWT6DqRXkFhN6C5kAFIWUIJcCLTTMxk73Hq4ASIIxxf19DcOQGLjW1Ut0xw==
2018-12-16T13:26:31 WARNING [process=Process-1, pid=45680]: killing (KILL) worker with pid=45697
45698
foo
DEBUG: activity context: {'name': 'basic.Delay', 'version': 'example', 'workflow_id': '2db6cf7f-d267-4888-89cd-0ea920d066bd', 'run_id': '23AUMFaOFuuNg00NuqPXbuwoPG8ZCiD+vEQjZlGAlm6o4=', 'activity_id': 'activity-basic.Delay-1', 'input': '{"args":[60],"kwargs":{}}', 'domain_name': 'helloworld'}
45725
foo
45698
foo
45725
foo
45698

I have a patch available that fixes this, which I'll put up and link to this issue. If there's a more idiomatic way to use the simpleflow API to achieve this, I'm happy to use that too, although it seems like this is probably undesirable behavior.

Edit: patch linked. If it seems ok, let me know and I can add test coverage.

ybastide commented 5 years ago

Nice patch! Yes, please add a test case and I'll merge it ASAP

Yves

mookerji commented 5 years ago

@ybastide thanks for the quick reply, will do!

I have a question around application configuration for the termination timeout, which I'll note in the PR.

nstott commented 5 years ago

Hi, this PR is relevant to my interests, and would be interested in seeing this merged. Is there any sort of timeline for this, or anything outstanding? Thanks

ybastide commented 5 years ago

Hi @nstott; it should be merged in a few days, unless @jbbarth has objections

jbbarth commented 5 years ago

No objection, go ahead @ybastide, and thanks @nstott ! :-)