DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Ability to use dsub's python api directly in a python script #205

Open indraniel opened 4 years ago

indraniel commented 4 years ago

We were interested in using dsub inside Google Cloud Functions. It appears that Google Cloud Functions doesn't allow making subprocess calls. Is it possible to call dsub through its Python API directly? If so, could an example be shared in the documentation?

wnojopra commented 4 years ago

Hi @indraniel !

We currently mention in our Python API doc and in a few places in the code that you are discouraged from using the Python API. The main reason is that the API may be unstable between dsub releases, as we do not have a fully developed specification for it. Another reason is that we have not focused any effort on developing a good Python API, as we haven't had a compelling reason to do so.

Using a dsub Python API inside a Cloud Function may be just the use case dsub was looking for to motivate further development of the Python API. I've started that process internally, and we may see better documentation and a Python API spec in an upcoming release.

All that being said, yes, it is possible to call dsub using its Python API. It isn't documented at all, but we do have an integration test that gives a working example. I went ahead and plucked out relevant lines to create a cloud function that submits a 'hello world' dsub job. Feel free to try this out at your own risk (this code may not work on dsub != 0.3.9, there's currently no documentation on how to change some of the parameters, etc).

from dsub.commands import dsub
from dsub.providers import google_cls_v2
from dsub.lib import job_model
from dsub.lib import param_util

def hello_world_dsub(request):
    """Responds to any HTTP request.
    Args:
        request (flask.Request): HTTP request object, ignored in this dsub example. 
    Returns:
        A string containing the launched job's id, user-id, and tasks-ids. 
    """
    project = 'MY_GCP_PROJECT'
    location = 'us-central1'
    logging_path = 'gs://MY_GCS_BUCKET/logs'
    job_name = 'cloud_function_test_job'

    logging = param_util.build_logging_param(logging_path)

    job_resources = job_model.Resources(image='ubuntu', logging=logging, zones=['us-central1-*'])

    # empty params (no --env, --input, --output, or --labels)
    job_params = {
        'envs': set(),
        'inputs': set(),
        'outputs': set(),
        'labels': set(),
    }

    # empty tasks (just launching one simple job, similar to not using --tasks)
    task_descriptors = [
        job_model.TaskDescriptor({
            'task-id': None
        }, {
            'envs': set(),
            'labels': set(),
            'inputs': set(),
            'outputs': set(),
        }, job_model.Resources())
    ]

    launched_job = dsub.run(
        google_cls_v2.GoogleCLSV2JobProvider(dry_run=False, project=project, location=location),
        job_resources,
        job_params,
        task_descriptors,
        name=job_name,
        command='echo HELLOWORLD',
        wait=False,
        disable_warning=True  # If not disabled, dsub will error saying this API is unstable
    )

    return str(launched_job)

indraniel commented 4 years ago

Thanks for the response and sharing a short-term approach regarding the Python API!

I must admit that it makes me leery to use dsub like this in a production service, given the caveats in the Python API doc and in your reply above:

The main reason for this is that the API may be unstable between dsub releases as we do not have a fully developed specification for it. ... Feel free to try this out at your own risk (this code may not work on dsub != 0.3.9, there's currently no documentation on how to change some of the parameters, etc).

It seems a more flexible and stable option in the meantime would be to use Google Cloud Run, where we can make subprocess calls to dsub from a Python script, as in the sketch below.
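For reference, here's a rough sketch of what that might look like. It assumes the dsub CLI is installed and on the PATH inside the Cloud Run container; the project, bucket, and image values are placeholders:

import subprocess

def submit_hello_world_job():
    """Shell out to the dsub CLI instead of importing its Python internals."""
    # All values below are placeholders; substitute your own project and bucket.
    cmd = [
        'dsub',
        '--provider', 'google-cls-v2',
        '--project', 'MY_GCP_PROJECT',
        '--zones', 'us-central1-*',
        '--logging', 'gs://MY_GCS_BUCKET/logs',
        '--image', 'ubuntu',
        '--command', 'echo HELLOWORLD',
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # dsub prints the launched job-id on stdout.
    return result.stdout.strip()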

But it's good to hear that this may be on the longer-term feature roadmap for dsub:

Using a dsub Python API inside a Cloud Function may just be the use case dsub was looking for to motivate further development of the Python API. I've started the motions for this internally, and we may see better documentation on a Python API spec in an upcoming release.

The ability to use dsub's Python API directly might allow for some imaginative use cases with Google Cloud batch computing.

I'll go ahead and experiment with your code example above. Thanks again for sharing it, and for letting us know that you all may elaborate on this feature in the future!

apgiorgi commented 4 years ago

Hi @wnojopra,

Thanks to your working example, I've set up a simple automated dsub pipeline: Cloud Storage notifications trigger a Cloud Function that submits jobs through dsub's Python API.

With this setup, I just need to upload a file into a "queue" bucket (in my case, a simulation configuration file) to trigger a new job. It's still a bit raw, but my team is already using it to push dozens of jobs at once.

The only undocumented piece not covered in your example was building the input_data object with InputFileParamUtil. Here's a simplified snippet:

    # Create the input files object
    input_file_param_util = param_util.InputFileParamUtil("input")
    input_data = set()
    input_data.add(
        input_file_param_util.make_param(
            "INPUT_FILE", f"gs://{event['bucket']}/{event['name']}", False
        )
    )

    # Job params (--env, --input, --output, and --labels)
    job_params = {
        "envs": set(),
        "inputs": input_data,
        "outputs": set(),
        "labels": set(),
    }
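
To show roughly how this fits together with the submission code from the example above, here is a sketch of the full storage-triggered function. The function name, job name, command, and project/bucket values are placeholders, and like the original example it is only known to work against dsub 0.3.9's internals:

from dsub.commands import dsub
from dsub.providers import google_cls_v2
from dsub.lib import job_model
from dsub.lib import param_util

# Placeholder values for illustration.
PROJECT = 'MY_GCP_PROJECT'
LOCATION = 'us-central1'
LOGGING_PATH = 'gs://MY_GCS_BUCKET/logs'

def on_config_upload(event, context):
    """Background Cloud Function triggered by a Cloud Storage 'finalize' event."""
    # Wrap the uploaded object as a dsub --input parameter.
    input_file_param_util = param_util.InputFileParamUtil('input')
    input_data = set()
    input_data.add(
        input_file_param_util.make_param(
            'INPUT_FILE', f"gs://{event['bucket']}/{event['name']}", False
        )
    )

    # Job params (--env, --input, --output, and --labels)
    job_params = {
        'envs': set(),
        'inputs': input_data,
        'outputs': set(),
        'labels': set(),
    }

    job_resources = job_model.Resources(
        image='ubuntu',
        logging=param_util.build_logging_param(LOGGING_PATH),
        zones=['us-central1-*'],
    )

    # Single job, no --tasks file.
    task_descriptors = [
        job_model.TaskDescriptor(
            {'task-id': None},
            {'envs': set(), 'labels': set(), 'inputs': set(), 'outputs': set()},
            job_model.Resources(),
        )
    ]

    launched_job = dsub.run(
        google_cls_v2.GoogleCLSV2JobProvider(
            dry_run=False, project=PROJECT, location=LOCATION),
        job_resources,
        job_params,
        task_descriptors,
        name='simulation-job',
        command='echo "Processing ${INPUT_FILE}"',
        wait=False,
        disable_warning=True,
    )
    return str(launched_job)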

Thanks a lot for building dsub, and I'm looking forward to testing and contributing to the Python API as it matures.