argonne-lcf / balsam

High throughput workflows and automation for HPC

Balsam with Flux Framework? #343

Open vsoch opened 1 year ago

vsoch commented 1 year ago

Hi! I'm looking at the tutorial here: https://github.com/CrossFacilityWorkflows/DOE-HPC-workflow-training/tree/main/Balsam and trying to imagine how this works with a job manager like Flux Framework. Here are some of my early guesses so far:

And plotting the results seems straightforward. Thanks for the help / advice - I will likely just start playing around with it and figure out some of these concepts, but I wanted to start a conversation here in anticipation of getting greater insights first.

vsoch commented 1 year ago

okay balsam seems to be some server running at your lab? Does that mean it's only available there or can we set it up, one off?


cms21 commented 1 year ago

Hi @vsoch,

Sorry, the tutorial materials should have been more clear. Balsam uses a remote server that hosts a database for the user that stores aspects of the workflow (such as Applications, Jobs, etc.). Currently, there is only one Balsam server hosted at ALCF. If you just want to test out Balsam, we can look into getting you an account to access the server. We do have instructions for setting up a server, if that is something you'd be interested in trying.

To answer some of your other questions, I'm not familiar with Flux Framework, but if it's a scheduler like SLURM or PBS Pro, what one could do is implement a FluxFrameworkClass (e.g. like the slurm example here). We'd also have to know what launcher you use and implement an AppRun class (like mpiexec here).
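As a rough illustration of what such a scheduler class and app-run class would have to produce, here is a minimal sketch that only builds the command lines. It does not use Balsam's actual base classes: `FluxSubmitRequest`, `render_flux_batch_args`, and `render_flux_run_args` are hypothetical names, and the `flux batch` / `flux run` flags are assumed from the Flux CLI documentation, so check them against your installed flux-core version (older releases used `flux mini run`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FluxSubmitRequest:
    """Hypothetical container for what a scheduler class needs to submit a script."""

    script_path: str
    num_nodes: int
    wall_time_min: int
    queue: Optional[str] = None


def render_flux_batch_args(req: FluxSubmitRequest) -> List[str]:
    """Build an argv list for `flux batch` (flag names are an assumption)."""
    args = [
        "flux",
        "batch",
        f"--nodes={req.num_nodes}",
        # Flux accepts durations such as "30m" (Flux standard duration).
        f"--time-limit={req.wall_time_min}m",
    ]
    if req.queue is not None:
        args.append(f"--queue={req.queue}")
    args.append(req.script_path)
    return args


def render_flux_run_args(num_nodes: int, num_tasks: int, command: List[str]) -> List[str]:
    """Build an argv list for launching one task with `flux run` (the AppRun analogue)."""
    return ["flux", "run", f"--nodes={num_nodes}", f"--ntasks={num_tasks}", *command]


if __name__ == "__main__":
    request = FluxSubmitRequest("submit.sh", num_nodes=2, wall_time_min=30, queue="debug")
    print(" ".join(render_flux_batch_args(request)))
    print(" ".join(render_flux_run_args(2, 8, ["python", "app.py"])))
```

A real integration would also need to parse the job ID that `flux batch` prints and to query job status, which this sketch leaves out.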

The Applications and Jobs are stored within the database hosted on the Balsam server. In the tutorial example, the jobs created needed to know what application they were running (app_id) and in what Site (site_name). The Site is the project space for the user's workflow that has a representation on the client side (on the machine file system) and on the server side (within the database). Setting up the site involves choosing some settings for the machine (such as the scheduler class and app run class described above), but also things about the user's allocation, the machine queues, etc. For the submit step, running that code causes Balsam to create a submit script and submit it to the scheduler on the machine where it's running (we currently have support for SLURM, PBS Pro, and Cobalt).

If Balsam is still something you'd like to test out, let us know.

vsoch commented 1 year ago

@cms21 no worries - actually that link to the balsam docs is great. Maybe it would be good to add it as the repository URL at the top right (alongside the description)?

I think if it was in the README somewhere I missed it!

And I think adding a FluxFramework class is a great idea - I won't have time today but I'll add this to my TODO and we can use this issue for tracking and discussion. For some context - I'm wanting to test this out in the Flux Operator and it would make sense to set up the same hosted server, just in Kubernetes! I got pretty far today until I realized we need to do additional work to add the flux class. But I'm having some problems with the container build. Here is the Dockerfile:

FROM python:3-slim

# This is a demo container for balsam + the flux operator.
# It should not be used in production!

# Any setting defined in the balsam.server.conf.Settings class
# can be set as an environment variable below.
# Settings use either the BALSAM_, BALSAM_AUTH_, or BALSAM_OAUTH_ prefix,
# depending on the category. Other (secret) settings are defined by
# the operator

ENV GUNICORN_CONFIG_FILE="/balsam/gunicorn.conf.py"
ENV SERVER_PORT=8000

# Logging
ENV BALSAM_LOG_LEVEL=INFO
ENV BALSAM_LOG_DIR="./balsam-logs"
ENV BALSAM_AUTH_TOKEN_TTL=259200
ENV BALSAM_AUTH_LOGIN_METHODS='["password", "oauth_authcode", "oauth_device"]'
ENV BALSAM_OAUTH_SCOPE="read_basic_user_data"

WORKDIR /balsam

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y wget \
       lsb-release \
       git \
       gnupg && \
       sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' && \
       wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - && \
    apt-get install -y build-essential \
       postgresql \
       libpq-dev && \
    apt-get clean all && \
    rm -rf /var/lib/apt/lists/*

RUN git clone --depth 1 https://github.com/argonne-lcf/balsam /balsam
WORKDIR /balsam
RUN pip install --upgrade pip && pip install -r requirements/deploy.txt
RUN mkdir -p /balsam/log && \
    cp /balsam/balsam/server/gunicorn.conf.example.py /balsam/gunicorn.conf.py
COPY ./entrypoint.sh /balsam/entrypoint.sh
ENTRYPOINT ["/balsam/entrypoint.sh"]

The log error (it seems to be choking on the path):

Error: class uri 'balsam.server.gunicorn_logger.RotatingGunicornLogger' invalid or not found: 

[Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/gunicorn/util.py", line 99, in load_class
    mod = importlib.import_module('.'.join(components))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/balsam/balsam/server/__init__.py", line 1, in <module>
    from fastapi import HTTPException, status
  File "/usr/local/lib/python3.11/site-packages/fastapi/__init__.py", line 7, in <module>
    from .applications import FastAPI as FastAPI
  File "/usr/local/lib/python3.11/site-packages/fastapi/applications.py", line 15, in <module>
    from fastapi import routing
  File "/usr/local/lib/python3.11/site-packages/fastapi/routing.py", line 23, in <module>
    from fastapi.dependencies.models import Dependant
  File "/usr/local/lib/python3.11/site-packages/fastapi/dependencies/models.py", line 3, in <module>
    from fastapi.security.base import SecurityBase
  File "/usr/local/lib/python3.11/site-packages/fastapi/security/__init__.py", line 1, in <module>
    from .api_key import APIKeyCookie as APIKeyCookie
  File "/usr/local/lib/python3.11/site-packages/fastapi/security/api_key.py", line 3, in <module>
    from fastapi.openapi.models import APIKey, APIKeyIn
  File "/usr/local/lib/python3.11/site-packages/fastapi/openapi/models.py", line 103, in <module>
    class Schema(BaseModel):
  File "/usr/local/lib/python3.11/site-packages/pydantic/main.py", line 292, in __new__
    cls.__signature__ = ClassAttribute('__signature__', generate_model_signature(cls.__init__, fields, config))
                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydantic/utils.py", line 258, in generate_model_signature
    merged_params[param_name] = Parameter(
                                ^^^^^^^^^^
  File "/usr/local/lib/python3.11/inspect.py", line 2722, in __init__
    raise ValueError('{!r} is not a valid parameter name'.format(name))
ValueError: 'not' is not a valid parameter name

And my tweaked entrypoint.sh - I wanted to run the migrate command too:

#!/bin/bash

export BALSAM_LOG_DIR="/balsam/log"
mkdir -p $BALSAM_LOG_DIR
gunicorn balsam server migrate || echo "gunicorn balsam server migrate not successful"
gunicorn --print-config -c /balsam/gunicorn.conf.py balsam.server.main:app
exec gunicorn -c /balsam/gunicorn.conf.py balsam.server.main:app

And I have some of the envars (that I saw for the docker-compose setup) defined by the flux operator:

  services:
    - image: postgres
      name: postgres
      ports:
        - 5432
      environment:
         POSTGRES_USER: postgres
         POSTGRES_PASSWORD: postgres
         POSTGRES_DB: balsam
    - image: redis
      name: redis
      ports:
        - 6379
    - image: ghcr.io/rse-ops/balsam-base:tag-latest
      name: balsam
      workingDir: /balsam
      ports:
        - 8000
      environment:
         BALSAM_DATABASE_URL: "postgresql://postgres:postgres@flux-sample-services.flux-service.flux-operator.svc.cluster.local:5432/balsam"
         BALSAM_REDIS_PARAMS: '{"host": "flux-sample-services.flux-service.flux-operator.svc.cluster.local", "port": "6379"}'
         BALSAM_AUTH_SECRET_KEY: "SOME_SECRET_KEY"
         BALSAM_OAUTH_CLIENT_ID: "SOME_CLIENT_ID"
         BALSAM_OAUTH_CLIENT_SECRET: "SOME_CLIENT_SECRET"
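
One way to rule out malformed connection settings before starting the server is a small standalone check like the one below. This is only a sketch using the standard library: `check_balsam_env` is a hypothetical helper, not part of Balsam, and it does not reproduce how Balsam itself parses these variables.

```python
import json
from urllib.parse import urlsplit


def check_balsam_env(env: dict) -> dict:
    """Parse the two connection settings and return the pieces a server would need.

    Raises ValueError (json.JSONDecodeError is a subclass) when a value is malformed.
    """
    db = urlsplit(env["BALSAM_DATABASE_URL"])
    if db.scheme != "postgresql" or not db.hostname:
        raise ValueError(f"unexpected database URL: {env['BALSAM_DATABASE_URL']!r}")
    redis = json.loads(env["BALSAM_REDIS_PARAMS"])
    return {
        "db_host": db.hostname,
        "db_port": db.port,
        "db_name": db.path.lstrip("/"),
        "redis_host": redis["host"],
        # The YAML above stores the port as a JSON string, so convert it here.
        "redis_port": int(redis["port"]),
    }


if __name__ == "__main__":
    host = "flux-sample-services.flux-service.flux-operator.svc.cluster.local"
    example = {
        "BALSAM_DATABASE_URL": f"postgresql://postgres:postgres@{host}:5432/balsam",
        "BALSAM_REDIS_PARAMS": json.dumps({"host": host, "port": "6379"}),
    }
    print(check_balsam_env(example))
```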

Let me know if you want to see anything else, or if anything sticks out to you. I've added it to my TODO to look into adding Flux to balsam, and likely I won't need to resolve the above issues until after that!

cms21 commented 1 year ago

Hi @vsoch, thanks for the feedback and for sharing the error. Perhaps it might be helpful to have a brief meeting to discuss your plans? @tomuram and I might be able to be more helpful if we have a better idea of your goals.

vsoch commented 1 year ago

hey @cms21 that sounds like a great idea and I might take you up on it! Let me futz around a little bit with setting up a development environment, and then see if I'm able to add Flux. My goals are fairly simple - I'm testing out every workflow tool / simulation that I can with the Flux Operator, the goal being to get a nice survey of the landscape and try to start paving a direction for what I'd (personally) like for workflows at my institution. It's not uncharted territory because there are a ton of tools, but it's certainly not a paved path because I haven't really identified a leader in the space yet.

I will do some work and learning and follow up here! Thank you for the kind offer!

vsoch commented 1 year ago

okay I'm following your docker-compose setup in your docker.yaml, and hitting the same issue:

gunicorn  |     raise ValueError('{!r} is not a valid parameter name'.format(name))
gunicorn  | ValueError: 'not' is not a valid parameter name
gunicorn  | access_log_format                 = {'remote': '%(h)s', 'date': '%(t)s', 'request': '%(r)s', 'status': '%(s)s', 'response_sec': '%(L)s'}
gunicorn  | accesslog                         = /balsam/log/gunicorn.access
gunicorn  | backlog                           = 2048
gunicorn  | bind                              = ['0.0.0.0:8000']
gunicorn  | ca_certs                          = None
gunicorn  | capture_output                    = True
gunicorn  | cert_reqs                         = 0
gunicorn  | certfile                          = None
gunicorn  | chdir                             = /balsam
gunicorn  | check_config                      = False
gunicorn  | child_exit                        = <ChildExit.child_exit()>
gunicorn  | ciphers                           = None
gunicorn  | config                            = /balsam/gunicorn.conf.py
gunicorn  | daemon                            = False
gunicorn  | default_proc_name                 = balsam.server.main:app
gunicorn  | disable_redirect_access_to_syslog = False
gunicorn  | do_handshake_on_connect           = False
gunicorn  | dogstatsd_tags                    = 
gunicorn  | enable_stdio_inheritance          = False
gunicorn  | errorlog                          = /balsam/log/gunicorn.error
gunicorn  | forwarded_allow_ips               = ['127.0.0.1']
gunicorn  | graceful_timeout                  = 30
gunicorn  | group                             = 0
gunicorn  | initgroups                        = False
gunicorn  | keepalive                         = 2
gunicorn  | keyfile                           = None
gunicorn  | limit_request_field_size          = 8190
gunicorn  | limit_request_fields              = 100
gunicorn  | limit_request_line                = 4094
gunicorn  | logconfig                         = None
gunicorn  | logconfig_dict                    = {}
gunicorn  | logger_class                      = balsam.server.gunicorn_logger.RotatingGunicornLogger
gunicorn  | loglevel                          = info
gunicorn  | max_requests                      = 0
gunicorn  | max_requests_jitter               = 0
gunicorn  | nworkers_changed                  = <NumWorkersChanged.nworkers_changed()>
gunicorn  | on_exit                           = <OnExit.on_exit()>
gunicorn  | on_reload                         = <OnReload.on_reload()>
gunicorn  | on_starting                       = <OnStarting.on_starting()>
gunicorn  | paste                             = None
gunicorn  | pidfile                           = gunicorn.pid
gunicorn  | post_fork                         = <Postfork.post_fork()>
gunicorn  | post_request                      = <PostRequest.post_request()>
gunicorn  | post_worker_init                  = <PostWorkerInit.post_worker_init()>
gunicorn  | pre_exec                          = <PreExec.pre_exec()>
gunicorn  | pre_fork                          = <Prefork.pre_fork()>
gunicorn  | pre_request                       = <PreRequest.pre_request()>
gunicorn  | preload_app                       = False
gunicorn  | print_config                      = True
gunicorn  | proc_name                         = balsam-server
gunicorn  | proxy_allow_ips                   = ['127.0.0.1']
gunicorn  | proxy_protocol                    = False
gunicorn  | pythonpath                        = None
gunicorn  | raw_env                           = []
gunicorn  | raw_paste_global_conf             = []
gunicorn  | reload                            = False
gunicorn  | reload_engine                     = auto
gunicorn  | reload_extra_files                = []
gunicorn  | reuse_port                        = False
gunicorn  | secure_scheme_headers             = {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
gunicorn  | sendfile                          = None
gunicorn  | spew                              = False
gunicorn  | ssl_version                       = 2
gunicorn  | statsd_host                       = None
gunicorn  | statsd_prefix                     = 
gunicorn  | strip_header_spaces               = False
gunicorn  | suppress_ragged_eofs              = True
gunicorn  | syslog                            = False
gunicorn  | syslog_addr                       = udp://localhost:514
gunicorn  | syslog_facility                   = user
gunicorn  | syslog_prefix                     = None
gunicorn  | threads                           = 1
gunicorn  | timeout                           = 60
gunicorn  | tmp_upload_dir                    = None
gunicorn  | umask                             = 0
gunicorn  | user                              = 0
gunicorn  | when_ready                        = <WhenReady.when_ready()>
gunicorn  | worker_abort                      = <WorkerAbort.worker_abort()>
gunicorn  | worker_class                      = uvicorn.workers.UvicornWorker
gunicorn  | worker_connections                = 1000
gunicorn  | worker_exit                       = <WorkerExit.worker_exit()>
gunicorn  | worker_int                        = <WorkerInt.worker_int()>
gunicorn  | worker_tmp_dir                    = None
gunicorn  | workers                           = 1
gunicorn  | wsgi_app                          = None
gunicorn  | 
gunicorn  | Error: class uri 'balsam.server.gunicorn_logger.RotatingGunicornLogger' invalid or not found: 
gunicorn  | 
gunicorn  | [Traceback (most recent call last):
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/gunicorn/util.py", line 99, in load_class
gunicorn  |     mod = importlib.import_module('.'.join(components))
gunicorn  |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gunicorn  |   File "/usr/local/lib/python3.11/importlib/__init__.py", line 126, in import_module
gunicorn  |     return _bootstrap._gcd_import(name[level:], package, level)
gunicorn  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gunicorn  |   File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
gunicorn  |   File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
gunicorn  |   File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
gunicorn  |   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
gunicorn  |   File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
gunicorn  |   File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
gunicorn  |   File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
gunicorn  |   File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
gunicorn  |   File "<frozen importlib._bootstrap_external>", line 940, in exec_module
gunicorn  |   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
gunicorn  |   File "/balsam/balsam/server/__init__.py", line 1, in <module>
gunicorn  |     from fastapi import HTTPException, status
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/fastapi/__init__.py", line 7, in <module>
gunicorn  |     from .applications import FastAPI as FastAPI
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/fastapi/applications.py", line 15, in <module>
gunicorn  |     from fastapi import routing
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/fastapi/routing.py", line 23, in <module>
gunicorn  |     from fastapi.dependencies.models import Dependant
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/fastapi/dependencies/models.py", line 3, in <module>
gunicorn  |     from fastapi.security.base import SecurityBase
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/fastapi/security/__init__.py", line 1, in <module>
gunicorn  |     from .api_key import APIKeyCookie as APIKeyCookie
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/fastapi/security/api_key.py", line 3, in <module>
gunicorn  |     from fastapi.openapi.models import APIKey, APIKeyIn
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/fastapi/openapi/models.py", line 103, in <module>
gunicorn  |     class Schema(BaseModel):
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/pydantic/main.py", line 292, in __new__
gunicorn  |     cls.__signature__ = ClassAttribute('__signature__', generate_model_signature(cls.__init__, fields, config))
gunicorn  |                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gunicorn  |   File "/usr/local/lib/python3.11/site-packages/pydantic/utils.py", line 258, in generate_model_signature
gunicorn  |     merged_params[param_name] = Parameter(
gunicorn  |                                 ^^^^^^^^^^
gunicorn  |   File "/usr/local/lib/python3.11/inspect.py", line 2722, in __init__
gunicorn  |     raise ValueError('{!r} is not a valid parameter name'.format(name))
gunicorn  | ValueError: 'not' is not a valid parameter name
gunicorn  | ]
gunicorn  | 

Does this look familiar? I just need to set up a development environment.

basvandervlies commented 1 year ago

I also have the above error with the docker compose setup. I have found the issue: the latest python:3-slim Docker image is Python 3.11, which triggers this error. When I switch to FROM python:3.10-slim, the gunicorn image works as expected.
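
For anyone who wants to confirm the version dependence without building the image, the traceback suggests that the installed pydantic passes the name 'not' to `inspect.Parameter`, which Python 3.11 rejects because it is a keyword. The sketch below reproduces only that one call with the standard library; `can_be_parameter_name` is a hypothetical helper, and the 3.10 vs 3.11 difference is an assumption based on the behaviour reported in this thread.

```python
import inspect
import keyword
import sys


def can_be_parameter_name(name: str) -> bool:
    """Return True if inspect.Parameter accepts `name` as a keyword-only parameter."""
    try:
        inspect.Parameter(name, inspect.Parameter.KEYWORD_ONLY)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # Print the interpreter version next to the result so runs can be compared.
    print("python:", ".".join(map(str, sys.version_info[:3])))
    print("'not' is a keyword:", keyword.iskeyword("not"))
    print("'not' accepted as parameter name:", can_be_parameter_name("not"))
```

Running it under both python:3.10-slim and python:3.11-slim should show whether the interpreter, rather than the Balsam code, is what changed.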

vsoch commented 1 year ago

@basvandervlies this is super helpful! It looks like I was using python 3.11 too:

root@677f1376fb4e:/balsam# python --version
Python 3.11.3

I'll try downgrading.

vsoch commented 1 year ago

Fix for the image: https://github.com/argonne-lcf/balsam/pull/363