argonne-lcf / balsam

High throughput workflows and automation for HPC

Headless authentication flow? #362

Open vsoch opened 1 year ago

vsoch commented 1 year ago

I've finally gotten Balsam running in my containers, but I'm having trouble understanding auth. When I run the login command, I get server errors:

# balsam login --url ${BALSAM_TEST_API_URL}
Logging into http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000
WARNING|balsam.client.requests_client:102] Attempt retry (0 of 10) of connection: 500 Server Error: Internal Server Error for url: http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000/auth/device/login
WARNING|balsam.client.requests_client:102] Attempt retry (1 of 10) of connection: 500 Server Error: Internal Server Error for url: http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000/auth/device/login
WARNING|balsam.client.requests_client:102] Attempt retry (2 of 10) of connection: 500 Server Error: Internal Server Error for url: http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000/auth/device/login
WARNING|balsam.client.requests_client:102] Attempt retry (3 of 10) of connection: 500 Server Error: Internal Server Error for url: http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000/auth/device/login
WARNING|balsam.client.requests_client:102] Attempt retry (4 of 10) of connection: 500 Server Error: Internal Server Error for url: http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000/auth/device/login

And the server logs:

[2023-06-09 17:36:40 +0000] [17] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedTable: relation "device_code_attempts" does not exist
LINE 1: INSERT INTO device_code_attempts (client_id, expiration, dev...
                    ^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 404, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 269, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 68, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/balsam/balsam/server/utils/timer.py", line 44, in dispatch
    response = await call_next(request)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 46, in call_next
    raise app_exc
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 36, in coro
    await self.app(scope, request.receive, send_stream.send)
  File "/usr/local/lib/python3.10/site-packages/starlette/exceptions.py", line 93, in __call__
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/exceptions.py", line 82, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 670, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 266, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 65, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 227, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 162, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/usr/local/lib/python3.10/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/balsam/balsam/server/auth/device_code_login.py", line 75, in authorization_request
    users.create_device_code_attempt(
  File "/balsam/balsam/server/models/crud/users.py", line 55, in create_device_code_attempt
    db.flush()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3386, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3525, in _flush
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3486, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    _emit_insert_statements(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 1097, in _emit_insert_statements
    c = connection._execute_20(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 333, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1572, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2124, in _handle_dbapi_exception
    util.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "device_code_attempts" does not exist
LINE 1: INSERT INTO device_code_attempts (client_id, expiration, dev...
                    ^

[SQL: INSERT INTO device_code_attempts (client_id, expiration, device_code, user_code, scope, user_denied, user_id) VALUES (%(client_id)s, %(expiration)s, %(device_code)s, %(user_code)s, %(scope)s, %(user_denied)s, %(user_id)s)]
[parameters: {'client_id': UUID('32160413-b606-4115-9f35-f38e7de91189'), 'expiration': datetime.datetime(2023, 6, 9, 17, 41, 40, 612787), 'device_code': 'nWnOmCAWOAoALUyH_ZGbod0VqLBfI9pDqltja08kr9aI9aWra0Kry02RLHeNokJfU_8mF_hFmDkeMqPBPHFDPbEe125WFrFvRHM6YOwVqUp1DsEYm-VUUB1kK5CBK8cDJ4JPHbMmUgj-9bjI2k_FZ ... (44 characters truncated) ... 2SxciUbY5pd1FajvR6takga87-g11Fnr1RhxiI25aFMp38ST7yFgiqtTvc09aME04N7Zyw-QCjOloFinUD0PhxS7Vw0SoggdR6ac0t0yRMK6yB4jaCIl6h_1phq0GdQLrfZk6vJLCg1dDoX3Gaodw', 'user_code': 'GHZK-FPFL', 'scope': '', 'user_denied': False, 'user_id': None}]
(Background on this error at: https://sqlalche.me/e/14/f405)

Probably the main issue is:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "device_code_attempts" does not exist

This should have been created here: https://github.com/argonne-lcf/balsam/blob/72a6e3d8d70759d1e4b746c8e1a75180ad303a7c/balsam/server/models/alembic/versions/f8fbad8262e3_initial.py#LL30C1-L30C1

Running the migration command with gunicorn as a prefix exits with status 1, but running it without the prefix seems to apply the migrations:

gunicorn balsam server migrate
root@flux-sample-services:/balsam# echo $?
1
root@flux-sample-services:/balsam# balsam server migrate
Running alembic migrations for postgresql://postgres:postgres@localhost:5432/balsam
INFO|balsam.util.postgres:137] Running DB migrations in /balsam/balsam/server/models/alembic
INFO|balsam.server.models.base:21] Creating DB engine: postgresql://postgres:postgres@localhost:5432/balsam
INFO|balsam.server.models.alembic.env:13] Alembic running migrations with DB engine: Engine(postgresql://postgres:***@localhost:5432/balsam)
Migrations complete!

That at least gets around the server error, but now there is another problem:

Logging into http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000
Logging into Balsam API at http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000
To proceed, please navigate to: https://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000/auth/oauth/ALCF/login/device?user_code=WCFR-LQRQ
Authenticate with your credentials then come back here!
Waiting for user log in...
WARNING|balsam.client.requests_client:102] Attempt retry (0 of 10) of connection: 400 Client Error: Bad Request
{'detail': 'authorization_pending'} for url: http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000/auth/device/token
WARNING|balsam.client.requests_client:102] Attempt retry (1 of 10) of connection: 400 Client Error: Bad Request
{'detail': 'authorization_pending'} for url: http://flux-sample-services.flux-service.flux-operator.svc.cluster.local:8000/auth/device/token

Would it be possible to have a headless mode? I don't see how this would work in an automated workflow without that. Thanks!

masalim2 commented 1 year ago

Hi! The Balsam login methods are configurable, and the client asks the server which method to use. The example configuration in .env.example enables multiple login flows, so your client is attempting the preferred one, the interactive device login flow:

export BALSAM_AUTH_LOGIN_METHODS='["password", "oauth_authcode", "oauth_device"]'

If you change that to just ["password"] and double-check that the server process is really picking up that setting (it should show up in the docker logs on server startup from this line), you should be able to sidestep setting up OAuth entirely!
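For reference, a minimal stdlib sketch for sanity-checking that value before starting the server. It assumes the setting is parsed as a JSON-encoded list (the usual convention for list-valued pydantic settings read from the environment), which is why the inner quotes must be double quotes. parse_login_methods is a hypothetical helper, not part of Balsam.

```python
import json
import os

# Login methods mentioned in this thread; anything else is treated as a typo here.
KNOWN_METHODS = {"password", "oauth_authcode", "oauth_device"}


def parse_login_methods(raw: str) -> list:
    """Parse a JSON-encoded list of login methods (hypothetical helper)."""
    methods = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(methods, list):
        raise ValueError("BALSAM_AUTH_LOGIN_METHODS must be a JSON list")
    unknown = set(methods) - KNOWN_METHODS
    if unknown:
        raise ValueError(f"unknown login methods: {sorted(unknown)}")
    return methods


if __name__ == "__main__":
    raw = os.environ.get("BALSAM_AUTH_LOGIN_METHODS", '["password"]')
    methods = parse_login_methods(raw)
    print(methods)
    if methods == ["password"]:
        print("Only password login is enabled; no OAuth setup needed.")
```

A bare export BALSAM_AUTH_LOGIN_METHODS=password would fail this check, because password on its own is not valid JSON.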

This login method should tell the balsam login CLI to simply prompt the user for a password. You could make the access token longer-lived by increasing this setting:

export BALSAM_AUTH_TOKEN_TTL=259200
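Assuming the TTL is given in seconds (an assumption here; check the server settings to confirm the unit), 259200 works out to three days. A quick stdlib check of that arithmetic, plus the value a hypothetical one-week TTL would need:

```python
from datetime import timedelta

# Assumption: BALSAM_AUTH_TOKEN_TTL is a number of seconds.
ttl = timedelta(seconds=259200)
print(ttl)  # 3 days, 0:00:00

# Seconds for a hypothetical one-week TTL.
week_seconds = int(timedelta(days=7).total_seconds())
print(week_seconds)  # 604800
```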

For a totally headless workflow that never requires re-authenticating users interactively, this is the relevant section of the login CLI: https://github.com/argonne-lcf/balsam/blob/72a6e3d8d70759d1e4b746c8e1a75180ad303a7c/balsam/cmdline/login.py#L39-L42

You might consider adding a new subclass of the password-based client, with the refresh_auth and interactive_login methods overridden to read the credentials from a file or environment variable instead of prompting, as is done here: https://github.com/argonne-lcf/balsam/blob/72a6e3d8d70759d1e4b746c8e1a75180ad303a7c/balsam/client/requests_password.py#L68-L74
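A minimal sketch of that subclass idea, runnable without Balsam installed. PasswordClientStandIn and its password_login method are hypothetical stand-ins for the real class in requests_password.py, whose constructor and helpers may differ. The BALSAM_USERNAME/BALSAM_PASSWORD variable names and the JSON credentials-file format are also assumptions; only the two overridden method names, refresh_auth and interactive_login, come from the comment above.

```python
import json
import os
from pathlib import Path
from typing import Dict, Optional, Tuple


class PasswordClientStandIn:
    """Hypothetical stand-in for Balsam's password-based client.

    The real class prompts on the terminal; here the prompting methods just
    raise, so the sketch runs without Balsam.
    """

    def password_login(self, username: str, password: str) -> Dict[str, str]:
        # Hypothetical: the real client would POST these to the server.
        return {"username": username, "token": f"token-for-{username}"}

    def interactive_login(self) -> Dict[str, str]:
        raise NotImplementedError("would prompt for a password")

    def refresh_auth(self) -> None:
        raise NotImplementedError("would prompt for a password")


def read_credentials(path: Optional[Path] = None) -> Tuple[str, str]:
    """Read credentials from env vars, falling back to a JSON file (assumed format)."""
    user = os.environ.get("BALSAM_USERNAME")
    password = os.environ.get("BALSAM_PASSWORD")
    if user and password:
        return user, password
    if path is not None and path.exists():
        data = json.loads(path.read_text())
        return data["username"], data["password"]
    raise RuntimeError("no headless credentials found")


class HeadlessPasswordClient(PasswordClientStandIn):
    """Never prompts: both overrides log in from stored credentials."""

    def __init__(self, credentials_file: Optional[Path] = None) -> None:
        self.credentials_file = credentials_file
        self.auth: Dict[str, str] = {}

    def interactive_login(self) -> Dict[str, str]:
        username, password = read_credentials(self.credentials_file)
        return self.password_login(username, password)

    def refresh_auth(self) -> None:
        # Re-authenticate silently, e.g. when the token has expired.
        self.auth = self.interactive_login()
```

In a real deployment the subclass would inherit from Balsam's own password client rather than the stand-in.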

You would then want to update the client_class key in ~/.balsam/client.yml to ensure that when balsam clients start up, they use the new headless-auth subclass.
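Assuming client_class holds a dotted import path (an assumption; check how your existing client.yml spells the default), the new subclass must be importable under that path wherever Balsam clients start. This generic importlib sketch, with a hypothetical helper name, is one way to check that a candidate value resolves before writing it into the file:

```python
import importlib


def resolve_client_class(dotted_path: str) -> type:
    """Import "package.module.ClassName" and return the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_name)  # ImportError if not importable
    return getattr(module, class_name)  # AttributeError if the class is missing
```

For example, resolve_client_class("my_headless_auth.HeadlessPasswordClient") should return the class if a hypothetical my_headless_auth module is on the Python path, and raise ImportError otherwise.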

vsoch commented 1 year ago

@masalim2 I'm working on this now. Is there a way to disable the SSL/HTTPS requirement?

vsoch commented 1 year ago

Oops, never mind: I was using the -u parameter incorrectly (for login it's the URL, and for register it's the username).

vsoch commented 1 year ago

@masalim2 should I also be making a new balsam/server/auth/headless_login.py? I was going to use password_login.py, but its endpoints still seem to have an OAuth flavor.

masalim2 commented 1 year ago

Hi @vsoch, sorry this slipped through the cracks. Despite the name of that form parameter, OAuth2PasswordRequestForm, the implementation in password_login.py is just a basic client-to-server HTTP POST with a username and password; no other OAuth flows are involved. The server checks the password hash locally, and there is no external authorization server involved. Hopefully that is sufficient!
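To illustrate what that plain POST looks like, here is a stdlib sketch that builds (but does not send) a form-encoded request of the kind OAuth2PasswordRequestForm parses: username and password fields in an application/x-www-form-urlencoded body. The /auth/password/login path and the helper name are hypothetical placeholders; check password_login.py for the actual route.

```python
import urllib.parse
import urllib.request


def build_password_login_request(
    api_root: str, username: str, password: str
) -> urllib.request.Request:
    """Build a form-encoded POST carrying a username and password."""
    url = api_root.rstrip("/") + "/auth/password/login"  # hypothetical path
    body = urllib.parse.urlencode({"username": username, "password": password}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Sending it with urllib.request.urlopen(req) would hit the server, which checks the password hash locally; no external authorization server is contacted.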