PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

2.0b11 throws SQLAlchemy migration error when deploying (409 conflict error after db upgrade) #6046

Closed: caleb-recursion closed this issue 2 years ago

caleb-recursion commented 2 years ago

I am using the Python API to submit a Deployment. The following migration error is thrown:

Traceback (most recent call last):
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 100, in execute
    self._adapt_connection._handle_exception(error)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 228, in _handle_exception
    raise error
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 82, in execute
    self.await_(_cursor.execute(operation, parameters))
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 68, in await_only
    return current.driver.switch(awaitable)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 121, in greenlet_spawn
    value = await result
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/aiosqlite/cursor.py", line 37, in execute
    await self._execute(self._cursor.execute, sql, parameters)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/aiosqlite/cursor.py", line 31, in _execute
    return await self._conn._execute(fn, *args, **kwargs)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/aiosqlite/core.py", line 129, in _execute
    return await future
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/aiosqlite/core.py", line 102, in run
    result = function()
sqlite3.OperationalError: no such table: flow_run_alert_policy

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "prefect_submit.py", line 197, in <module>
    app.run(run)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "prefect_submit.py", line 193, in run
    asyncio.run(main(script_name, ecr_docker_image_uri))
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "prefect_submit.py", line 180, in main
    deployment_id = await spec.create(client=client)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/client.py", line 105, in with_injected_client
    return await fn(*args, **kwargs)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/deployments.py", line 226, in create
    infrastructure_document_id = await infrastructure._save(is_anonymous=True)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/blocks/core.py", line 555, in _save
    async with prefect.client.get_client() as client:
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/client.py", line 1898, in __aenter__
    self._ephemeral_lifespan = await self._exit_stack.enter_async_context(
  File "/usr/lib/python3.8/contextlib.py", line 568, in enter_async_context
    result = await _cm_type.__aenter__(cm)
  File "/usr/lib/python3.8/contextlib.py", line 171, in __aenter__
    return await self.gen.__anext__()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/client.py", line 184, in app_lifespan_context
    await context.__aenter__()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/asgi_lifespan/_manager.py", line 92, in __aenter__
    await self._exit_stack.aclose()
  File "/usr/lib/python3.8/contextlib.py", line 621, in aclose
    await self.__aexit__(None, None, None)
  File "/usr/lib/python3.8/contextlib.py", line 679, in __aexit__
    raise exc_details[1]
  File "/usr/lib/python3.8/contextlib.py", line 662, in __aexit__
    cb_suppress = await cb(*exc_details)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/asgi_lifespan/_concurrency/asyncio.py", line 80, in __aexit__
    await self.task
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/asgi_lifespan/_manager.py", line 90, in __aenter__
    await self.startup()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/asgi_lifespan/_manager.py", line 36, in startup
    raise self._app_exception
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/asgi_lifespan/_concurrency/asyncio.py", line 63, in run_and_silence_cancelled
    await self.coroutine()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/asgi_lifespan/_manager.py", line 64, in run_app
    await self.app(scope, self.receive, self.send)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/fastapi/applications.py", line 269, in __call__
    await super().__call__(scope, receive, send)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/starlette/middleware/errors.py", line 149, in __call__
    await self.app(scope, receive, send)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/starlette/middleware/cors.py", line 76, in __call__
    await self.app(scope, receive, send)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/starlette/exceptions.py", line 69, in __call__
    await self.app(scope, receive, send)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/starlette/routing.py", line 659, in __call__
    await self.lifespan(scope, receive, send)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/starlette/routing.py", line 635, in lifespan
    async with self.lifespan_context(app):
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/starlette/routing.py", line 530, in __aenter__
    await self._router.startup()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/starlette/routing.py", line 612, in startup
    await handler()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/orion/api/server.py", line 266, in run_migrations
    await db.create_db()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/orion/database/interface.py", line 53, in create_db
    await self.run_migrations_upgrade()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/orion/database/interface.py", line 61, in run_migrations_upgrade
    await run_sync_in_worker_thread(alembic_upgrade)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 56, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(call, cancellable=True)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/orion/database/alembic_commands.py", line 29, in alembic_upgrade
    alembic.command.upgrade(alembic_config(), revision, sql=dry_run)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/command.py", line 322, in upgrade
    script.run_env()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/script/base.py", line 569, in run_env
    util.load_python_file(self.dir, "env.py")
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 94, in load_python_file
    module = load_module_py(module_id, path)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 110, in load_module_py
    spec.loader.exec_module(module)  # type: ignore
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/orion/database/migrations/env.py", line 98, in <module>
    apply_migrations()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 189, in wrapper
    return run_async_from_worker_thread(async_fn, *args, **kwargs)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 136, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/anyio/from_thread.py", line 49, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
    return f.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/orion/database/migrations/env.py", line 92, in apply_migrations
    await connection.run_sync(do_run_migrations)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/ext/asyncio/engine.py", line 546, in run_sync
    return await greenlet_spawn(fn, conn, *arg, **kw)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 128, in greenlet_spawn
    result = context.switch(value)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/orion/database/migrations/env.py", line 80, in do_run_migrations
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/runtime/environment.py", line 853, in run_migrations
    self.get_context().run_migrations(**kw)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/runtime/migration.py", line 623, in run_migrations
    step.migration_fn(**kw)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/orion/database/migrations/versions/sqlite/2022_05_30_100855_d76326ed0d06_rename_run_alerts_to_run_notifications.py", line 19, in upgrade
    op.rename_table("flow_run_alert_policy", "flow_run_notification_policy")
  File "<string>", line 8, in rename_table
  File "<string>", line 3, in rename_table
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/operations/ops.py", line 1396, in rename_table
    return operations.invoke(op)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/operations/base.py", line 394, in invoke
    return fn(self, operation)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/operations/toimpl.py", line 122, in rename_table
    operations.impl.rename_table(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/ddl/impl.py", line 346, in rename_table
    self._exec(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/alembic/ddl/impl.py", line 195, in _exec
    return conn.execute(construct, multiparams)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/future/engine.py", line 280, in execute
    return self._execute_20(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/sql/ddl.py", line 80, in _execute_on_connection
    return connection._execute_ddl(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1398, in _execute_ddl
    ret = self._execute_context(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 100, in execute
    self._adapt_connection._handle_exception(error)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 228, in _handle_exception
    raise error
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 82, in execute
    self.await_(_cursor.execute(operation, parameters))
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 68, in await_only
    return current.driver.switch(awaitable)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 121, in greenlet_spawn
    value = await result
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/aiosqlite/cursor.py", line 37, in execute
    await self._execute(self._cursor.execute, sql, parameters)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/aiosqlite/cursor.py", line 31, in _execute
    return await self._conn._execute(fn, *args, **kwargs)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/aiosqlite/core.py", line 129, in _execute
    return await future
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/aiosqlite/core.py", line 102, in run
    result = function()
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: flow_run_alert_policy
[SQL: ALTER TABLE flow_run_alert_policy RENAME TO flow_run_notification_policy]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

caleb-recursion commented 2 years ago

I got past the above error by manually deleting the ~/.prefect/orion.db file and running prefect orion database reset. In shell terms, that workaround was roughly:
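
rm ~/.prefect/orion.db
prefect orion database reset

Now deployment creation is throwing a 409 conflict error: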


Traceback (most recent call last):
  File "prefect_submit.py", line 197, in <module>
    app.run(run)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "prefect_submit.py", line 193, in run
    asyncio.run(main(script_name, ecr_docker_image_uri))
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "prefect_submit.py", line 180, in main
    deployment_id = await spec.create(client=client)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/client.py", line 105, in with_injected_client
    return await fn(*args, **kwargs)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/deployments.py", line 233, in create
    return await client.create_deployment(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/client.py", line 1271, in create_deployment
    response = await self._client.post(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/httpx/_client.py", line 1842, in post
    return await self.request(
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/httpx/_client.py", line 1527, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/client.py", line 278, in send
    response.raise_for_status()
  File "/home/rxrx/git/mousera2/build/venv/lib/python3.8/site-packages/prefect/client.py", line 224, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Client error '409 Conflict' for url 'https://api-beta.prefect.io/api/accounts/cb6c3dbf-c8d9-4435-ab0d-08783d7d1dc3/workspaces/f0694c09-fb21-4f56-a735-17a25f7d9e56/deployments/'
Response: {'detail': 'Data integrity conflict. This usually means a unique or foreign key constraint was violated. See server logs for details.'}
For more information check: https://httpstatuses.com/409

Everything is on 2.0b11.

caleb-recursion commented 2 years ago

I tried deleting and recreating the workspace in the cloud with no luck.

abrookins commented 2 years ago

Hi Caleb! Thanks for reporting this. We're currently making a lot of changes, and some of them are breaking changes, so "pardon our dust." It won't last much longer!

I couldn't reproduce that error with a simple deployment and flow I created for this purpose, so I'm probably not hitting the same code path you did. Would you mind sharing an example flow and deployment that reproduce the error?

caleb-recursion commented 2 years ago

Hi @abrookins. Thanks for getting back to me. I understand you're making changes. Is beta.prefect.io only supporting the newest version? Is there generally meant to be backward compatibility or not?

I'll paste a couple of important files; you'd need our build system and a Kubernetes cluster to run them, which are harder to share, but maybe this is enough. The main script, prefect_submit.py, is called on a script containing a flow, like sample_flow.py below. The flow is packaged in a Docker container with all of its dependencies and pushed to an ECR repo, the Prefect client is created dynamically, and the deployment is submitted via the Python API. Other notes: there is a utils script that sets up the k8s YAMLs, and a shim Packager class that mimics your Docker packager, but rather than packaging, it just points to the location of the script in the existing container. The 409 seems to be raised during deployment creation. I'm using the newest Deployment class.

prefect_submit.py

# Workaround (explained in a later comment): Prefect fails on symlinked flow
# files when bundling a Deployment, so patch to_display_path to a plain str().
import prefect.utilities.filesystem
prefect.utilities.filesystem.to_display_path.__code__ = (lambda x: str(x)).__code__

import asyncio
import os
import re

import prefect
from absl import app, flags
from common import log
from common.deploy import package, utils
from prefect.cli.cloud import build_url_from_workspace, get_cloud_client
from prefect.deployments import Deployment, FlowScript
from prefect.exceptions import ObjectAlreadyExists
from prefect.flow_runners import KubernetesFlowRunner
from prefect.packaging.base import Packager
from prefect.packaging.docker import DockerPackager
from pydantic import validator

FLAGS = flags.FLAGS  
flags.DEFINE_boolean('gpu', True, 'Run the flow with a GPU image')

JOB_RESTART_POLICY = "Never"
IMAGE_PULL_POLICY = "Always"

def is_valid_target_name(target_name):
    mousera2_build_root = os.environ['MOUSERA2_ROOT']
    with open(f'{mousera2_build_root}/CMakeCache.txt', 'r') as cmakecache:
        log.check_notnone(cmakecache)
        is_valid = False
        for line in cmakecache:
            if target_name in line:
                is_valid = True
                break
    return is_valid

def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

def infer_cmake_target_from_filename(python_file):
    """
    Args:
        python_file : the path to the python file containing the prefect flow program
    Returns:
        The dotted cmake target name (e.g. storage.accounting.experiment_cost_report)
    Current heuristics, in order, are:
        1. Check if there is a cmake target with the same name as the python file (minus the .py extension).
    """
    mousera2_source_root = os.environ['MOUSERA2_SOURCE_ROOT']
    full_path = os.path.abspath(python_file)
    filename = os.path.basename(full_path)
    filename = filename.replace('.py', '')
    filepath = os.path.relpath(os.path.dirname(full_path), mousera2_source_root)
    filepath = filepath.replace('/', '.')
    target_name = f'{filepath}.{filename}'

    target_name = remove_prefix(target_name, 'build.')

    if not is_valid_target_name(target_name):
        target_name = None

    return target_name

def build_docker_env(script_target_name):
    mousera2_build_root = os.environ['MOUSERA2_ROOT']
    log.check_notnone(mousera2_build_root)
    base_image = get_worker_image(gpu=FLAGS.gpu)
    print("Building flow with base image: ", base_image)
    # NOTE: Prefect images must have an undefined entrypoint... this is done by setting entry=None below
    image_name, image_tag = package.package_in_docker(script_target_name, entry=None, base_image=base_image)
    uris = utils.upload(image_name, image_tag, image_tag, ['ecr'])
    uri = None
    if uris:
        uri = uris[0]
    return uri

async def set_workspace(api_key: str):
    cloud = get_cloud_client(api_key=api_key)
    workspaces = await cloud.read_workspaces()
    if len(workspaces) == 0:
        raise RuntimeError("No workspaces in prefect cloud, need to create a workspace at beta.prefect.io .")
    workspace = workspaces[0]
    print(f"WARNING, defaulting to workspace '{workspace['workspace_handle']}'."
          "The current version of beta.prefect.io only allows a single workspace. "
          "This will need to be changed in the future.")
    return build_url_from_workspace(workspace)

async def initialize_client():
    api_key = os.environ.get("PREFECT2_API_KEY")
    workspace_url = await set_workspace(api_key)
    client = prefect.client.OrionClient(api=workspace_url, api_key=api_key)
    try:
        await client.create_work_queue("kubernetes")
        print("Created work queue")
    except ObjectAlreadyExists:
        print("Work queue already exists")
    return client

class Packager(DockerPackager):

    image_reference: str
    image_flow_location: str

    @validator('image_reference')
    def set_image_ref(cls, v):
        return v

    @validator('image_flow_location')
    def set_flow_location(cls, v):
        return v

    def __new__(cls, *args, **kwargs):
        return super().__new__(cls)

    async def package(self, flow):
        return self.base_manifest(flow).finalize(image=self.image_reference, 
        image_flow_location=self.image_flow_location)

def get_worker_image(gpu=True):
    mousera2_source_root = os.environ['MOUSERA2_SOURCE_ROOT']
    cur_base_image_record = f'{mousera2_source_root}/ops/images/docker/runtime_{"gpu" if gpu else "cpu"}.txt'
    with open(cur_base_image_record, 'r') as f:
        base_image = f.read().strip()
    return base_image

def create_deployment_spec(flow_script, flow_image):
    RUNNER_ENV = {"WORKER_IMAGE": flow_image,
                  "USE_GPU": FLAGS.gpu,
                  "PREFECT_LOGGING_LEVEL": "DEBUG",
                  "DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": True,
                  "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": 5}
    print("FLOW SCRIPT", flow_script)
    return Deployment(
        flow=FlowScript(path=flow_script),
        flow_runner=KubernetesFlowRunner(image_pull_policy=IMAGE_PULL_POLICY,
                                         env=RUNNER_ENV),
        packager=Packager(image_reference=flow_image,
                          image_flow_location=f"/root/app/tools/eks/flows/{flow_script}")
    )

def get_account_id(client):
    account_id = re.search('accounts/(.*)/workspaces', str(client.api_url))
    if account_id:
        return account_id.group(1)
    else:
        raise NotImplementedError("Extracting user account ID failed; "
                                  "Prefect has changed something in their API, need to update this.")

def get_workspace_id(client):
    workspace_id = re.search('workspaces/(.*)/', str(client.api_url))
    if workspace_id:
        return workspace_id.group(1)
    else:
        raise NotImplementedError("Extracting workspace ID failed; "
                                  "Prefect has changed something in their API, need to update this.")

def format_flow_url(account_id, workspace_id, flow_run_id):
    return f"https://beta.prefect.io/account/{account_id}/workspace/{workspace_id}/flow-run/{flow_run_id}"

async def main(input_script: str, image_uri: str):
    client = await initialize_client()
    spec = create_deployment_spec(input_script, image_uri)
    # await spec.validate(client=client)
    deployment_id = await spec.create(client=client)
    run_info = await client.create_flow_run_from_deployment(deployment_id)
    account_id = get_account_id(client)
    workspace_id = get_workspace_id(client)
    flow_run_url = format_flow_url(account_id, workspace_id, run_info.state.state_details.flow_run_id)
    print(f"Your flow '{run_info.name}' is running at: ")
    print(flow_run_url)

def run(argv):
    script_name = argv[1]
    prefect_script_target_name = infer_cmake_target_from_filename(script_name)
    ecr_docker_image_uri = build_docker_env(prefect_script_target_name)
    asyncio.run(main(script_name, ecr_docker_image_uri))

if __name__ == "__main__":
    app.run(run)

flow_utils.py

import os
from dataclasses import dataclass

import yaml
from prefect.task_runners import BaseTaskRunner, ConcurrentTaskRunner
from prefect_dask import DaskTaskRunner

@dataclass
class ResourceConfig:
    always_pull_image: bool = False
    max_workers: int = 1
    min_workers: int = 1
    cpus_per_worker: int = 1
    threads_per_worker: int = 1
    memory_per_worker: str = "1G"
    gpus_per_worker: int = 1

def generate_task_runner(config: ResourceConfig) -> BaseTaskRunner:
    try:
        worker_image = os.environ['WORKER_IMAGE']
    except KeyError:
        return ConcurrentTaskRunner()
    gpu_limit = 0
    worker_name = "dask-worker"
    if os.environ["USE_GPU"].lower() in ('true', '1', 't'):
        worker_name = "dask-cuda-worker"
        gpu_limit = config.gpus_per_worker

    cmd = f"[{worker_name}, $(DASK_SCHEDULER_ADDRESS), --nthreads, '{config.threads_per_worker}', --no-dashboard, --memory-limit, {config.memory_per_worker}B]"

    image_pull_policy = "Always" if config.always_pull_image else "IfNotPresent"

    worker_spec_yaml = f"""
    kind: Pod
    spec:
      automountServiceAccountToken: false
      restartPolicy: OnFailure
      containers:
      - image: {worker_image}
        imagePullPolicy: {image_pull_policy}
        args: {cmd}
        name: "{worker_name}"
        resources:
          limits:
            cpu: "{config.cpus_per_worker}"
            memory: {config.memory_per_worker}
            nvidia.com/gpu: {gpu_limit}
          requests:
            cpu: "{config.cpus_per_worker}"
            memory: {config.memory_per_worker}
            nvidia.com/gpu: {gpu_limit}
    """

    scheduler_spec_yaml = f"""
    kind: Pod
    spec:
      automountServiceAccountToken: false
      restartPolicy: OnFailure
      containers:
      - image: {worker_image}
        imagePullPolicy: {image_pull_policy}
        args: [dask-scheduler]
        name: "dask-scheduler"
        resources:
          limits:
            cpu: "1"
            memory: 1G
          requests:
            cpu: "0.5"
            memory: 1G
    """

    task_runner = DaskTaskRunner(cluster_class="dask_kubernetes.KubeCluster",
                                 cluster_kwargs={
                                     "pod_template": yaml.safe_load(worker_spec_yaml),
                                     "scheduler_pod_template": yaml.safe_load(scheduler_spec_yaml)
                                 },
                                 adapt_kwargs={"maximum": config.max_workers, "minimum": config.min_workers})
    return task_runner

sample_flow.py

import time

from prefect import flow, task

import tools.eks.flows.flow_utils as flu

resources = flu.ResourceConfig(
    max_workers=10,
    min_workers=2,
    cpus_per_worker=2,
    memory_per_worker="14G"
)

task_runner = flu.generate_task_runner(resources)

"""
A Task is a single unit of work: one per worker with the above config.
"""
@task(retries=3)
def gpu_work(task_id):
    import torch
    matrix_size = (100, 100)
    device_name = "cuda" if torch.cuda.is_available() else "CPU"
    device = torch.device(device_name)
    time.sleep(10)
    a = torch.rand(matrix_size, device=device)
    b = torch.rand(matrix_size, device=device)
    c = torch.matmul(a, b)
    print(f"TASK DEVICE: {device_name}")
    return task_id, device_name

"""
A Flow is a higher-level orchestration of work. Flows can have tasks and subflows.
"""
@flow(task_runner=task_runner)
def gpu_flow():
    try:
        TASK_COUNT = 1000
        print("STARTING FLOW", flush=True)
        results = []
        for task_id in range(TASK_COUNT):
            results.append(gpu_work(task_id))
        start = time.time()
        for result in results:
            print(result.result(), flush=True)
        duration = time.time() - start
        print(f"All done, took: {duration}", flush=True)
        return 'ok'
    except Exception as e:
        print("EXCEPTION", e)

caleb-recursion commented 2 years ago

Also note this piece of code at the top of prefect_submit.py

import prefect.utilities.filesystem
prefect.utilities.filesystem.to_display_path.__code__ = (lambda x: str(x)).__code__

It's required because Prefect died on symlinked flows when bundling them as a Deployment. Unrelated, just a small issue.

caleb-recursion commented 2 years ago

I updated to 2.0b12 and changed the flow_runner to the KubernetesJob concept; same 409 conflict error.

abrookins commented 2 years ago

Thanks @caleb-recursion. What I need most is the Deployment spec you created that matches our latest version, with infrastructure instead of flow runners -- can you share that?

caleb-recursion commented 2 years ago

def base_job_manifest(envs):
    """Produces the bare minimum allowed Job manifest"""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"labels": {}},
        "spec": {
            "template": {
                "spec": {
                    "parallelism": 1,
                    "completions": 1,
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "prefect-job",
                            "env": [{"name": name, "value": value} for name, value in envs.items()],
                        }
                    ],
                }
            }
        },
    }

def create_deployment_spec(flow_script, flow_image):
    RUNNER_ENV = {"WORKER_IMAGE": flow_image,
                  "USE_GPU": FLAGS.gpu,
                  "PREFECT_LOGGING_LEVEL": "DEBUG",
                  "DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": True,
                  "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": 5}
    return Deployment(
        name=f"{flow_script.split('.')[0]}-{time.time()}",
        flow=FlowScript(path=flow_script),
        infrastructure=KubernetesJob(image_pull_policy=IMAGE_PULL_POLICY,
                                     job=base_job_manifest(RUNNER_ENV)),
        packager=Packager(image_reference=flow_image,
                          image_flow_location=f"/root/app/tools/eks/flows/{flow_script}")
    )

abrookins commented 2 years ago

Thanks, @caleb-recursion -- this is useful, but what I meant was the Prefect Deployment definition that you used: the equivalent (and exact) code you're now using in place of this code you shared previously:

def create_deployment_spec(flow_script, flow_image):
    RUNNER_ENV = {"WORKER_IMAGE": flow_image,
                  "USE_GPU": FLAGS.gpu,
                  "PREFECT_LOGGING_LEVEL": "DEBUG",
                  "DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": True,
                  "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": 5}
    print("FLOW SCRIPT", flow_script)
    return Deployment(
        flow=FlowScript(path=flow_script),
        flow_runner=KubernetesFlowRunner(image_pull_policy=IMAGE_PULL_POLICY,
                                         env=RUNNER_ENV),
        packager=Packager(image_reference=flow_image,
                          image_flow_location=f"/root/app/tools/eks/flows/{flow_script}")
    )

caleb-recursion commented 2 years ago

That's what I just shared ^, unless you want the full file again?

zangell44 commented 2 years ago

@caleb-recursion I think the root cause here is attempting to create a deployment using a custom OrionClient instance. There's a lot of logic for managing API keys / URLs that is ignored if you instantiate your own client. I can reproduce the specific issue here.

You're using a custom OrionClient object to create a deployment. As part of that creation, Prefect saves the corresponding infrastructure block. Saving a block triggers the instantiation of a Prefect client using default logic from the environment (https://github.com/PrefectHQ/orion/blob/main/src/prefect/blocks/core.py#L555). Based on the code, I don't think PREFECT_API_KEY or PREFECT_API_URL have been set for this to work correctly (e.g. you're using PREFECT2_API_KEY instead). So what happens is: the block is saved through a default client that isn't pointed at your Cloud workspace, and the deployment you then create in that workspace references a block document the server has never seen, which violates a foreign key constraint and surfaces as the 409.

A simple way to reproduce this:

import asyncio
from prefect.client import OrionClient
from prefect import flow, Deployment

@flow
def foo(x: int = 1):
    pass

d1 = Deployment(flow=foo, name="test")

async def main():
    # instantiate an orion client using custom api url and api key
    # note this is NOT THE SAME AS SET IN THE CURRENT PROFILE
    #
    # the current profile is empty
    client = OrionClient(
        api="<CORRECT API URL FOR CLOUD WORKSPACE>",
        api_key="<A WORKING API KEY>",
    )

    await d1.create()
    print("This one works fine")

    await d1.create(client=client)
    print("This one has an error")

if __name__ == "__main__":
    asyncio.run(main())

zach@Zachs-MacBook-Pro ~/p/orion (sketch-out-or-filters)> prefect profile use tmp && prefect profile inspect tmp
Profile 'tmp' now active.
Profile 'tmp' is empty.
zach@Zachs-MacBook-Pro ~/p/orion > python error.py
This one works fine
Traceback (most recent call last):
  File "/Users/zach/prefect/orion/error.py", line 32, in <module>
    asyncio.run(main())
  File "/opt/homebrew/Caskroom/miniconda/base/envs/py39orion/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/py39orion/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/Users/zach/prefect/orion/user_error.py", line 27, in main
    await d1.create(client=client)
  File "/Users/zach/prefect/orion/src/prefect/client.py", line 104, in with_injected_client
    return await fn(*args, **kwargs)
  File "/Users/zach/prefect/orion/src/prefect/deployments.py", line 216, in create
    return await client.create_deployment(
  File "/Users/zach/prefect/orion/src/prefect/client.py", line 1263, in create_deployment
    response = await self._client.post(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/py39orion/lib/python3.9/site-packages/httpx/_client.py", line 1820, in post
    return await self.request(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/py39orion/lib/python3.9/site-packages/httpx/_client.py", line 1506, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/Users/zach/prefect/orion/src/prefect/client.py", line 277, in send
    response.raise_for_status()
  File "/Users/zach/prefect/orion/src/prefect/client.py", line 223, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Client error '409 Conflict' for url 'https://api-beta.prefect.io/api/accounts/<FOO>/workspaces/<FOO>/deployments/'
Response: {'detail': 'Data integrity conflict. This usually means a unique or foreign key constraint was violated. See server logs for details.'}
For more information check: https://httpstatuses.com/409

In general, I would recommend avoiding custom use of OrionClient and using profiles to manage API configuration.
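
For example, something like this (a sketch; prefect config set writes settings to the currently active profile, and the placeholder values are just illustrative):

prefect config set PREFECT_API_URL="https://api-beta.prefect.io/api/accounts/<ACCOUNT_ID>/workspaces/<WORKSPACE_ID>"
prefect config set PREFECT_API_KEY="<YOUR_API_KEY>"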

caleb-recursion commented 2 years ago

Thanks for explaining. Is it possible to set the API key and workspace URL via a Python interface, whether through a client or a profile?

caleb-recursion commented 2 years ago

I worked around it by calling the prefect login command in a subprocess. Thanks for finding the error.
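
For reference, a minimal sketch of that workaround (assuming prefect cloud login accepts a --key flag as in current releases, and reusing the PREFECT2_API_KEY env var from the scripts above):

import os
import subprocess

# Hypothetical sketch: shell out to the CLI so the active profile picks up
# the right API URL and key before any deployment code runs.
subprocess.run(
    ["prefect", "cloud", "login", "--key", os.environ["PREFECT2_API_KEY"]],
    check=True,
)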

zangell44 commented 2 years ago

Nice!

For setting within Python, you can also use the temporary_settings context manager:

https://orion-docs.prefect.io/api-ref/prefect/settings/#prefect.settings.temporary_settings
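
A minimal sketch of that approach, adapted from the reproduction above (placeholder URL and key values; assumes the updates= mapping form shown in the linked docs):

import asyncio

from prefect import flow, Deployment
from prefect.settings import PREFECT_API_KEY, PREFECT_API_URL, temporary_settings

@flow
def foo(x: int = 1):
    pass

async def main():
    # Settings set here apply to every client Prefect creates inside the
    # block, including the implicit one used to save infrastructure blocks.
    with temporary_settings(
        updates={
            PREFECT_API_URL: "<correct api url for cloud workspace>",
            PREFECT_API_KEY: "<a working api key>",
        }
    ):
        await Deployment(flow=foo, name="test").create()

if __name__ == "__main__":
    asyncio.run(main())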

zanieb commented 2 years ago

We'll definitely spend some time investigating ways to ensure the client is used consistently so things like this won't happen.