databricks / databricks-sql-python

Databricks SQL Connector for Python
Apache License 2.0

Slow engine creation - only when using FastAPI #422

Open jaames-bentley opened 3 months ago

jaames-bentley commented 3 months ago

Hello,

I am currently working on a FastAPI application which calls Databricks using databricks-sql-connector; however, the code slows down massively at the engine creation stage.

I've narrowed the issue down to .venv\Lib\site-packages\databricks\sql\thrift_api\TCLIService\ttypes.py: when running in debug mode, this script loads incredibly slowly (around 5-10 minutes). However, when the exact same code is run outside of a FastAPI function (i.e. in a standalone script, notebook, etc.), it runs almost instantly and the engine is created in under 1 second.
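For anyone wanting to measure the module-load cost in isolation, here is a minimal sketch that times a cold import. The stdlib json module stands in for the databricks path so the snippet runs anywhere; substitute the real module name when databricks-sql-connector is installed:

```python
import importlib
import sys
import time

def time_cold_import(module_name: str) -> float:
    """Time an import of module_name after evicting it from sys.modules."""
    # Drop the module and its submodules from the in-process cache so the
    # import is cold (the on-disk .pyc bytecode cache is still used).
    for name in list(sys.modules):
        if name == module_name or name.startswith(module_name + "."):
            del sys.modules[name]
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

# Substitute the suspect module when the connector is installed:
#   time_cold_import("databricks.sql.thrift_api.TCLIService.ttypes")
print(f"json: {time_cold_import('json'):.4f}s")
```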

The details of my test application are:

main.py:

from fastapi import FastAPI
from fastapi.responses import RedirectResponse
import uvicorn
from routes import test_route

app = FastAPI(
     title="TestAPI",
     version="1.0",
     description="Test API",
     )

@app.get("/", include_in_schema=False)
async def docs_redirect():
     return RedirectResponse(url='/docs')

app.include_router(test_route.router)

if __name__ == '__main__':
     uvicorn.run('main:app', host='0.0.0.0', port=8000)

test_route.py:

import fastapi
from fastapi import APIRouter

from sqlalchemy import create_engine
from sqlalchemy import text

router = APIRouter(tags=["Test"])

@router.post("/test_route", description="Test")
async def test():
    import time

    print("Connecting to engine....")
    startT = time.time()
    engine = create_engine(
        url="databricks://token:<<TOKEN>>@<<HOST>>?http_path=/sql/1.0/warehouses/<<WAREHOUSE>>&catalog=<<CATALOG>>&schema=<<SCHEMA>>"
    )
    print("Engine Connected!")
    print(f"Time taken: {time.time() - startT} seconds.")

    with engine.connect() as connection:
        result = connection.execute(text("SELECT * from range(10)"))
        for row in result:
            print(row)

    print(f"Execution time: {time.time() - startT} seconds.")

    return "Ok."

requirements.txt:

fastapi==0.111.1
uvicorn==0.30.3
databricks-sql-connector[SQLAlchemy]==3.3.0

The same issue occurs when using:

from databricks import sql

connection = sql.connect(
    server_hostname=<HOST>,
    http_path=<HTTP>,
    access_token=<TOKEN>,
)

cursor = connection.cursor()

cursor.execute("SELECT * from range(10)")
print(cursor.fetchall())

cursor.close()
connection.close()

but since create_engine ultimately uses the same connector under the hood, I was trying to simplify my test case.
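Separately from the import slowdown, the test handler above builds a new engine inside the request function, paying the full cost on every call. A common pattern is to build the engine once and reuse it across requests; a minimal sketch using functools.lru_cache, with a stand-in factory in place of the real create_engine call:

```python
from functools import lru_cache

def build_engine():
    """Stand-in for sqlalchemy.create_engine(...); assume it is expensive."""
    return object()

@lru_cache(maxsize=1)
def get_engine():
    # The first call builds the engine; later calls return the cached one.
    return build_engine()

assert get_engine() is get_engine()  # built exactly once, then reused
```

In a FastAPI app the same effect can be had by creating the engine at startup (e.g. in a lifespan handler) rather than per request.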

As mentioned, the call stack points to .venv\Lib\site-packages\databricks\sql\thrift_api\TCLIService\ttypes.py as taking all of this extra time. The delay can be eliminated by editing ttypes.py directly and removing every line that begins with None, #. However, this directly violates the warning at the top of ttypes.py about not editing the file. Removing these lines reduces the file from ~100,000 lines to ~10,000 and completely resolves the delay when using FastAPI.

So my main questions are:

1. Why does this delay in engine creation only occur in a FastAPI app (is it something to do with it being asynchronous)?
2. Can ttypes.py be safely edited to remove all "None" rows, or is this not a viable solution?

Thanks!

jaames-bentley commented 3 months ago

Feeling like a bit of a numpty - after a day of debugging, I found the issue is also solved by using a different Python version.

Version originally used: 3.12.4
Version which fixes the issue: 3.11.9

I'll leave this issue up as the docs do say all versions of Python 3.8+ are supported, but feel free to close if no fix / update is planned.
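For anyone pinning the interpreter as a workaround, a defensive check at startup can make the constraint visible. This is a sketch only; the 3.12-is-slow observation comes from this thread, not from any official requirement:

```python
import sys
import warnings

def warn_if_slow_python(version_info=sys.version_info):
    """Warn on Python 3.12, where the slow ttypes.py import was observed."""
    if version_info[:2] == (3, 12):
        warnings.warn(
            "databricks-sql-connector import was observed to be very slow "
            "under Python 3.12 in FastAPI/uvicorn apps; "
            "Python 3.11.x avoided it here.",
            RuntimeWarning,
            stacklevel=2,
        )

warn_if_slow_python()  # call once at application startup
```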

kravets-levko commented 3 months ago

Hi @jaames-bentley! Thank you for reporting this issue (and for partially figuring it out while I wasn't around)!

Why does this delay in engine creation only occur in a FastAPI app (is it something to do with it being asynchronous?)

TBH, I don't know. There are many things that work differently in FastAPI/uvicorn compared to a WSGI server or CLI apps, and I'm not deeply familiar with all the differences. It may be some kind of cache miss (Python should cache .pyc files after the first import and then reuse them, but if your app spawns background workers, they may conflict at this step). It may be something caused by the async features of Python itself, or by uvloop, or even both.
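On the .pyc-cache theory: one quick diagnostic is to compute where Python expects the cached bytecode for the module and check whether it exists. A sketch, with the stdlib json module standing in for the ttypes path:

```python
import importlib.util
import os

# Substitute "databricks.sql.thrift_api.TCLIService.ttypes" when installed.
spec = importlib.util.find_spec("json")
pyc_path = importlib.util.cache_from_source(spec.origin)
print(pyc_path, "->", "exists" if os.path.exists(pyc_path) else "missing")
```

If the .pyc is missing (or keeps being regenerated) under uvicorn but present when running the standalone script, that would point at a bytecode-cache problem rather than an async one.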

It would be perfect if you could create an example that stably reproduces the issue (e.g. a Dockerfile). Or, at least, please share more details about your OS and Python version, other Python packages installed, and how exactly you run your app (maybe there are some command-line switches passed to the Python interpreter which change its behavior) - for both of your environments.

Can ttypes.py be safely edited to remove all "None" rows or is this not a viable solution?

No, generally this file shouldn't be edited. Most of those None values are used in some way by the thrift package, and thus should remain available. There are probably some structures we don't use in our library, but it's hardly possible to figure out which.

susodapop commented 2 months ago

I'm a bit late to the party here, but this is an exact duplicate of the behaviour reported in https://github.com/databricks/databricks-sql-python/issues/369#issuecomment-2000352199 and does appear to be an issue with certain other libraries that have not been updated for Python 3.12.