databricks / databricks-sql-python

Databricks SQL Connector for Python

Incompatible with AWS Lambda due to >250MB in dependencies #143

Open icj217 opened 1 year ago

icj217 commented 1 year ago

Has anyone else attempted to use this package in an AWS lambda function, only to find that this package and its dependencies result in a deployment package larger than the AWS limit of 250MB?

It seems like pandas + pyarrow + numpy are the main culprits of this.

❯ python -m virtualenv .venv
...
❯ source .venv/bin/activate
...
❯ python3 -m pip install 'databricks-sql-connector==2.6.0'
...
❯ du -sh .venv/lib/python3.8/site-packages
297M    .venv/lib/python3.8/site-packages
susodapop commented 1 year ago

You're not alone in this. We would like to prune the overall package size. It should be easy enough to remove the pandas dependency. But we depend on arrow, which also requires numpy. It's not clear to me how we can work around that.

cdonate commented 1 year ago

I know this is not a solution, but I also went through this rabbit hole while trying to make it work with Lambdas.

One solution is using Docker images on Lambda: https://docs.aws.amazon.com/lambda/latest/dg/images-create.html

Keep in mind that the larger the image, the longer the cold start will be on the Lambda, which kinda defeats the purpose, but it is a valid solution for running code that bundles to more than 250 MB.

Ended up going with Docker on ECS, but you can make it work on Lambda.

rpanman-sonatype commented 1 year ago

Just coming across this same issue now. I'd hoped that there was a public Lambda layer built with all this in it already (which is how I get round this with Datadog) but seems not 😢

Looks like ECS for me...

cdonate commented 1 year ago

> Just coming across this same issue now. I'd hoped that there was a public Lambda layer built with all this in it already (which is how I get round this with Datadog) but seems not 😢
>
> Looks like ECS for me...

The 250 MB limit includes Lambda layers, so it's either a Docker image on Lambda or ECS.

susodapop commented 11 months ago

Hey all, just writing to let you know that this is now an area of focus for us in engineering. We're working up the design changes to get our default install size down to something much more reasonable (hoping for <60 MB, perhaps smaller).

MichaelAnckaert commented 8 months ago

Just wanted to see if this is still a focus on the Databricks end?

We currently have a large number of ETL pipelines where part of the processing happens in AWS Lambda. The fact that this package's installed size is too large prevents us from writing to our Delta Lake from AWS Lambda.

The workarounds (such as using a Docker image) are viable but very annoying and inefficient. Our build process is twice as long and much more complex compared to using a 'simple' AWS Lambda function.

susodapop commented 8 months ago

I'm not working on this connector anymore. But I've learned some things since I wrote this comment which I might as well record here, in case they help someone:

This connector fetches results from Databricks in Apache Arrow format. Arrow is a column-based binary format that is very efficient across the wire when compared to alternatives like JSON or CSV. This saves you network bandwidth if you are pulling millions of rows per query. And if your Python application already works with Arrow tables (pandas supports this), then you can get those directly using the _arrow() fetch methods, avoiding the CPU time of deserializing the results.
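For illustration, a minimal sketch of that Arrow fetch path using the connector's fetchall_arrow() method; the connection parameters and table name are placeholders:

```python
# Sketch: fetch a query result as a pyarrow.Table via the _arrow() methods,
# skipping per-row conversion to Python objects.
# server_hostname, http_path, access_token and the table are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="example.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123",
    access_token="dapi-example-token",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM samples.nyctaxi.trips LIMIT 1000")
        table = cursor.fetchall_arrow()  # pyarrow.Table
        df = table.to_pandas()           # optional hand-off to pandas
```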

If your application doesn't use Arrow tables, the results must be deserialized from Arrow → Python built-in types. Deserialization is tricky because, if not done correctly, numeric results can lose precision.

An additional complicating factor is that larger result sets are sent in batches that must be re-assembled by the connector.

Databricks SQL Connector uses pyarrow to stitch together Arrow batches and to deserialize them on-demand. pyarrow and its dependencies constitute the bulk of the installed size. Without pyarrow the package would drop below 100MB.

The basic approach to making pyarrow optional is to replace what pyarrow does for free:

  1. Optionally fetch results in a format other than Arrow
  2. Implement a mechanism to stitch partial results into a complete result
  3. Implement deserialization logic for converting to Python built-in types

Item 3 is no small task, which is why this effort could take some time.
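For anyone curious what pyarrow is buying here, a sketch (illustration only, not the connector's actual internals) of items 2 and 3 as pyarrow handles them today:

```python
# Illustration of what pyarrow currently provides "for free": stitching
# partial Arrow batches into one table, then converting to Python built-ins.
import pyarrow as pa

# Pretend these are the partial result batches returned by the server.
batches = [
    pa.RecordBatch.from_pydict({"id": [1, 2], "amount": [1.5, 2.5]}),
    pa.RecordBatch.from_pydict({"id": [3], "amount": [3.5]}),
]

# Item 2: reassemble partial results into a complete result.
table = pa.Table.from_batches(batches)

# Item 3: deserialize into Python built-in types.
rows = table.to_pylist()  # [{'id': 1, 'amount': 1.5}, {'id': 2, 'amount': 2.5}, ...]
```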

MichaelAnckaert commented 8 months ago

Thanks @susodapop for providing some more insight on why this is a big change!

revanshine commented 5 months ago

We ran into this issue yesterday, and solved it by deploying a Lambda container image. Since we are using the AWS CDK, the changes required to deploy as a Lambda container image were trivial (see figures below). It's true the startup time is not as great as a standard Lambda runtime, but we don't think this difference will have any impact on us.

[Screenshots: AWS CDK changes to deploy the function as a Lambda container image]
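Since the screenshots didn't survive, here is a rough sketch of the kind of change involved, assuming the CDK v2 Python API; the construct names and Dockerfile directory are placeholders, not the poster's actual code:

```python
# Sketch: define the Lambda from a container image instead of a zip package.
# Container images are not subject to the 250 MB unzipped package limit
# (the image limit is 10 GB), so databricks-sql-connector fits.
from aws_cdk import Stack, aws_lambda as _lambda
from constructs import Construct

class EtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        _lambda.DockerImageFunction(
            self, "DatabricksEtlFunction",
            # Builds the image from a local directory containing a Dockerfile.
            code=_lambda.DockerImageCode.from_image_asset("lambda/"),
        )
```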

jprakash-db commented 2 weeks ago

@MichaelAnckaert @icj217 (cc: @gopalldb @yunbodeng-db) We have recently split databricks-sql-python into the complete version and a non-pyarrow-based databricks-sql-connector-core (a smaller package). This is the lean version of the Python connector without PyArrow: https://pypi.org/project/databricks-sql-connector-core/ . Do let us know if this works.
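For anyone who wants to try it, a minimal sketch, assuming the lean package keeps the same databricks.sql entry point and the plain (non-Arrow) fetch methods; verify against the PyPI page above. Connection parameters are placeholders.

```python
# pip install databricks-sql-connector-core   (instead of databricks-sql-connector)
#
# Assumption (check the PyPI page): the lean package exposes the same
# `databricks.sql` API, with only the pyarrow-based features unavailable.
from databricks import sql

connection = sql.connect(
    server_hostname="example.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/abc123",          # placeholder
    access_token="dapi-example-token",               # placeholder
)
cursor = connection.cursor()
cursor.execute("SELECT 1 AS ok")
print(cursor.fetchall())  # rows as Python built-in types, no pyarrow required
cursor.close()
connection.close()
```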