icj217 opened this issue 1 year ago
You're not alone in this. We would like to prune the overall package size. It should be easy enough to remove the pandas dependency. But we depend on arrow, which also requires numpy. It's not clear to me how we can work around that.
I know this is not a solution, but I also went through this rabbit hole while trying to make it work with Lambdas.
One solution is using Docker images on Lambda: https://docs.aws.amazon.com/lambda/latest/dg/images-create.html
Keep in mind that the larger the image, the longer the Lambda cold start will be, which somewhat defeats the purpose, but it is a valid solution for running code that bundles to more than 250 MB.
I ended up going with Docker on ECS, but you can make it work on Lambda.
Just coming across this same issue now. I'd hoped that there was a public Lambda layer built with all this in it already (which is how I get round this with Datadog) but seems not 😢
Looks like ECS for me...
The 250 MB limit includes Lambda layers, so it's either a Docker image on Lambda or ECS.
Hey all, just writing to let you know that this is now an area of focus for us in engineering. We're working up the design changes to get our default install size down to something much more reasonable (hoping for <60 MB, perhaps smaller).
Just wanted to see if this is still a focus on the Databricks end?
We currently have a large number of ETL pipelines where part of the processing happens in AWS Lambda. Because the installed size of this package is too large, we can't write to our Delta Lake from AWS Lambda.
The workarounds (such as using a Docker image) are viable but very annoying and inefficient: our build process is twice as long and much more complex compared to using a 'simple' AWS Lambda function.
I'm not working on this connector anymore. But I've learned some things since I wrote this comment which I might as well record here, in case they help someone:
This connector fetches results from Databricks in Apache Arrow format. Arrow is a column-based binary format that is very efficient across the wire compared to alternatives like JSON or CSV. This saves you network bandwidth if you are pulling millions of rows per query. And if your Python application already works with Arrow tables (pandas supports this), you can get those directly using the `_arrow()` fetch methods, avoiding the CPU time of deserializing the results.
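For illustration, a minimal sketch of that Arrow path, assuming the connector's documented `fetchall_arrow()` method; the hostname, token, and sample table are placeholders:

```python
from databricks import sql

# Placeholders: supply your own workspace hostname, HTTP path, and token.
with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM samples.nyctaxi.trips LIMIT 100000")
        table = cursor.fetchall_arrow()  # a pyarrow.Table, no per-row conversion
        df = table.to_pandas()           # hand off to pandas without re-parsing
```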
If your application doesn't use Arrow tables, the results must be deserialized from Arrow → Python built-in types. Deserialization is tricky because, if not done correctly, numeric results can lose precision.
An additional complicating factor is that larger result sets are sent in batches that must be re-assembled by the connector.
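As a rough illustration of the precision concern (a sketch using plain `pyarrow`, not the connector's internals): a DECIMAL value survives the Arrow round trip as `decimal.Decimal`, whereas forcing it through a float silently loses digits.

```python
import decimal
import pyarrow as pa

exact = decimal.Decimal("1234567890.123456789012345678")

# Arrow keeps the value as a fixed-precision decimal...
col = pa.array([exact], type=pa.decimal128(38, 18))
print(col.to_pylist()[0])             # 1234567890.123456789012345678

# ...while a naive float conversion cannot represent it exactly.
print(decimal.Decimal(float(exact)))  # ~1234567890.1234567165..., trailing digits lost
```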
The Databricks SQL Connector uses `pyarrow` to stitch together Arrow batches and to deserialize them on demand. `pyarrow` and its dependencies constitute the bulk of the installed size; without `pyarrow`, the package would drop below 100 MB.
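To make concrete what "stitching batches" means, here is a rough sketch of the kind of work `pyarrow` handles for the connector; the function and variable names are illustrative, not the connector's actual code:

```python
import pyarrow as pa
import pyarrow.ipc

def batches_to_table(stream_bytes: bytes) -> pa.Table:
    """Reassemble an Arrow IPC stream (as fetched over the wire) into one Table."""
    reader = pa.ipc.open_stream(stream_bytes)
    batches = list(reader)                 # one RecordBatch per chunk
    return pa.Table.from_batches(batches)  # concatenate without copying row data
```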
The basic approach to making `pyarrow` optional is to replace what `pyarrow` does for free:
Item 3 is no small task, which is why this effort could take some time.
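One way to picture "making pyarrow optional" is the usual optional-dependency pattern, sketched below with hypothetical names; the hard part, a pure-Python fallback for Arrow deserialization that preserves numeric precision, is exactly what the effort above has to supply.

```python
try:
    import pyarrow as pa
except ImportError:  # lean install without pyarrow
    pa = None

def rows_from_batches(raw_batches):
    """Turn fetched Arrow record batches into plain Python rows."""
    if pa is not None:
        return pa.Table.from_batches(raw_batches).to_pylist()
    # A pure-Python Arrow reader would have to go here; that is the piece
    # that makes this effort non-trivial.
    raise NotImplementedError("install pyarrow or wait for the lean code path")
```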
Thanks @susodapop for providing some more insight on why this is a big change!
We ran into this issue yesterday, and solved it by deploying a Lambda container image. Since we are using the AWS CDK, the changes required to deploy as a Lambda container image were trivial (see figures below). It's true the startup time is not as good as with a standard Lambda runtime, but we don't think this difference will have any impact on us.
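The figures aren't reproduced here, but for reference a Lambda container image in CDK v2 Python looks roughly like the sketch below; construct IDs, paths, and sizing are placeholders, not the commenter's actual stack:

```python
from aws_cdk import Duration, aws_lambda as _lambda
from constructs import Construct

def add_databricks_etl_fn(scope: Construct) -> _lambda.DockerImageFunction:
    return _lambda.DockerImageFunction(
        scope, "DatabricksEtlFn",
        # Builds from a local Dockerfile that pip-installs databricks-sql-connector,
        # sidestepping the 250 MB zip/layer limit.
        code=_lambda.DockerImageCode.from_image_asset("lambda/databricks_etl"),
        memory_size=1024,
        timeout=Duration.minutes(5),
    )
```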
@MichaelAnckaert @icj217 (cc: @gopalldb @yunbodeng-db) We have recently implemented the split for databricks-sql-python: there is the complete version and a non-pyarrow-based databricks-sql-connector-core (a smaller package). This is the lean version of the Python connector without pyarrow: https://pypi.org/project/databricks-sql-connector-core/. Do let us know if this works.
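An untested sketch of what using the lean package presumably looks like, assuming its import path matches the full connector and that results are fetched as plain rows rather than Arrow tables:

```python
# pip install databricks-sql-connector-core
from databricks import sql

with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1 AS one")
        print(cursor.fetchall())  # plain rows, no pyarrow involved
```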
Has anyone else attempted to use this package in an AWS Lambda function, only to find that this package and its dependencies result in a deployment package larger than the AWS limit of 250 MB?
It seems like pandas + pyarrow + numpy are the main culprits of this.
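If you want to verify which dependencies dominate the bundle, a quick (unofficial) way is to sum the installed file sizes per distribution:

```python
from pathlib import Path
from importlib import metadata

# Total on-disk size of each installed distribution in the current environment.
sizes = {}
for dist in metadata.distributions():
    total = 0
    for f in dist.files or []:
        p = Path(f.locate())
        if p.is_file():
            total += p.stat().st_size
    sizes[dist.metadata["Name"]] = total

# Print the ten largest packages.
for name, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name:30} {size / 1e6:8.1f} MB")
```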