delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Building delta-rs layer that's compatible with aws-sdk-pandas for AWS Lambda #1108

Open MrPowers opened 1 year ago

MrPowers commented 1 year ago

delta-rs was included as an optional dependency in aws-sdk-pandas, but that means it's not included in the pre-built layer, so it's still hard to use delta-rs in AWS Lambda functions.

aws-sdk-pandas is a popular project because it includes pre-built layers in its releases; see here for an example. Building a Python layer is hard. You can't just build a delta-rs layer on a Mac and then upload it to AWS Lambda. Your layer needs to be built against a specific Linux version using Docker.
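
For illustration, a minimal sketch of such a Docker-based build, assuming the public AWS SAM build image for the target runtime (the image tag and runtime version are illustrative):

    mkdir -p python
    # install into python/ inside an Amazon Linux image matching the Lambda runtime
    docker run --rm -v "$PWD":/work -w /work public.ecr.aws/sam/build-python3.9 \
        pip install deltalake -t python
    zip -r layer.zip python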

aws-sdk-pandas takes the painful "layer building" step out of AWS Lambda for Python programmers. Python programmers can simply grab the pre-built layer, attach it to their AWS Lambda environment, and immediately create a Lambda function that uses pandas. Without this pre-built layer, many Python programmers simply wouldn't be able to use Lambda. It's just too hard to build Python layers right (using Docker is hard, and even when everything is done correctly, there can be size limit issues).

Multiple layers can be attached to an AWS Lambda function. Hopefully, we can just build a delta-rs AWS Lambda layer that can be attached to a Lambda function alongside aws-sdk-pandas and everything will just work. Here are the layers attached to one aws-sdk-pandas release, for example (the project used to be called awswrangler).
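
For reference, attaching several layers to a function is a single AWS CLI call; the ARNs below are placeholders, not real published layers:

    aws lambda update-function-configuration \
        --function-name my-delta-function \
        --layers \
            arn:aws:lambda:us-east-1:123456789012:layer:AWSSDKPandas-Python39:1 \
            arn:aws:lambda:us-east-1:123456789012:layer:deltalake:1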

I think we need to put in the legwork here, figure out how to build the release that works, and then write a blog post. Then we can figure out how to create some sort of CI task that automatically builds all the layers when a release is made.

nkarpov commented 1 year ago

I tried to get this working but was unable to; maybe this will help others:

(using x86_64 Linux)

  1. Set up the layer directory (https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html):
    mkdir python
    pip3 install deltalake -t python

Result:

[ec2-user@ip-172-31-62-15 site-packages]$ ls -l
total 92
drwxrwxr-x  2 ec2-user ec2-user    66 Jan 27 17:12 bin
drwxrwxr-x  3 ec2-user ec2-user   176 Jan 27 17:12 deltalake
drwxrwxr-x  3 ec2-user ec2-user   104 Jan 27 17:12 deltalake-0.7.0.dist-info
drwxrwxr-x 18 ec2-user ec2-user  4096 Jan 27 17:12 numpy
drwxrwxr-x  2 ec2-user ec2-user   158 Jan 27 17:12 numpy-1.21.6.dist-info
drwxrwxr-x  2 ec2-user ec2-user   120 Jan 27 17:12 numpy.libs
drwxrwxr-x 11 ec2-user ec2-user  4096 Jan 27 17:12 pyarrow
drwxrwxr-x  2 ec2-user ec2-user   148 Jan 27 17:12 pyarrow-11.0.0.dist-info
drwxrwxr-x  2 ec2-user ec2-user    46 Jan 27 17:12 __pycache__
drwxrwxr-x  2 ec2-user ec2-user    81 Jan 27 17:12 typing_extensions-4.4.0.dist-info
-rw-rw-r--  1 ec2-user ec2-user 80078 Jan 27 17:12 typing_extensions.py
  2. All those dependencies are part of aws-sdk-pandas, so zip up only deltalake:

    zip -r layer.zip python/deltalake python/deltalake-0.7.0.dist-info
  3. Create the layer in the AWS console using zip upload; select the Python 3.7 runtime & x86_64 platform.

  4. Run a simple Lambda function, attaching the aws-sdk-pandas layer & the custom layer:

    
    import json
    import deltalake

    def lambda_handler(event, context):
        # TODO implement
        return {
            'statusCode': 200,
            'body': json.dumps('Hello from Lambda!')
        }

The function fails with the following response:

    {
      "errorMessage": "Unable to import module 'lambda_function': No module named 'pyarrow._dataset'",
      "errorType": "Runtime.ImportModuleError",
      "stackTrace": []
    }



I've verified that the layer structure is the same as the aws-sdk-pandas layer (all libs are just in `/python`), and that `pyarrow._dataset` is available in aws-sdk-pandas, but for some reason this is still happening. It's possible this error is bogus and has something to do with build environments. Also, if you try to zip up all the dependencies together (as opposed to just deltalake), the zip file is too big (>55 MB) and AWS won't let you upload that layer.
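
One hypothetical way to narrow this down: check whether the pyarrow each layer ships actually bundles the dataset extension, both in the archive and at import time (file and layer names here are illustrative):

    # is the native extension present in the layer archive at all?
    unzip -l aws-sdk-pandas-layer.zip | grep -i _dataset
    # does it import inside the same Amazon Linux environment?
    python3 -c "import pyarrow._dataset; print('ok')"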

This would probably work if I used Docker to build a custom image for the whole Lambda, but that's much heavier. I'm hoping we can find a more user-friendly way: a thin layer people can just attach.

nkarpov commented 1 year ago

The pyarrow build we need just got merged into aws-sdk-pandas 🥳 🥳 https://github.com/aws/aws-sdk-pandas/pull/1977

The last step now is to create the builds for our layers in this repo. A few questions:

  1. Where in the repo should this live, and how should it be built? I think the approach aws-sdk-pandas takes (https://github.com/aws/aws-sdk-pandas/blob/main/building/build-lambda-layers.sh) is a reasonable model to follow; see the sketch after this list.
  2. How can we integrate this with the existing deltalake release process? It would be great if the Lambda layer zip archive were automatically published as part of future releases.
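
For discussion, a rough sketch of what such a build step could look like, loosely modeled on that script; the image tag, version pinning, and zip naming are assumptions, not a working pipeline:

    #!/usr/bin/env bash
    set -euo pipefail
    VERSION="${1:?usage: build-layer.sh <deltalake-version>}"

    rm -rf python
    # build inside an image matching the Lambda runtime, without dependencies
    docker run --rm -v "$PWD":/work -w /work public.ecr.aws/sam/build-python3.9 \
        pip install --no-deps "deltalake==${VERSION}" -t python
    zip -r "deltalake-${VERSION}-py3.9-x86_64.zip" python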

@wjones127, any thoughts? Looking for some consensus before plowing ahead.

houqp commented 1 year ago

Where in the repo should this live, and how should it be built?

I think we can just create a new top-level folder within the repo for this.

How can we integrate this with the existing deltalake release process?

If this is for aws-sdk-pandas, then we could release it as part of the Python release pipeline; it's just a bunch of extra binaries to build and publish, right?

nkarpov commented 1 year ago

There are at least 2 cases I can think of for Python users:

  1. Using deltalake with aws-sdk-pandas
  2. Using deltalake by itself

For (1), we can build and publish a deltalake layer without its dependencies, because they are included in aws-sdk-pandas.

For (2), users will still have to build a separate layer with pyarrow and numpy, because the Lambda layer size constraints (<55 MB) prevent us from bundling them in a single layer.
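
As a sketch of (2), the split could look like this (directory and zip names are illustrative); each archive stays under the per-layer upload limit, though the combined unzipped size still counts against the 250 MB function limit:

    # layer 1: deltalake only
    pip install --no-deps deltalake -t deltalake-layer/python
    (cd deltalake-layer && zip -r ../deltalake-layer.zip python)

    # layer 2: its heavyweight dependencies
    pip install pyarrow numpy -t deps-layer/python
    (cd deps-layer && zip -r ../deps-layer.zip python)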

I will submit a PR for (1) in the python dir, since it's specific to supporting the Python release. Perhaps we can simply add to the README in that directory, since it already includes Python build instructions? Whoever manages the release can then publish this layer along with the remaining binaries.

For (2), we can publish the dependencies too, or just include the instructions. I'm in favor of publishing them for the sake of user experience (users who don't typically build packages are likely to get stuck), but I understand if we want to keep the releases from this repo pure. I suspect most users will want all the extras that come with (1) in most cases anyway.

mattfysh commented 1 year ago

You can just squeeze deltalake and aws-sdk-pandas into a single Lambda function together, but it's very close to breaching the 250 MB limit. Follow the instructions here: https://delta.io/blog/2023-04-06-deltalake-aws-lambda-wrangler-pandas/

This is a great way to have Lambda run lightweight queries on your Delta tables.

veronewra commented 6 months ago

Hey! I'm interested in contributing, and this is labeled as a good first issue, but I'm having trouble figuring out what's left to be done here. Sorry if this is a silly question!

tijmenr commented 1 month ago

You can just squeeze deltalake and aws-sdk-pandas into a single Lambda function together, but it's very close to breaching the 250 MB limit. Follow the instructions here: https://delta.io/blog/2023-04-06-deltalake-aws-lambda-wrangler-pandas/

It barely fit with last year's versions, but today the deltalake 0.18.2 package for Python 3.8+, manylinux2014_x86_64 is 100 MB (almost all of it in _internal.abi3.so). The AWS SDK for Pandas (awswrangler) in its current version 3.9.0 is over 170 MB. So the combination already goes over the 250 MB limit (and in practice you may also want spare room for the AWS Powertools layer, which provides typing and logging). The simple approach in the mentioned blog post no longer works, so it's back to handcrafting zip archives.
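
For anyone reproducing those numbers, a quick (hypothetical) way to see where the bytes go in an unpacked layer directory:

    pip install deltalake awswrangler -t python
    du -sh python/*/ | sort -rh | head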

I'm wondering if there's any possibility of a smaller deltalake "core" package that would make it more suitable for use in AWS Lambda, possibly in combination with polars. E.g., in the write_deltalake function you have the option to choose a 'rust' engine instead of 'pyarrow', so maybe the (less functional) 'pyarrow' one could be stripped out, also leading to fewer dependencies. Or maybe by stripping out functionality (e.g., does anyone want to run compaction or z-ordering jobs from a Lambda)?

veronewra commented 1 month ago

Maybe we could split the Delta Lake functionality into separate packages/layers? For example, it might make sense to have a separate Lambda function for deltalake operations like vacuum, optimize, FS check, etc., vs. a Lambda function that uses operations like create and read.

tijmenr commented 1 month ago

To me, that sounds like a good idea. But it probably requires some effort, because now everything is in one big shared library.

For reference, in my use case I do not need most of what AWS SDK for Pandas / awswrangler provides, and can even do without pandas. The minimum for me is deltalake, which requires pyarrow (which also gives me a workable table data structure), which in turn requires numpy. But simply installing those still gives a very large layer. So instead, I:

  1. Took pyarrow (currently 16.1.0) and numpy (1.26.4) from the awswrangler 3.9.0 zip (because apparently they made some effort to decrease the sizes of those packages, which I haven't had time to look into yet)
  2. Installed deltalake (0.18.2) and pyarrow-hotfix (0.6) without deps (pip3 install --target ./python --python-version 3.12 --platform manylinux2014_x86_64 --implementation cp --only-binary ":all:" --no-deps deltalake pyarrow-hotfix)
  3. Stripped the debug info from all the .so (and .so.1601) library files in the deltalake and pyarrow packages to further decrease the size (see the sketch below)
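
Step 3 might look roughly like this (a sketch; it assumes the python/ layer layout used above and a Linux build host):

    # strip debug symbols from the native extensions to shrink the layer
    find python -name '*.so*' -type f -exec strip --strip-debug {} +
    du -sh python   # verify the result now fits within the limits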

This gives me a layer that fits within the size limit (even when used in combination with the AWS Powertools layer). However, it's not really a nice/clean approach to put in e.g. a future-proof build pipeline.

veronewra commented 1 month ago

Thanks for that example! Is the delta-rs layer just a binary that users can build using cargo in this case? Maybe we can use Rust features to let users pick what goes into the layer.

tijmenr commented 1 month ago

Thanks for that example! Is the delta-rs layer just a binary that users can build using cargo in this case?

The layer itself, probably not:

A Lambda layer is a .zip file archive that contains supplementary code or data. Layers usually contain library dependencies, a custom runtime, or configuration files. (https://docs.aws.amazon.com/lambda/latest/dg/chapter-layers.html)

So, in the context of AWS Lambda functions implemented in Python, a layer usually contains a set of Python modules bundled together in a zip file. You probably won't create such a layer directly from Cargo.

However, the Python deltalake module is built using Cargo: https://github.com/delta-io/delta-rs/tree/main/python. It should be possible to use features to determine what goes into the module (at least for the Rust part), and then include this custom module in a layer. While the "backend" of the module is a shared library built in Rust, the "frontend" is regular Python, which does not really have conditional compilation. How would we influence the Python part based on features?
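
On the Rust side, a sketch of what a feature-gated build of the bindings could look like; the feature name is purely hypothetical and would have to be defined in the crate first (the bindings are built with maturin):

    cd delta-rs/python
    # hypothetical: skip default features, enable only a slimmed-down set
    maturin build --release --no-default-features --features slim-lambda

On the Python side, the usual pattern is to guard optional functionality at import time (try/except ImportError, or checking for an attribute on the compiled module) and raise a clear error when a feature was compiled out.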