awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Glue 2.0: glue_libs_2.0.0_image_01? #81

Closed by bitsofinfo 2 years ago

bitsofinfo commented 3 years ago

This doc references the image tag glue_libs_2.0.0_image_01... Is this supposed to exist? How do I do local dev for Glue 2.0?

https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/

bitsofinfo commented 3 years ago

https://github.com/awslabs/aws-glue-libs/issues/51

PawaritL commented 3 years ago

+1

svajiraya commented 3 years ago

There is only one tag available for the amazon/aws-glue-libs image, i.e. glue_libs_1.0.0_image_01.

https://hub.docker.com/r/amazon/aws-glue-libs/tags?page=1&ordering=last_updated

AFAIK, Glue 2.0 uses the same set of libraries as Glue 1.0. It's just that Glue 2.0 uses a different mechanism to execute jobs.

As for the GlueContext and DynamicFrame APIs, they are the same as in Glue 1.0.

niamiot commented 3 years ago

Hi, my understanding is that Glue 2.0 runs Python 3.7 (instead of 3.6 / 2.7 in Glue 1.0). It would be nice to have a native Docker image for v2.0.

cowlike commented 3 years ago

The older version of Python in the existing image glue_libs_1.0.0_image_01 makes a difference for work I'm trying to do. Having a new image with Python 3.7 would be very helpful even if nothing else changed.

voycey commented 3 years ago

In order to do any kind of local development with Spark Streaming, the newer version of Glue (3.0) is needed, yet there is still no Docker image for even 2.0? It's not just a case of dropping the new versions into the older Docker images either.

mwoods-familiaris commented 2 years ago

There is a docker tag released for 2.0 now: https://hub.docker.com/layers/amazon/aws-glue-libs/glue_libs_2.0.0_image_01/images/sha256-4c66269c4373f0fc18d62d11a3c412a659ff2235660781b3c026f26597c2d5db?context=explore

Is there any clarification as to how to use this? AWS documentation all refers back to the 1.0.0 version for developing locally, and attempting to run the 2.0.0 image results in errors.

moomindani commented 2 years ago

We apologize for the delay.

Docker images for Glue 3.0 and 2.0 are now officially available. Here's the blog post: https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/

voycey commented 2 years ago

Ok but why are you not releasing the Dockerfile source for this?

What about the ability to include extra dependencies into this image for particular use cases? Or building a custom image / augmenting the current one based on the Glue Libs to facilitate development locally?

I really think you need to be better in touch with the people actively developing on Glue because the whole process is severely lacking and people are simply moving elsewhere

bitsofinfo commented 2 years ago

"the whole process is severely lacking and people are simply moving elsewhere"

agreed

moomindani commented 2 years ago

"Ok but why are you not releasing the Dockerfile source for this?"

Thank you for the feedback. We do not release the Dockerfile at this time, but we will treat this feedback as a request for future improvement.

"What about the ability to include extra dependencies into this image for particular use cases? Or building a custom image / augmenting the current one based on the Glue Libs to facilitate development locally?"

As you can see in the appendix of the blog post, there are several ways to bring in extra dependent libraries.

"I really think you need to be better in touch with the people actively developing on Glue because the whole process is severely lacking and people are simply moving elsewhere"

We apologize for the lack of recent conversation.

voycey commented 2 years ago

"As you can see in the appendix of the blog post, there are several ways to bring extra dependent libraries."

This makes so many assumptions as to the workflow that people are using that it is borderline comical.

A prime example: we use Python to develop on this. How do I simply add Python dependencies? (I am NOT using notebooks.)

You could at least alleviate this by adding a step to the Dockerfile that installs requirements.txt if it exists. This would solve a lot of issues and give reasonable flexibility to people who are not using an archaic spark-submit workflow.

Release the Dockerfile and you will not have to be responsible for supporting every single use case out there, and we won't have to wait an entire year for something like this to be released. People can simply add the steps they require.

I raise these points for the other people out there who are considering starting with AWS Glue. We have already moved on, because this level of support from an enterprise cloud provider is frankly laughable.

moomindani commented 2 years ago

To install such Python dependencies, you can run the pip3 install command inside the container. The container runs on your own host, so you have full control.

$ docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01
[glue_user@9cb0772c4a1f workspace]$ pip3 install awswrangler

voycey commented 2 years ago

I'm aware we can do this, but why would we run commands manually for this? The majority of this stuff is deployed via infrastructure as code using immutable infrastructure techniques. Have you spoken to users about how they actually run this stuff in real terms? Because I am not sure you have. The number of times I have had to re-create the Docker images, run this on multiple environments, etc. makes this prohibitive.

Implementing pip install -r requirements.txt if the file exists should be simple to do and would put control in the hands of the user.
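As a rough illustration of the requirements.txt idea above, here is a minimal sketch of a shell helper that installs Python dependencies only when a requirements file is present. The function name install_requirements_if_present and the PIP_CMD variable are hypothetical and not part of the Glue image. How such a step would be wired into the official image's startup is also an assumption, since the Dockerfile is not published.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the official Glue image.
# Installs Python dependencies only if the given requirements file exists.
# PIP_CMD is an illustrative override for the pip executable (defaults to pip3).
install_requirements_if_present() {
    local req_file="${1:-requirements.txt}"
    local pip_cmd="${PIP_CMD:-pip3}"

    if [ -f "$req_file" ]; then
        echo "Installing dependencies from $req_file"
        "$pip_cmd" install -r "$req_file"
    else
        echo "No $req_file found, skipping dependency installation"
    fi
}
```

Calling install_requirements_if_present with no argument checks for requirements.txt in the current directory, so the same script can be reused across environments without manual steps inside the container.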

This is a great use case: https://aws.amazon.com/quickstart/architecture/utility-meter-data-analytics-platform/

Now apply your thinking to this: how would you extend this software using the libraries that AWS currently provides?

Also keep in mind that 3.7 is a very awkward version of Python when it comes to cyclical dependencies. Many things cannot be installed on it, as the libraries are already targeting 3.8 & 3.9.

moomindani commented 2 years ago

Alternatively, you can create your own Dockerfile using the official Glue Docker image as the base image. I think this would fit your use case, but please let us know if it does not.
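A minimal sketch of that approach follows. It assumes the official image's default user can run pip3 (as in the interactive example earlier in this thread) and that a requirements.txt sits next to the generated Dockerfile. The helper name write_glue_dockerfile and the image tag my-glue-dev are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Dockerfile that extends the official Glue image
# and installs Python dependencies from requirements.txt at build time.
# The base image defaults to the 3.0 tag mentioned in this thread.
write_glue_dockerfile() {
    local base_image="${1:-amazon/aws-glue-libs:glue_libs_3.0.0_image_01}"
    cat <<EOF
FROM ${base_image}
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -r /tmp/requirements.txt
EOF
}

# Example usage (requires Docker and a requirements.txt in the current directory):
#   write_glue_dockerfile > Dockerfile
#   docker build -t my-glue-dev .
```

Because the dependencies are baked in at build time, the resulting image can be rebuilt reproducibly instead of running pip3 by hand in each container. It does not change the Python version of the base image.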

And thank you for the feedback on the Python version. Currently the aws-glue-libs repo and the Docker images use the same Python runtime as the Glue job system. When the Glue job system upgrades its Python version, the repo and the Docker image will get the newer Python version as well.

voycey commented 2 years ago

This also doesn't help. Specific package upgrades need to be made, which breaks the entire dependency chain, and because the base image is Python 3.7, package versions cannot be upgraded beyond what a Python 3.7 base image supports.