Closed bitsofinfo closed 2 years ago
+1
There is only one tag available for the amazon/aws-glue-libs image, i.e. glue_libs_1.0.0_image_01
https://hub.docker.com/r/amazon/aws-glue-libs/tags?page=1&ordering=last_updated
AFAIK, Glue 2.0 uses the same set of libraries as Glue 1.0. It's just that Glue 2.0 uses a different mechanism to execute jobs.
As for the GlueContext and DynamicFrame APIs, they are the same as in Glue 1.0.
Hi, my understanding is that Glue 2.0 runs Python 3.7 (instead of 3.6 / 2.7 in Glue 1.0). It would be nice to have a native Docker image for v2.0.
The older version of Python in the existing image glue_libs_1.0.0_image_01 makes a difference for work I'm trying to do. Having a new image with Python 3.7 would be very helpful even if nothing else changed.
In order to do any kind of local development with Spark Streaming, the newer version of Glue (3.0) is needed, yet there is still no Docker image for even 2.0? It's not just a case of dropping the new versions into the older Docker images either.
There is a docker tag released for 2.0 now: https://hub.docker.com/layers/amazon/aws-glue-libs/glue_libs_2.0.0_image_01/images/sha256-4c66269c4373f0fc18d62d11a3c412a659ff2235660781b3c026f26597c2d5db?context=explore
Is there any clarification as to how to use this? AWS documentation all refers back to the 1.0.0 version for developing locally, and attempting to run the 2.0.0 image results in errors.
We apologize for the delay.
The Docker image for Glue 3.0/2.0 is now officially available. Here's the blog post for it: https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/
Ok but why are you not releasing the Dockerfile source for this?
What about the ability to include extra dependencies into this image for particular use cases? Or building a custom image / augmenting the current one based on the Glue Libs to facilitate development locally?
I really think you need to be better in touch with the people actively developing on Glue because the whole process is severely lacking and people are simply moving elsewhere
"the whole process is severely lacking and people are simply moving elsewhere"
agreed
Ok but why are you not releasing the Dockerfile source for this?
Thank you for the feedback. We do not release the Dockerfile at this time, but we will treat this feedback as a request for future improvement.
What about the ability to include extra dependencies into this image for particular use cases? Or building a custom image / augmenting the current one based on the Glue Libs to facilitate development locally?
As you can see in the appendix of the blog post, there are several ways to bring extra dependent libraries.
I really think you need to be better in touch with the people actively developing on Glue because the whole process is severely lacking and people are simply moving elsewhere
We apologize for the lack of recent conversation.
As you can see in the appendix of the blog post, there are several ways to bring extra dependent libraries.
This makes so many assumptions as to the workflow that people are using that it is borderline comical.
A prime example: we use Python to develop on this. How do I simply add Python dependencies? (I am NOT using notebooks.)
You could at least alleviate this by adding a step to the Dockerfile that installs requirements.txt if it exists. This would solve a lot of issues and create reasonable flexibility for people who are not using an archaic spark-submit workflow.
Release the Dockerfile and you will not have to be responsible for supporting every single use case out there, and we won't have to wait an entire year for something like this to be released. People can simply add the steps they require.
I raise these points for the other people out there who are considering starting with AWS Glue. We have already moved on, because this level of support from an enterprise cloud provider is frankly laughable.
For such Python dependency installation, you can run the pip3 install command inside the container.
The container is running on your own host, so you have full control.
$ docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01
[glue_user@9cb0772c4a1f workspace]$ pip3 install awswrangler
I'm aware we can do this, but why would we run commands manually for this? The majority of this stuff is deployed via infrastructure as code using immutable infrastructure techniques. Have you spoken to users about how they actually run this stuff in real terms? Because I am not sure you have. The number of times I have had to re-create the Docker images, run this on multiple environments, etc. makes this prohibitive.
Implementing pip install -r requirements.txt if the file exists should be something simple to do, and it puts the control into the hands of the user.
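For illustration, the conditional install being asked for could look roughly like the shell function below. This is only a sketch: the function name install_requirements and the PIP_CMD override are hypothetical and are not part of the official image, and it assumes pip3 is on the PATH, as it is in the interactive example above.

```shell
# Hypothetical helper: install Python dependencies only if a requirements
# file is present. PIP_CMD is an illustrative override, defaulting to pip3.
install_requirements() {
    req_file="${1:-requirements.txt}"
    if [ -f "$req_file" ]; then
        # Install everything listed in the requirements file.
        ${PIP_CMD:-pip3} install -r "$req_file"
    else
        # Nothing to install; report it and carry on.
        echo "No $req_file found, skipping install"
    fi
}
```

Called with no argument, it looks for requirements.txt in the current directory, so a missing file is a no-op rather than an error.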
This is a great use case: https://aws.amazon.com/quickstart/architecture/utility-meter-data-analytics-platform/
Now apply your thinking to this: how would you extend this software using the libraries that AWS currently provides?
Also keep in mind that 3.7 is a very awkward version of Python when it comes to cyclical dependencies - many things cannot be installed upon it as the libraries are already targeting 3.8 & 3.9
Alternatively, you can create your own Dockerfile using Glue's official Docker image as the base image. I thought this would fit your use case, but please let us know if it does not.
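As a rough sketch of that approach (not an official AWS Dockerfile), the shell snippet below writes a small Dockerfile that layers a requirements.txt on top of the official image. The glue-dev directory and the /tmp/requirements.txt path are illustrative choices. It also assumes, without having been tested against the real image, that pip3 install works during a build the same way it does in the interactive session shown earlier in this thread.

```shell
# Write a hypothetical Dockerfile that extends the official Glue image.
# The directory name and file paths are illustrative assumptions.
mkdir -p glue-dev
cat > glue-dev/Dockerfile <<'EOF'
# Base image tag taken from the docker run example in this thread.
FROM amazon/aws-glue-libs:glue_libs_3.0.0_image_01

# Copy the project's Python dependency list into the image.
COPY requirements.txt /tmp/requirements.txt

# Assumption: pip3 is usable by the image's default user at build time,
# as it is in the interactive "pip3 install awswrangler" example.
RUN pip3 install -r /tmp/requirements.txt
EOF
```

With a requirements.txt placed in the same directory, the image could then be built with something like `docker build -t glue-dev glue-dev/` (the glue-dev tag is a placeholder) and used in place of the official tag in the docker run command.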
And thank you for the feedback on the Python version. Currently the aws-glue-libs repo and the Docker images use the same Python runtime as the Glue job system. When the Glue job system upgrades its Python version, the repo and the Docker image will move to the newer Python version as well.
This also doesn't help. Specific package upgrades need to be made, which breaks the entire dependency chain, and because the base image is Python 3.7, package versions cannot be upgraded on top of it.
This doc references the image tag glue_libs_2.0.0_image_01 ... is this supposed to exist? How do I do local dev for Glue 2.0? https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/