awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Container version amazon/aws-glue-libs:glue_libs_3.0.0_image_01 doesn't work #120

Open chauhansachinkr opened 2 years ago

chauhansachinkr commented 2 years ago

I tried different ways, based on the original blog post, but was not able to get Jupyter working with the image glue_libs_3.0.0_image_01: https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/

Question: Can you publish short instructions on how to start the Docker image with the correct parameters so that the Glue environment works?

-rwxr-xr-x 1 glue_user root 664 Dec 20 16:57 jupyter_start.sh
[glue_user@101d6d195144 jupyter]$ ./jupyter_start.sh
starting java -cp /home/glue_user/livy/jars/*:/home/glue_user/livy/conf:/home/glue_user/spark/conf:/home/glue_user/spark/conf: org.apache.livy.server.LivyServer, logging to /home/glue_user/livy/logs/livy--server.out
Starting Jupyter with SSL
[I 2022-01-10 12:57:18.120 ServerApp] jupyterlab | extension was successfully linked.
[I 2022-01-10 12:57:18.134 ServerApp] Writing Jupyter server cookie secret to /home/glue_user/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2022-01-10 12:57:18.436 ServerApp] nbclassic | extension was successfully linked.
[W 2022-01-10 12:57:18.474 ServerApp] All authentication is disabled. Anyone who can connect to this server will be able to run code.
[I 2022-01-10 12:57:18.485 ServerApp] nbclassic | extension was successfully loaded.
[I 2022-01-10 12:57:18.488 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.7/site-packages/jupyterlab
[I 2022-01-10 12:57:18.488 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 2022-01-10 12:57:18.500 ServerApp] jupyterlab | extension was successfully loaded.
[I 2022-01-10 12:57:18.503 ServerApp] Serving notebooks from local directory: /home/glue_user/workspace/jupyter_workspace
[I 2022-01-10 12:57:18.503 ServerApp] Jupyter Server 1.13.1 is running at:
[I 2022-01-10 12:57:18.503 ServerApp] https://101d6d195144:8888/lab
[I 2022-01-10 12:57:18.504 ServerApp] or https://127.0.0.1:8888/lab
[I 2022-01-10 12:57:18.512 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Exception in callback BaseAsyncIOLoop._handle_events(7, 1)
handle: <Handle BaseAsyncIOLoop._handle_events(7, 1)>
Traceback (most recent call last):
  File "/usr/lib64/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib64/python3.7/site-packages/tornado/platform/asyncio.py", line 189, in _handle_events
    handler_func(fileobj, events)
  File "/usr/local/lib64/python3.7/site-packages/tornado/netutil.py", line 276, in accept_handler
    callback(connection, address)
  File "/usr/local/lib64/python3.7/site-packages/tornado/tcpserver.py", line 292, in _handle_connection
    do_handshake_on_connect=False,
  File "/usr/local/lib64/python3.7/site-packages/tornado/netutil.py", line 608, in ssl_wrap_socket
    context = ssl_options_to_context(ssl_options)
  File "/usr/local/lib64/python3.7/site-packages/tornado/netutil.py", line 577, in ssl_options_to_context
    ssl_options["certfile"], ssl_options.get("keyfile", None)
ssl.SSLError: [SSL] PEM lib (_ssl.c:3911)
Exception in callback BaseAsyncIOLoop._handle_events(7, 1)
handle: <Handle BaseAsyncIOLoop._handle_events(7, 1)>
Traceback (most recent call last):
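
The traceback ends inside Tornado's ssl_options_to_context, at the point where the configured certfile and keyfile are loaded. That suggests the certificate or key file could not be parsed as PEM. The sketch below is one way to check this in isolation, assuming Python 3 is available wherever the files live. The function name check_cert_chain and the example paths are hypothetical and are not taken from the image.

```python
import ssl
from typing import Optional


def check_cert_chain(certfile: str, keyfile: Optional[str] = None) -> Optional[str]:
    """Try to load a certificate/key pair the way an SSL-enabled server would.

    Returns None if the pair loads, otherwise a short error description.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        context.load_cert_chain(certfile, keyfile)
    except OSError as exc:
        # ssl.SSLError is a subclass of OSError, so this also covers
        # missing files and unreadable paths.
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    # Hypothetical paths; replace them with the files your Jupyter config points at.
    problem = check_cert_chain("/path/to/server.crt", "/path/to/server.key")
    print(problem or "certificate chain loaded without errors")
```

If the function reports an error for the files the server is configured to use, the failure above is in the certificate material itself and not in the Docker parameters.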

joates-madetech commented 2 years ago

Is this the right repo to report issues with the Docker images? There is also a problem with the 1.0.0 image, which was updated yesterday.

svajiraya commented 2 years ago

@joates-madetech can you please open a new GitHub issue in this repository and provide some information about the issue you are facing with 1.0.0 image?

vmussa commented 2 years ago

I have also tried to run and start a glue_libs_3.0.0_image_01 based container, just like the original blog post tells us to. The container exits right after I do the docker run. Tried different forms, but can't keep the container alive for the docker exec step.

goldengrisha commented 2 years ago

please follow https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/

  1. docker run -itd -p 8888:8888 -p 4040:4040 --env-file /datalab_pocs/glue_local/env_variables.txt --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/jupyter/jupyter_start.sh
  2. docker run -itd -p 8888:8888 -p 4040:4040 -e AWS_ACCESS_KEY_ID=<ID> -e AWS_SECRET_ACCESS_KEY=<Key> -e AWS_REGION=<Region> --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/jupyter/jupyter_start.sh
  3. docker run -itd -p 8888:8888 -p 4040:4040 -v ~/.aws:/root/.aws:ro -v C:\Users\admin\Documents\notebooks:/home/jupyter/jupyter_default_dir --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/jupyter/jupyter_start.sh

Also, check the logs.

vmussa commented 2 years ago

@goldengrisha this is the blogpost the OP and I mentioned we were following. None of these methods you pasted here work for 3.0.0 glue image in my environment, only for 1.0.0. Will check the logs when I have more time though. Thanks anyway.

goldengrisha commented 2 years ago

@vmussa in my case I had 2 issues:

  1. DISABLE_SSL="true"
  2. /home/jupyter/jupyter_start.sh

This works for me:

    docker run -it -p 8888:8888 -p 4040:4040 -e DISABLE_SSL="true" \
    --env-file ~/Projects/glue_config/env_variables.txt \
    --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01

    and then inside the container, just execute: /home/glue_user/jupyter/jupyter_start.sh
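
For anyone scripting this, here is a minimal sketch that assembles the same docker run arguments from Python, assuming Python 3.8+ for shlex.join. The image tag, ports, DISABLE_SSL setting, and start script path are taken from the command above. The function name build_docker_run and the env file name in the example are hypothetical.

```python
import shlex
from typing import List, Optional

IMAGE = "amazon/aws-glue-libs:glue_libs_3.0.0_image_01"
# Path reported to work inside the 3.0.0 container in this thread.
START_SCRIPT = "/home/glue_user/jupyter/jupyter_start.sh"


def build_docker_run(
    env_file: Optional[str] = None,
    name: str = "glue_jupyter",
    disable_ssl: bool = True,
) -> List[str]:
    """Build the argument list for the interactive docker run shown above."""
    args = ["docker", "run", "-it", "-p", "8888:8888", "-p", "4040:4040"]
    if disable_ssl:
        # Without this variable the start script launches Jupyter with SSL.
        args += ["-e", "DISABLE_SSL=true"]
    if env_file:
        args += ["--env-file", env_file]
    args += ["--name", name, IMAGE]
    return args


if __name__ == "__main__":
    # "env_variables.txt" is a placeholder for your own env file.
    print(shlex.join(build_docker_run("env_variables.txt")))
    print(f"then, inside the container, run: {START_SCRIPT}")
```

The sketch only prints the command. It leaves the start script as a separate step inside the container, as in the command above.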

vmussa commented 2 years ago

@goldengrisha thank you very much for the hints. In fact, my problem was that this variable was not set: DISABLE_SSL="true". I guess you figured this out by looking at the logs. Thanks for sharing; in my case it works perfectly.

goldengrisha commented 2 years ago

@vmussa, yeah, you're welcome.

voycey commented 2 years ago

Why not just publish the Dockerfile for these? The developer ecosystem around this is a mess and you are not making it easy for people to use it.

babaMar commented 1 year ago

@svajiraya is there a way to run the container without specifying any command? My use case is to submit jobs from a Python script, so to start with I would need to create a client with boto3 and call create_connection, get_connection, update_connection, get_job, and update_job.

I can run pyspark, but if I try, for example, aws glue get-jobs --no-verify-ssl --endpoint-url http://localhost:4040

I get

<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>                       
<title>Error 405 Method Not Allowed</title>
</head>
<body><h2>HTTP ERROR 405 Method Not Allowed</h2>
<table>
<tr><th>URI:</th><td>/</td></tr>
<tr><th>STATUS:</th><td>405</td></tr>
<tr><th>MESSAGE:</th><td>Method Not Allowed</td></tr>
<tr><th>SERVLET:</th><td>org.apache.spark.ui.JettyUtils$$anon$2-79068023</td></tr>
</table>
<hr><a href="https://eclipse.org/jetty">Powered by Jetty:// 9.4.37.v20210219</a><hr/>

</body>
</html>

I get a similar result if I use port 18080.
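
The SERVLET row in the error page above names org.apache.spark.ui.JettyUtils, which suggests the request on port 4040 was answered by the Spark web UI, not by anything that speaks the Glue API. That would explain the 405. A small sketch for pulling those fields out of such a page follows. The helper names parse_jetty_error and looks_like_spark_ui are hypothetical, and the sample is abbreviated from the response pasted above.

```python
import re
from typing import Dict

# Matches rows such as <tr><th>STATUS:</th><td>405</td></tr>
ROW = re.compile(r"<tr><th>(?P<key>[^:<]+):</th><td>(?P<value>[^<]*)</td></tr>")

SAMPLE = """<body><h2>HTTP ERROR 405 Method Not Allowed</h2>
<table>
<tr><th>URI:</th><td>/</td></tr>
<tr><th>STATUS:</th><td>405</td></tr>
<tr><th>MESSAGE:</th><td>Method Not Allowed</td></tr>
<tr><th>SERVLET:</th><td>org.apache.spark.ui.JettyUtils$$anon$2-79068023</td></tr>
</table>
"""


def parse_jetty_error(html: str) -> Dict[str, str]:
    """Collect the key/value rows of a Jetty error page into a dict."""
    return {m.group("key"): m.group("value") for m in ROW.finditer(html)}


def looks_like_spark_ui(fields: Dict[str, str]) -> bool:
    """True if the responding servlet belongs to the Spark UI package."""
    return fields.get("SERVLET", "").startswith("org.apache.spark.ui.")


if __name__ == "__main__":
    fields = parse_jetty_error(SAMPLE)
    print(fields["STATUS"], fields["MESSAGE"])
    print("answered by Spark UI:", looks_like_spark_ui(fields))
```

If this reading is right, the boto3 Glue client calls listed earlier would need to target an actual AWS Glue endpoint, with credentials available in the container, and not a port published by the local Spark processes.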