aws-samples / aws-concurrent-data-orchestration-pipeline-emr-livy

This code demonstrates the architecture featured on the AWS Big Data blog (https://aws.amazon.com/blogs/big-data/), which creates a concurrent data pipeline using Amazon EMR and Apache Livy. The pipeline is orchestrated by Apache Airflow.
Apache License 2.0
76 stars 33 forks

Http connection issue #6

Open maheshdrago opened 2 years ago

maheshdrago commented 2 years ago

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f33068e10d0>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.8/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='ec2-23-20-80-1.compute-1.amazonaws.com', port=8998): Max retries exceeded with url: /sessions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f33068e10d0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

I get this error after submitting the Spark job. Can anyone help me out?

ewengillies commented 2 years ago

Hi! Judging from the error message, your code is failing here:

https://github.com/aws-samples/aws-concurrent-data-orchestration-pipeline-emr-livy/blob/master/dags/airflowlib/emr_lib.py#L74

This is where Airflow calls Livy to create a Spark session on the Spark cluster. The "Connection timed out" error means Airflow never reached Livy on port 8998, so this is a networking issue.
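For reference, here is a minimal sketch of what that session-creation call amounts to: a POST to Livy's /sessions endpoint with a JSON body naming the session kind. It uses only the standard library rather than requests, and the names build_session_request and create_session, the default kind, and the 10-second timeout are assumptions for illustration, not the code in emr_lib.py.

```python
import json
import urllib.request


def build_session_request(master_dns, port=8998, kind="pyspark"):
    """Build the POST request that asks Livy to start a new session.

    master_dns is the EMR master node's DNS name; port 8998 is the
    port shown in the traceback above. The function name and defaults
    are hypothetical.
    """
    url = "http://{}:{}/sessions".format(master_dns, port)
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(master_dns, timeout=10.0):
    """Send the request and return Livy's JSON reply as a dict.

    An explicit timeout makes an unreachable host fail quickly
    instead of waiting for the OS-level connect timeout.
    """
    request = build_session_request(master_dns)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

If create_session raises a timeout when run from the Airflow machine, the request never reached Livy, which matches the traceback above.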

This is probably because:

Check the security groups on those assets first.
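One quick way to confirm whether a security group is blocking traffic is to test the TCP connection from the machine that runs Airflow, since a rule can allow one source address and still block another. The sketch below is an illustration only; can_reach is a hypothetical helper name and the 5-second timeout is an arbitrary choice.

```python
import socket


def can_reach(host, port=8998, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds in time.

    Timeouts, refused connections, and DNS failures are all
    subclasses of OSError, so each of them yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example, to be run on the Airflow host itself (host name taken from
# the traceback above):
# print(can_reach("ec2-23-20-80-1.compute-1.amazonaws.com"))
```

A False result from the Airflow host, while the same address works elsewhere, would suggest the security group does not allow inbound traffic on 8998 from the Airflow host's address.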

I used this tutorial three years ago to get comfortable with EMR + Airflow, but in general I would advise moving away from calling boto3 from Python code in Airflow and using Airflow's EMR operators directly. Calling boto3 by hand, as this tutorial does, doesn't let you leverage the full scope of Airflow; the operators do.

maheshdrago commented 2 years ago

Hi, thanks for replying!

I do have the Livy port open in the security group.

Screenshot (29)

Also, when I open the Livy UI on port 8998 using the same DNS name, it loads. But the connection fails from Airflow.

Screenshot (30)

Do I need to change my Livy connection? I only have the default connection configured, and I think the problem has something to do with it. The connection is below:

Screenshot (33)

My host is currently set to livy. Is that causing the issue? If so, how can I get the right host each time a new EMR cluster is created?
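One possible way to get the host for a freshly created cluster, sketched under assumptions: boto3's EMR client has a describe_cluster call whose response includes the master node's public DNS name under Cluster.MasterPublicDnsName. The helper names below and the cluster ID in the usage comment are hypothetical, and the client is passed in as a parameter so the function does not depend on how credentials or regions are configured.

```python
def get_master_dns(emr_client, cluster_id):
    """Look up the master node's public DNS name for an EMR cluster.

    emr_client is expected to behave like boto3.client("emr").
    Returns None if the cluster has no public DNS name yet, for
    example while it is still starting.
    """
    response = emr_client.describe_cluster(ClusterId=cluster_id)
    return response["Cluster"].get("MasterPublicDnsName")


def livy_url(master_dns, port=8998):
    """Build the Livy base URL for a given master DNS name."""
    return "http://{}:{}".format(master_dns, port)


# Usage sketch (requires boto3 and AWS credentials; the region and
# cluster ID are placeholders):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# dns = get_master_dns(emr, "j-XXXXXXXXXXXXX")
# print(livy_url(dns))
```

With a lookup like this, the Livy host is resolved per cluster at run time rather than read from a fixed Airflow connection.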