So I've done some further debugging on this and discovered that it appears to be "Compression Type: None" that causes the error. Setting the Compression Type to Snappy lets the job complete and write its output.
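For reference, this is roughly where that setting lands in the Glue Studio-generated script. The following is only a sketch assuming the getSink pattern the visual editor generates; the S3 path, database and table names are placeholders, and KinesisStream_node1 stands for the per-batch DynamicFrame the generated script builds inside processBatch:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# S3 sink for the streaming job, with the Snappy workaround applied
S3bucket_node3 = glueContext.getSink(
    path="s3://my-output-bucket/streaming-data/",  # placeholder path
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    compression="snappy",  # "Compression Type: Snappy" -> the job completes
    enableUpdateCatalog=True,
    transformation_ctx="S3bucket_node3",
)
S3bucket_node3.setCatalogInfo(
    catalogDatabase="my_database",  # placeholder
    catalogTableName="my_table",    # placeholder
)
S3bucket_node3.setFormat("glueparquet")
S3bucket_node3.writeFrame(KinesisStream_node1)  # KinesisStream_node1: per-batch DynamicFrame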
I've also tried changing the GlueVersion to 2.0, which gives a slightly better error message when using Compression Type: None:
StreamingQueryException: 'An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 2381, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 191, in call
    raise e
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 188, in call
    self.func(DataFrame(jdf, self.sql_ctx), batch_id)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 632, in batch_function_with_persist
    batch_function(data_frame, batchId)
  File "/tmp/ETLJob.py", line 50, in processBatch
    S3bucket_node3.writeFrame(KinesisStream_node1)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/data_sink.py", line 31, in writeFrame
    return DynamicFrame(self._jsink.pyWriteDynamicFrame(dynamic_frame._jdf, callsite(), info), dynamic_frame.glue_ctx, dynamic_fra
Thank you for reporting the Glue streaming issue. We have notified the relevant team and a fix is in progress. You are absolutely right; for now, the workaround is to set the Compression Type to Snappy.
I have followed the instructions in the Stream ETL with Glue lab here: https://catalog.us-east-1.prod.workshops.aws/workshops/976050cc-0606-4b23-b49f-ca7b8ac4b153/en-US/300/330-streaming-lab and completed everything up to [Create and trigger the Glue Streaming Job]. However, when I run the Glue ETL job I get:
StreamingQueryException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
No other error details are given. I've recreated the streams, jobs, etc. many times and still run into this issue. I've also tried changing the Data Catalog, which makes no difference.
If I set the S3 Data target properties FORMAT to JSON rather than Parquet, the job runs successfully and the files are generated in the S3 bucket. As soon as it is set back to Parquet, this exception is raised and no files are created.
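If it helps to reproduce, the only difference between the two runs is the format set on the S3 target node. Assuming the generated getSink pattern, the toggle comes down to something like this (a sketch, not the exact generated code):

# Works: JSON target, files land in the S3 bucket
S3bucket_node3.setFormat("json")

# Fails: Parquet target raises the StreamingQueryException above and no files are written
S3bucket_node3.setFormat("glueparquet")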
When looking at the error logs in CloudWatch I see this:
This seems to be the offending line: py4j.protocol.Py4JJavaError: An error occurred while calling o148.pyWriteDynamicFrame. : java.lang.NullPointerException
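For context on where that call is made: in the generated streaming script the sink write happens once per micro-batch inside processBatch, which glueContext.forEachBatch drives. Below is a rough sketch of that wiring; the window size, checkpoint path, and the dataframe_KinesisStream_node1 name are illustrative, and the sink setup from the earlier sketch is assumed to be in scope:

from awsglue.dynamicframe import DynamicFrame

def processBatch(data_frame, batchId):
    if data_frame.count() > 0:
        KinesisStream_node1 = DynamicFrame.fromDF(data_frame, glueContext, "from_data_frame")
        # writeFrame() is what reaches pyWriteDynamicFrame, where the
        # NullPointerException above is thrown for the Parquet target
        S3bucket_node3.writeFrame(KinesisStream_node1)

glueContext.forEachBatch(
    frame=dataframe_KinesisStream_node1,  # streaming DataFrame read from the Kinesis catalog table
    batch_function=processBatch,
    options={
        "windowSize": "100 seconds",                                 # illustrative
        "checkpointLocation": "s3://my-output-bucket/checkpoint/",   # illustrative
    },
)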