pkasson opened this issue 4 years ago
I had the same problem. Changing the runtime to `Spark 2.4, Python 2 (Glue Version 1.0)` seemed to fix it.
I'm experiencing this same issue, and I've set the job to use `Spark 2.4, Python 2 (Glue Version 1.0)`. If there are any suggestions/tips on debugging this issue, that would also be greatly appreciated.
I've noticed this only happens on the first run, when creating the databases. After encountering the error I can run the job again and it succeeds. Well... sort of. It doesn't actually process any of the logs from the "raw" table, so doesn't actually convert anything to parquet. And, without any errors, I'm not sure where it's failing.
@ryanrf-ac Hey there, thanks for chiming in - based on the error message from the original report, it looks like no data got converted in the initial job.
```
INFO:athena_glue_service_logs.converter:No data returned, skipping conversion.
```
Did you see that error message as well? Can you provide some info on how the job was configured — specifically, the job arguments? Thanks!
Hi @dacort. Thanks for getting back to me. I do see that message:
```
INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_optimized
null_fields []
INFO:athena_glue_service_logs.converter:No data returned, skipping conversion.
INFO:athena_glue_service_logs.job:Initial run with source NullPartitioner, adding all partitions from S3.
```
As for the arguments (I've removed the bucket name, but it uses the same bucket as the source location and converted target):
```
--converted_database_name aws_service_logs
--raw_database_name aws_service_logs
--raw_table_name s3_access_raw
--s3_converted_target s3://<bucket name>/converted/s3_access
--converted_table_name s3_access_optimized
--s3_source_location s3://<bucket name>/service=s3/
```
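For local sanity-checking, the arguments above can be mimicked with a small `argparse` stand-in (the real Glue job reads them via `awsglue.utils.getResolvedOptions`; `my-bucket` below is a placeholder, not the reporter's actual bucket):

```python
import argparse

# Stand-in for awsglue.utils.getResolvedOptions, just to sanity-check the
# job parameters locally. "my-bucket" is a placeholder value.
parser = argparse.ArgumentParser()
for name in (
    "converted_database_name", "raw_database_name", "raw_table_name",
    "s3_converted_target", "converted_table_name", "s3_source_location",
):
    parser.add_argument("--" + name, required=True)

args = parser.parse_args([
    "--converted_database_name", "aws_service_logs",
    "--raw_database_name", "aws_service_logs",
    "--raw_table_name", "s3_access_raw",
    "--s3_converted_target", "s3://my-bucket/converted/s3_access",
    "--converted_table_name", "s3_access_optimized",
    "--s3_source_location", "s3://my-bucket/service=s3/",
])

# The raw table is defined over s3_source_location, so the converter only
# sees objects under that exact prefix.
print(args.s3_source_location)
```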
And here are some samples from the S3 access logs:
```
943d72306eb63a3592b82a2d25e70ce30b2bc75f63f8c23d572210847033be4e <bucket name> [11/Mar/2021:03:55:55 +0000] 76.78.94.88 arn:aws:iam::<AWS Account>:user/service-prod-common BPHSD3PPFKM071DR REST.GET.OBJECT ActivePerl/MSWin32-x64/20210220T005354Z/ActivePerl-5.28.1.0000-MSWin32-x64-719c477d.zip "GET /ActivePerl/MSWin32-x64/20210220T005354Z/ActivePerl-5.28.1.0000-MSWin32-x64-719c477d.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAQ5FYQM547I2EFPRW%2F20210311%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210311T035555Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 200 - 46616371 46616371 11885 69 "-" "Go-http-client/1.1" - HZWnnqunXit1g72dvo9WEL/xLoa80jhFfaAo1P9SzDLXglPOERELpdLAB4nWeTn+2+PcNy9OT24= SigV4 ECDHE-RSA-AES128-GCM-SHA256 QueryString <bucket name>.s3.amazonaws.com TLSv1.2
943d72306eb63a3592b82a2d25e70ce30b2bc75f63f8c23d572210847033be4e <bucket name> [19/Mar/2021:21:59:55 +0000] 129.6.14.8 arn:aws:iam::<AWS Account>:user/service-prod-common BHA83P3NAZTRFXB3 REST.GET.OBJECT ActiveTcl/MSWin32-x64/20201021T182615Z/ActiveTcl-8.6.10.0000-MSWin32-x64-90f8b13d.exe "GET /ActiveTcl/MSWin32-x64/20201021T182615Z/ActiveTcl-8.6.10.0000-MSWin32-x64-90f8b13d.exe?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAQ5FYQM547I2EFPRW%2F20210319%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210319T215952Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 200 - 16341112 16341112 274 137 "https://platform.activestate.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" - FP2TCU89nN2//Mos3f7wVVEh3+8VqxQCLZyZRwnVKb9UU2Sc+fFLLIiIM39VumSgMQn57IZuMNo= SigV4 ECDHE-RSA-AES128-GCM-SHA256 QueryString <bucket name>.s3.amazonaws.com TLSv1.2
943d72306eb63a3592b82a2d25e70ce30b2bc75f63f8c23d572210847033be4e <bucket name> [29/Mar/2021:04:06:22 +0000] 218.92.226.69 - XJCC355ZR6BQJ64H REST.GET.OBJECT ActivePerl/MSWin32-x64/20200528T195345Z/ActivePerl-5.28.1.0000-MSWin32-x64-b462fde1.exe "GET /ActivePerl/MSWin32-x64/20200528T195345Z/ActivePerl-5.28.1.0000-MSWin32-x64-b462fde1.exe?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJAFTYUXEZJ3HWLEQ/20200702/us-east-1/s3/aws4_request&X-Amz-Date=20200702T021019Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 403 AccessDenied 369 - 3 - "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36" - xWvyO2/BG1xpUiuhZqy4D44LX5bfuQF8ettxE1kl4jHz64h4vNmvsaCUlabKVOK414nwpT6DrZY= SigV4 ECDHE-RSA-AES128-SHA QueryString <bucket name>.s3.amazonaws.com TLSv1.2
943d72306eb63a3592b82a2d25e70ce30b2bc75f63f8c23d572210847033be4e <bucket name> [12/Apr/2021:15:52:47 +0000] 35.227.107.181 arn:aws:iam::<AWS Account>:user/service-prod-common RGHSVM64E6SF8H3B REST.GET.OBJECT ActivePerl/MSWin32-x64/20210220T005354Z/ActivePerl-5.28.1.0000-MSWin32-x64-719c477d.zip "GET /ActivePerl/MSWin32-x64/20210220T005354Z/ActivePerl-5.28.1.0000-MSWin32-x64-719c477d.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAQ5FYQM547I2EFPRW%2F20210412%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210412T155247Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 200 - 46616371 46616371 467 30 "-" "Go-http-client/1.1" - djl+Ff4NgHP9yPhx28Tm2pwGy7EL1r6+W0rPAsbNhTJhqCWy1Gd1zzoWZUPgA1J78wn4j2ZtM64= SigV4 ECDHE-RSA-AES128-GCM-SHA256 QueryString <bucket name>.s3.amazonaws.com TLSv1.2
```
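The S3 server access log format is space-delimited, with the timestamp in brackets and the request line, referrer, and user agent in quotes. A rough tokenizer — field names follow the S3 access-log documentation, and this is an illustration, not the converter's actual parser — run against a shortened line modeled on the samples above:

```python
import re

# Shortened, anonymized line modeled on the samples in this thread.
LINE = (
    '943d72306e... my-bucket [11/Mar/2021:03:55:55 +0000] 76.78.94.88 '
    'arn:aws:iam::123456789012:user/svc BPHSD3PPFKM071DR REST.GET.OBJECT '
    'some/key.zip "GET /some/key.zip HTTP/1.1" 200 - 46616371 46616371 '
    '11885 69 "-" "Go-http-client/1.1"'
)

# Tokenize: "quoted" strings and [bracketed] timestamps are single fields;
# everything else splits on whitespace.
tokens = re.findall(r'\[[^\]]*\]|"[^"]*"|\S+', LINE)

# Map the first ten tokens onto their documented field names.
record = dict(zip(
    ["bucket_owner", "bucket", "time", "remote_ip", "requester",
     "request_id", "operation", "key", "request_uri", "http_status"],
    tokens,
))
print(record["time"], record["http_status"])
```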
@ryanrf-ac Hm, ok, everything looks fine there. That message indicates it's not seeing any data in the source table. A couple of follow-up questions:

- Are the access logs directly under `s3://<bucket name>/service=s3/`? Or is there another prefix that they might be in?
- Can you query the `s3_access_raw` table in Athena? Do any results get returned?

@dacort The access logs are using another prefix: `s3://<bucket name>/service=s3/bucket=<bucket accessed>/06-05-2021-alongstring`
But I did test with specifying a more specific path (e.g. `s3://<bucket name>/service=s3/bucket=<bucket accessed>`) as the `--s3_source_location`.
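If the converter's date partitioner scans object keys for a `yyyy/mm/dd`-style layout — as the `get_first_hivecompatible_date_in_prefix` frame in the traceback later in this thread suggests — then keys under `bucket=<name>/06-05-2021-alongstring` would never match. A hypothetical illustration (not the project's actual code):

```python
import re

# Hypothetical scanner: look for a /yyyy/mm/dd/ segment in each key, as the
# traceback's get_first_hivecompatible_date_in_prefix name suggests the
# DatePartitioner does. The real implementation may differ.
HIVE_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

def first_date(keys):
    for key in keys:
        m = HIVE_DATE.search(key)
        if m:
            return m.groups()
    return None  # nothing matched; callers that index this result will crash

# Keys shaped like the reporter's actual layout never match:
print(first_date(["service=s3/bucket=my-bucket/06-05-2021-alongstring"]))  # None

# Whereas a hive-style date layout would:
print(first_date(["service=s3/2021/05/06/log.gz"]))  # ('2021', '05', '06')
```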
Yes, I can query the `s3_access_raw` table in Athena. But the table isn't like other tables I've created using crawlers. I've tried using the DynamicFrame `from_catalog` method from a Glue ETL job on that table and can't read any values from it.
Using latest from this repo ...
The databases get created, but just after the `s3_access_optimized` table is created, we receive this error:
```
glue_service_logs/utils.py", line 128, in _get_first_key_in_prefix
TypeError: 'NoneType' object has no attribute '__getitem__'
End of LogType:stdout
```
Full trace:
```
20/07/21 09:47:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/21 09:47:30 INFO MemoryStore: MemoryStore cleared
20/07/21 09:47:30 INFO BlockManager: BlockManager stopped
20/07/21 09:47:30 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/21 09:47:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/21 09:47:30 INFO SparkContext: Successfully stopped SparkContext
20/07/21 09:47:30 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User application exited with status 1)
20/07/21 09:47:30 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
20/07/21 09:47:30 INFO ApplicationMaster: Deleting staging directory hdfs://ip-172-32-99-27.ec2.internal:8020/user/root/.sparkStaging/application_1595322435097_0001
20/07/21 09:47:30 INFO ShutdownHookManager: Shutdown hook called
20/07/21 09:47:30 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/root/appcache/application_1595322435097_0001/spark-1ca0f7e1-0bc4-47d6-ab91-12879d18055f
20/07/21 09:47:30 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/root/appcache/application_1595322435097_0001/spark-1ca0f7e1-0bc4-47d6-ab91-12879d18055f/pyspark-11265b43-46c7-46a1-a043-8f182446bdf0
End of LogType:stderr

LogType:stdout Log Upload Time:Tue Jul 21 09:47:32 +0000 2020 LogLength:2765
Log Contents:
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.job:Initial run, scanning S3 for partitions.
INFO:athena_glue_service_logs.catalog_manager:Creating database aws_service_logs
INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_raw
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_optimized
null_fields []
INFO:athena_glue_service_logs.converter:No data returned, skipping conversion.
INFO:athena_glue_service_logs.job:Initial run with source NullPartitioner, adding all partitions from S3.
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): rodin-glue.s3.amazonaws.com
```

Parsing the YARN logs, the error message is:

```
TypeError: 'NoneType' object has no attribute '__getitem__'
Traceback (most recent call last):
  File "script_2020-07-21-09-45-54.py", line 4, in <module>
    job_run.convert_and_partition()
  File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/job.py", line 156, in convert_and_partition
  File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/job.py", line 137, in add_new_optimized_partitions
  File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/catalog_manager.py", line 89, in get_and_create_partitions
  File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/partitioners/date_partitioner.py", line 34, in build_partitions_from_s3
  File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/utils.py", line 80, in get_first_hivecompatible_date_in_prefix
  File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena
```
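One plausible reading of the `TypeError` above (the actual `utils.py` may differ): `list_objects_v2` omits the `Contents` key entirely when a prefix contains no objects, so code that indexes `response.get("Contents")[0]` gets `None[0]`, which on Python 2 raises exactly `'NoneType' object has no attribute '__getitem__'`. A guarded sketch using stubbed responses:

```python
# Hypothetical reconstruction of _get_first_key_in_prefix's failure mode.
# The buggy shape would be:
#     return response.get("Contents")[0]["Key"]
# which raises TypeError when "Contents" is absent (empty prefix).

def first_key_in_prefix(response):
    """Return the first object key in a list_objects_v2-style response,
    or None if the prefix is empty, instead of crashing."""
    contents = response.get("Contents") or []
    return contents[0]["Key"] if contents else None

# Stubbed responses in the shape boto3's list_objects_v2 returns:
empty = {"KeyCount": 0}  # no "Contents" key at all when nothing matches
populated = {"Contents": [{"Key": "service=s3/2021/05/06/log.gz"}]}

print(first_key_in_prefix(empty))      # None, rather than a TypeError
print(first_key_in_prefix(populated))
```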