awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Unable to connect to AWS from Jupyter Notebook after spinning up glue_libs_1.0.0_image_01 image #136

Open purnima1612 opened 2 years ago

purnima1612 commented 2 years ago
version: "2"

services:
 awsglueservice:
  image: awsglue
  build: ./spark
  container_name: awsglue-container
  hostname: localhost
  ports:
   - "8888:8888"
   - "4040:4040"
  env_file: 
   - ./env/spark-env-vars.env
  command: "/home/jupyter/jupyter_start.sh"
  volumes:
   - ~/.aws:/root/.aws:ro
   - ../app/jupyter_workspace:/home/jupyter/jupyter_default_dir
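
A minimal sanity check from inside the container (a hypothetical diagnostic, not part of the compose setup above; it assumes the notebook kernel runs as root, so the ~/.aws mount lands at /root/.aws):

import os
from pathlib import Path

# Is the read-only ~/.aws mount visible to this process?
print(Path("/root/.aws/credentials").exists())
# Which AWS_* variables (if any) does the kernel's environment carry?
print(sorted(k for k in os.environ if k.startswith("AWS_")))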

Once Docker spins up, I run an interactive bash shell inside the container and execute the following command: aws sts get-caller-identity | cat

I can see the AWS user details. But when I create a Jupyter notebook and try to make the same call using boto3, I get the following error:

An error was encountered: An error occurred (InvalidClientTokenId) when calling the GetCallerIdentity operation: The security token included in the request is invalid.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidClientTokenId) when calling the GetCallerIdentity operation: The security token included in the request is invalid.

After that I tried to access an S3 bucket from the Docker bash shell. It was able to connect, but again from the notebook I get a 403 Access Denied error.
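
To narrow down where the two environments diverge, a quick inspection of which credentials boto3 actually resolves inside the notebook (a diagnostic sketch using standard boto3/botocore calls) could look like:

import boto3

session = boto3.session.Session()
creds = session.get_credentials()
print(creds.method)  # which provider won: 'env', 'shared-credentials-file', 'iam-role', ...
frozen = creds.get_frozen_credentials()
print(frozen.access_key[:4], bool(frozen.token))  # temporary credentials must carry a token

If the notebook kernel resolves a different source than the bash shell, or resolves a key with no session token while the host profile uses temporary credentials, that would explain the InvalidClientTokenId and 403 errors.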

purnima1612 commented 2 years ago

Please let me know if I need to provide more details

jfacorro commented 2 years ago

I'm having the same issue and can't seem to find the root cause.

I've checked that the credentials (including the token) are indeed correct by running the following from within the same notebook session:

import boto3

# Succeeds, so boto3 in this kernel resolves valid credentials for DynamoDB
ddb = boto3.client('dynamodb')
ddb.list_tables()

The above runs successfully. I suspect something is missing when creating the GlueContext or the SparkContext; this is how it is currently done in the script where I'm seeing this issue:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue import DynamicFrame

table_name = "foo"
glue_context = GlueContext(SparkContext.getOrCreate())
# The read is defined lazily; the actual DynamoDB calls only run when an action (show) executes
source: DynamicFrame = glue_context.create_dynamic_frame.from_options(
    "dynamodb",
    connection_options={
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": "0.5",
        "dynamodb.splits": "5",
    },
)
source.show(3)

Evaluating the above results in the following exception:

An error was encountered:
An error occurred while calling o64.show.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.RuntimeException: Could not lookup table foo in DynamoDB.
    at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:143)
    at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
    at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:58)
    at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:152)
    at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
    at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
    at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
    at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:135)
    at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:604)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: The security token included in the request is invalid. (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: UnrecognizedClientException; Request ID: C4PIEL4FMAGJ4G3QNI5GUE6KB7VV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:108)
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
    at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:132)
    ... 22 more
Caused by: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: The security token included in the request is invalid. (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: UnrecognizedClientException; Request ID: C4PIEL4FMAGJ4G3QNI5GUE6KB7VV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6243)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6210)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:2256)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:2220)
    at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:136)
    at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:133)
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
    ... 23 more
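
One thing the trace does show: the failing DescribeTable call comes from the AWS Java SDK inside the Spark executor JVM, not from boto3, so the JVM resolves credentials through its own default chain (environment variables, the JVM user's ~/.aws files, instance profile) independently of whatever the Python side sees. A hedged workaround sketch, assuming the EMR DynamoDB connector honors the dynamodb.awsAccessKeyId / dynamodb.awsSecretAccessKey Hadoop configuration keys (these carry no session token, so this only helps with long-lived credentials):

import os
from pyspark.context import SparkContext

# Hand the connector static keys explicitly via the Hadoop configuration,
# bypassing the JVM's default credential lookup.
sc = SparkContext.getOrCreate()
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("dynamodb.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
hadoop_conf.set("dynamodb.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])

For temporary credentials, exporting AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN into the container environment (e.g. via the compose env_file) before the JVM starts may be the simpler route, since the Java SDK's default chain checks those variables.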
vidyadharmestry commented 1 year ago

Hi, I'm facing a similar issue. Did you find any resolution for this?