DyfanJones / RAthena

Connect R to Athena using Boto3 SDK (DBI Interface)
https://dyfanjones.github.io/RAthena/

Token always expired #92

Closed ozhyrenkov closed 4 years ago

ozhyrenkov commented 4 years ago

Hi! I am experiencing an issue with the library when trying to assume a role during the connection process. For the same user, pure boto3 with the same credentials works fine.

0. Working example with boto3:

session = boto3.Session(profile_name='my-profile')
credentials = session.get_credentials()
current_credentials = credentials.get_frozen_credentials()
client = boto3.client('athena',
                       aws_access_key_id = current_credentials.access_key,
                       aws_secret_access_key = current_credentials.secret_key,
                       aws_session_token = current_credentials.token)

And three non-working examples:

1. Passing the access parameters within the connection call:

> con <- dbConnect(RAthena::athena(), 
+                  aws_access_key_id = credentials$AccessKeyId,
+                  aws_secret_access_key = credentials$SecretAccessKey,
+                  s3_staging_dir='s3://my-bucket/tmp/',
+                  region_name='eu-west-1')
Error: An error occurred (UnrecognizedClientException) when calling the GetDatabases operation: The security token included in the request is invalid.

2. Via the profile file:

> con <- dbConnect(RAthena::athena(), 
+                  profile_name = "default",
+                  role_arn = "my-role-arn",
+                  s3_staging_dir='s3://my-bucket/tmp/',
+                  region_name='eu-west-1')
Error in names(y$TableType) <- y$Name : 
  attempt to set an attribute on NULL

3. Without an assumed role:

> con <- dbConnect(RAthena::athena(),
+                  s3_staging_dir='s3://my-bucket/tmp/',
+                  region_name='eu-west-1')
Error in names(y$TableType) <- y$Name : 
  attempt to set an attribute on NULL

Could you please help? Am I doing something wrong?

DyfanJones commented 4 years ago

Hi, sorry for the issue you're having. Are you working in the RStudio IDE? My initial hunch is that there is a bug in passing the role to the schema-hierarchy builder for RStudio's Connections tab.

ozhyrenkov commented 4 years ago

Thank you so much for such a quick response! Yes, the setup is like this:

By the way, it was working fine ~1–2 weeks ago, and I'm not sure what has changed since then.

DyfanJones commented 4 years ago

Are you working with the CRAN version or the GitHub dev version? Are you able to provide a traceback?

ozhyrenkov commented 4 years ago

Sure, here is the traceback:

> traceback()
15: FUN(X[[i]], ...)
14: lapply(X = X, FUN = FUN, ...)
13: sapply(x, function(y) {
        names(y$TableType) = y$Name
        y$TableType
    })
12: FUN(X[[i]], ...)
11: lapply(output, function(x) sapply(x, function(y) {
        names(y$TableType) = y$Name
        y$TableType
    }))
10: unlist(lapply(output, function(x) sapply(x, function(y) {
        names(y$TableType) = y$Name
        y$TableType
    })))
9: AthenaTableTypes(connection)
8: AthenaListObjectTypes.default(connection)
7: AthenaListObjectTypes(connection)
6: listObjectTypes()
5: observer$connectionOpened(displayName = computeDisplayName(connection), 
       type = "Athena", host = computeHostName(connection), icon = icon, 
       connectCode = paste(c("library(DBI)", "con <- dbConnect(RAthena::athena())"), 
           collapse = "\n"), disconnect = function() {
           dbDisconnect(connection)
       }, listObjectTypes = function() {
           AthenaListObjectTypes(connection)
       }, listObjects = function(...) {
           AthenaListObjects(connection, ...)
       }, listColumns = function(...) {
           AthenaListColumns(connection, ...)
       }, previewObject = function(rowLimit, ...) {
           AthenaPreviewObject(connection, rowLimit, ...)
       }, actions = AthenaConnectionActions(connection), connectionObject = connection)
4: on_connection_opened(con)
3: .local(drv, ...)
2: dbConnect(RAthena::athena(), profile_name = "default", role_arn = "role-arn", 
       s3_staging_dir = "s3://my-bucket/tmp/", 
       region_name = "eu-west-1")
1: dbConnect(RAthena::athena(), profile_name = "default", role_arn = "role-arn", 
       s3_staging_dir = "s3://my-bucket/temp/", 
       region_name = "eu-west-1")
ozhyrenkov commented 4 years ago

I have tested both versions, from CRAN and from this repo.

DyfanJones commented 4 years ago

Interesting. It looks like the client is created successfully, but the issue occurs when Glue is called to fetch your Glue Catalogue.

Have the permissions for that role changed with respect to Glue?

The connection code roughly goes like this (Python version):

import boto3

sts = boto3.Session(profile_name = profile_name, region_name = "region_name").client("sts")
response = sts.assume_role(RoleArn = "role-arn",
                           RoleSessionName = "RAthena",
                           DurationSeconds = 3600)

creds = response["Credentials"]

session = boto3.Session(aws_access_key_id = creds["AccessKeyId"],
                        aws_secret_access_key = creds["SecretAccessKey"],
                        aws_session_token = creds["SessionToken"],
                        region_name = "region_name")

glue = session.client("glue")

database = [x["Name"] for x in glue.get_databases()["DatabaseList"]]

output = [glue.get_tables(DatabaseName = x)["TableList"] for x in database]

# This is missing some of the tables, but a little sleepy to perfect it :P
[[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]

Are you able to run this from Python?

ozhyrenkov commented 4 years ago

It runs smoothly until the last line:

[[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-15-e7298c858d45> in <module>
----> 1 [[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]

<ipython-input-15-e7298c858d45> in <listcomp>(.0)
----> 1 [[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]

IndexError: list index out of range

The output object has everything needed:

print(len(output))
print(output[0][0]['TableType'])
print(output[0][0]['Name'])
->100
->VIRTUAL_VIEW
->access_logs_ireland_costs
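Aside: the `IndexError` above is consistent with at least one database whose `get_tables` call returned an empty `TableList`, since `y[0]` fails on an empty list. A minimal sketch with made-up data (the real `output` comes from Glue):

```python
# Made-up stand-in for `output`: one database with a table, one with none.
output = [
    [{"TableType": "VIRTUAL_VIEW", "Name": "access_logs"}],
    [],  # empty TableList -> y[0] would raise IndexError
]

# Iterating over the tables themselves avoids indexing into empty lists.
pairs = [[t.get("TableType"), t.get("Name")] for tables in output for t in tables]
print(pairs)  # [['VIRTUAL_VIEW', 'access_logs']]
```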
DyfanJones commented 4 years ago

Oh, very interesting. Can you check for me whether there is an object in output that doesn't have a TableType? And if so, can you share what it is? (You can remove any sensitive info from it.)

The error happens when R assigns NULL to a name. To prevent that, I need to understand which object is missing its table type and what it should be instead :)

DyfanJones commented 4 years ago

@AZhyrenkov I have created an initial workaround that returns "" when TableType is NULL. Can you check it out:

remotes::install_github("dyfanjones/rathena", ref = "rstudio-view")
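For reference, the workaround presumably boils down to substituting an empty string whenever `TableType` is absent; in Python terms (the function name here is illustrative, not from the package):

```python
def table_type_or_default(table, default=""):
    # Fall back to the default when Glue returns no TableType for a table.
    return table.get("TableType", default)

print(table_type_or_default({"Name": "t1", "TableType": "EXTERNAL_TABLE"}))  # EXTERNAL_TABLE
print(table_type_or_default({"Name": "problem_table"}))  # empty string
```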

DyfanJones commented 4 years ago

If you can find out which object in output (from the Python code) is missing it, I will be able to assign the correct metadata to the table instead of "".

Any help with this would be very much appreciated.

ozhyrenkov commented 4 years ago

So I tried to take a look at the missing TableTypes; here is the code:

import pandas as pd

def get_params(output):
    result = {'db'    : [],
              'names' : [],
              'types' : [] }
    for x in output:
        for y in x:
            # Append dbs
            try:
                result['db'].append(y['DatabaseName'])
            except KeyError:
                result['db'].append('')

            # Append names
            try:
                result['names'].append(y['Name'])
            except KeyError:
                result['names'].append('')
            # Append types
            try:
                result['types'].append(y['TableType'])
            except KeyError:
                result['types'].append('')
    return result

result = pd.DataFrame(get_params(output))

And it shows:

result[result['types'] == ''].shape
(122, 3)

Quite a lot of the new tables in our Glue catalog have a null TableType.
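The same tally can be sketched more compactly with `dict.get`, which supplies the `''` default without try/except (the sample row below is made up):

```python
def get_params(output):
    # Same result shape as above; dict.get handles the missing keys.
    result = {'db': [], 'names': [], 'types': []}
    for tables in output:
        for t in tables:
            result['db'].append(t.get('DatabaseName', ''))
            result['names'].append(t.get('Name', ''))
            result['types'].append(t.get('TableType', ''))
    return result

sample = [[{'DatabaseName': 'db1', 'Name': 't1'}]]  # no TableType
print(get_params(sample))  # {'db': ['db1'], 'names': ['t1'], 'types': ['']}
```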

And the metadata of a table without a TableType:

{   'CreateTime': datetime.datetime(2020, 3, 5, 13, 44, 54, tzinfo=tzlocal()),
    'DatabaseName': 'database',
    'IsRegisteredWithLakeFormation': False,
    'Name': 'problem_table',
    'Owner': 'email',
    'PartitionKeys': [   {'Name': 'platform_type', 'Type': 'string'},
                         {'Name': 'country_code', 'Type': 'string'},
                         {'Name': 'year', 'Type': 'string'},
                         {'Name': 'month', 'Type': 'string'},
                         {'Name': 'day', 'Type': 'string'}],
    'Retention': 0,
    'StorageDescriptor': {   'Columns': [   {columns} ],
                             'Compressed': False,
                             'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
                             'Location': 's3://bucket',
                             'NumberOfBuckets': 0,
                             'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
                             'SerdeInfo': {   'Name': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'},
                             'SortColumns': [],
                             'StoredAsSubDirectories': False},
    'UpdateTime': datetime.datetime(2020, 3, 5, 13, 44, 54, tzinfo=tzlocal())}
ozhyrenkov commented 4 years ago

@AZhyrenkov I have created an initial workaround that returns "" when TableType is NULL. Can you check it out:

remotes::install_github("dyfanjones/rathena", ref = "rstudio-view")

This is working!

DyfanJones commented 4 years ago

@AZhyrenkov perfect.

Edit: Sorry I didn't see your earlier message. Please ignore below

Can you please let me know the object types from the output where TableType is not provided? For example:

output[0]
 [{'Name': "can be removed",
  'DatabaseName': "can be removed",
  'Owner': 'hadoop',
  'CreateTime': datetime.datetime(2020, 3, 9, 19, 31, 44, tzinfo=tzlocal()),
  'UpdateTime': datetime.datetime(2020, 3, 9, 19, 31, 44, tzinfo=tzlocal()),
  'LastAccessTime': datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzlocal()),
  'Retention': 0,
  'StorageDescriptor': {'Columns': "can be removed",
   'Location': "can be removed",
   'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
   'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
   'Compressed': False,
   'NumberOfBuckets': -1,
   'SerdeInfo': {'SerializationLibrary': 'org.apache.hive.hcatalog.data.JsonSerDe',
    'Parameters': {'serialization.format': '1'}},
   'BucketColumns': [],
   'SortColumns': [],
   'Parameters': {},
   'SkewedInfo': {'SkewedColumnNames': [],
    'SkewedColumnValues': [],
    'SkewedColumnValueLocationMaps': {}},
   'StoredAsSubDirectories': False},
  'PartitionKeys': 'can be removed',
  'Parameters': {'EXTERNAL': 'TRUE', 'transient_lastDdlTime': '1583782304'},
  'CreatedBy': 'can be removed',
  'IsRegisteredWithLakeFormation': False}]

All my unit tests have TableType returned from boto3. I just want to make sure I am not missing anything.

DyfanJones commented 4 years ago

> So I tried to take a look at the missing TableTypes; here is the code: […]

Thanks for this. I am not sure why Glue isn't storing your TableTypes. It seems the approach I have taken is the correct one. I will merge the branch into master. If you find any other issues, please raise a ticket. I am happy to help :)

ozhyrenkov commented 4 years ago

To clarify this a bit: all the tables with this issue were produced automatically by an in-house data-streaming engine, so the issue lies at the table-creation stage.

It's interesting that boto3 allows creating a table without a TableType :)
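Indeed, `TableType` is an optional field of `TableInput` in Glue's `create_table` API, so a pipeline can register a table without it. A hedged sketch (names and location are made up; the actual AWS call is commented out):

```python
# TableInput for glue.create_table; TableType is optional, so omitting it
# produces a table like `problem_table` above. All names here are made up.
table_input = {
    "Name": "problem_table",
    "StorageDescriptor": {
        "Columns": [{"Name": "platform_type", "Type": "string"}],
        "Location": "s3://my-bucket/path/",
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    },
    # "TableType": "EXTERNAL_TABLE",  # the field the streaming engine never set
}

assert "TableType" not in table_input
# boto3.client("glue").create_table(DatabaseName="database", TableInput=table_input)
```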

Thank you so much for the quick resolution. Are you planning to merge this into master and release to CRAN?

DyfanJones commented 4 years ago

Yeah, I didn't expect boto3 to return a different syntax. To be respectful of the CRAN volunteers, I try to push dev work once a month, so the next release will be in 6 days.

If you require it to be pushed sooner, please let me know.

DyfanJones commented 4 years ago

Version 1.8.0 has been pushed to CRAN.