Closed ozhyrenkov closed 4 years ago
Hi, sorry for the issue you're having. Are you working in the RStudio IDE? My initial hunch is there is a bug in passing the role to the schema hierarchy builder for RStudio's Connections tab.
Thank you so much for such a quick response! Yes, the setup is like this:
By the way, it was working fine ~1-2 weeks ago, and I'm not sure what has changed since then.
Are you working with the CRAN version or the GitHub dev version? Are you able to provide a traceback?
Sure, here is the traceback:
> traceback()
15: FUN(X[[i]], ...)
14: lapply(X = X, FUN = FUN, ...)
13: sapply(x, function(y) {
names(y$TableType) = y$Name
y$TableType
})
12: FUN(X[[i]], ...)
11: lapply(output, function(x) sapply(x, function(y) {
names(y$TableType) = y$Name
y$TableType
}))
10: unlist(lapply(output, function(x) sapply(x, function(y) {
names(y$TableType) = y$Name
y$TableType
})))
9: AthenaTableTypes(connection)
8: AthenaListObjectTypes.default(connection)
7: AthenaListObjectTypes(connection)
6: listObjectTypes()
5: observer$connectionOpened(displayName = computeDisplayName(connection),
type = "Athena", host = computeHostName(connection), icon = icon,
connectCode = paste(c("library(DBI)", "con <- dbConnect(RAthena::athena())"),
collapse = "\n"), disconnect = function() {
dbDisconnect(connection)
}, listObjectTypes = function() {
AthenaListObjectTypes(connection)
}, listObjects = function(...) {
AthenaListObjects(connection, ...)
}, listColumns = function(...) {
AthenaListColumns(connection, ...)
}, previewObject = function(rowLimit, ...) {
AthenaPreviewObject(connection, rowLimit, ...)
}, actions = AthenaConnectionActions(connection), connectionObject = connection)
4: on_connection_opened(con)
3: .local(drv, ...)
2: dbConnect(RAthena::athena(), profile_name = "default", role_arn = "role-arn",
s3_staging_dir = "s3://my-bucket/tmp/",
region_name = "eu-west-1")
1: dbConnect(RAthena::athena(), profile_name = "default", role_arn = "role-arn",
s3_staging_dir = "s3://my-bucket/temp/",
region_name = "eu-west-1")
I have tested both versions, from CRAN and from this repo.
Interesting, it looks like the client is created successfully but the issue occurs when Glue is called to get your Glue Catalogue.
Have the permissions for that role changed with regard to Glue?
The connection code roughly goes like this (Python version):
import boto3

sts = boto3.Session(profile_name=profile_name, region_name="region_name").client("sts")
response = sts.assume_role(RoleArn="role-arn",
                           RoleSessionName="RAthena",
                           DurationSeconds=3600)
creds = response["Credentials"]
session = boto3.Session(aws_access_key_id=creds["AccessKeyId"],
                        aws_secret_access_key=creds["SecretAccessKey"],
                        aws_session_token=creds["SessionToken"],
                        region_name="region_name")
glue = session.client("glue")
database = [x["Name"] for x in glue.get_databases()["DatabaseList"]]
output = [glue.get_tables(DatabaseName=x)["TableList"] for x in database]
# This is missing some of the tables, but a little sleepy to perfect it :P
[[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]
Are you able to run this from Python?
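As an aside, that last comprehension indexes y[0], so it only inspects the first table of each database and will raise an IndexError if any table list is empty. A defensive sketch, using hypothetical sample data in place of the real Glue output, could be:

```python
# Hypothetical sample shaped like the glue.get_tables()["TableList"] results:
# one normal table, one empty database, and one table missing "TableType".
output = [
    [{"Name": "table_a", "TableType": "EXTERNAL_TABLE"}],
    [],                      # indexing y[0] here would raise IndexError
    [{"Name": "table_b"}],   # no "TableType" key
]

# Iterate over every table and default missing keys instead of indexing y[0].
pairs = [(t.get("TableType", ""), t.get("Name", ""))
         for tables in output
         for t in tables]
print(pairs)  # [('EXTERNAL_TABLE', 'table_a'), ('', 'table_b')]
```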
It runs smoothly until the last line:
[[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-15-e7298c858d45> in <module>
----> 1 [[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]
<ipython-input-15-e7298c858d45> in <listcomp>(.0)
----> 1 [[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]
IndexError: list index out of range
The output object has all the needed things:
print(len(output))
print(output[0][0]['TableType'])
print(output[0][0]['Name'])
->100
->VIRTUAL_VIEW
->access_logs_ireland_costs
Oh, very interesting. Can you check for me: is there an object in output that doesn't have a TableType? And if so, can you share what it is? (You can remove any sensitive info from it.)
The error happens when R assigns NULL to a name. To prevent that, I need to understand which object is missing its table type and what it should be instead :)
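In Python terms, the fix being described amounts to substituting an empty string before using the name as a key; a minimal sketch with hypothetical table entries:

```python
# Hypothetical entries shaped like Glue table metadata; the second one
# has no TableType, like the problem tables in this thread.
tables = [
    {"Name": "sales", "TableType": "EXTERNAL_TABLE"},
    {"Name": "problem_table"},
]

# Default a missing TableType to "" so every name still gets an entry,
# mirroring names(y$TableType) <- y$Name on the R side.
table_types = {t["Name"]: t.get("TableType", "") for t in tables}
print(table_types)  # {'sales': 'EXTERNAL_TABLE', 'problem_table': ''}
```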
@AZhyrenkov I have created an initial workaround that returns "" when TableType is NULL. Can you check it out:
remotes::install_github("dyfanjones/rathena", ref = "rstudio-view")
If you can find out which object in output (from the Python code) is affected, I will be able to assign the correct metadata to the table instead of "".
Any help with this would be very much appreciated.
So I tried to take a look at the missing TableTypes; here is the code:
import pandas as pd

def get_params(output):
result = {'db' : [],
'names' : [],
'types' : [] }
for x in output:
for y in x:
# Append dbs
try:
result['db'].append(y['DatabaseName'])
except:
result['db'].append('')
# Append names
try:
result['names'].append(y['Name'])
except:
result['names'].append('')
# Append types
try:
result['types'].append(y['TableType'])
except:
result['types'].append('')
return(result)
result = pd.DataFrame(get_params(output))
And it's showing:
result[result['types'] == ''].shape
(122, 3)
Quite a lot of the new tables in our Glue catalog are of type Null.
And the metadata of a table without a TableType:
{ 'CreateTime': datetime.datetime(2020, 3, 5, 13, 44, 54, tzinfo=tzlocal()),
'DatabaseName': 'database',
'IsRegisteredWithLakeFormation': False,
'Name': 'problem_table',
'Owner': 'email',
'PartitionKeys': [ {'Name': 'platform_type', 'Type': 'string'},
{'Name': 'country_code', 'Type': 'string'},
{'Name': 'year', 'Type': 'string'},
{'Name': 'month', 'Type': 'string'},
{'Name': 'day', 'Type': 'string'}],
'Retention': 0,
'StorageDescriptor': { 'Columns': [ {columns} ],
'Compressed': False,
'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
'Location': 's3://bucket',
'NumberOfBuckets': 0,
'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
'SerdeInfo': { 'Name': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'},
'SortColumns': [],
'StoredAsSubDirectories': False},
'UpdateTime': datetime.datetime(2020, 3, 5, 13, 44, 54, tzinfo=tzlocal())}
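The tally above can also be collected without the bare except clauses by defaulting each missing field with dict.get; a sketch over hypothetical entries shaped like the Glue output:

```python
def get_params(output):
    """Flatten nested get_tables output into parallel lists,
    defaulting any missing field to ''."""
    result = {"db": [], "names": [], "types": []}
    for tables in output:
        for t in tables:
            result["db"].append(t.get("DatabaseName", ""))
            result["names"].append(t.get("Name", ""))
            result["types"].append(t.get("TableType", ""))
    return result

# Hypothetical sample: one table with a TableType, one without.
sample = [[
    {"DatabaseName": "db1", "Name": "ok_table", "TableType": "VIRTUAL_VIEW"},
    {"DatabaseName": "db1", "Name": "problem_table"},
]]
params = get_params(sample)
print(params["types"].count(""))  # 1 table is missing its TableType
```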
This is working!
@AZhyrenkov perfect.
Edit: Sorry I didn't see your earlier message. Please ignore below
Can you please let me know the object types from the output where TableType is not provided. So for example:
output[0]
[{'Name': "can be removed",
'DatabaseName': "can be removed",
'Owner': 'hadoop',
'CreateTime': datetime.datetime(2020, 3, 9, 19, 31, 44, tzinfo=tzlocal()),
'UpdateTime': datetime.datetime(2020, 3, 9, 19, 31, 44, tzinfo=tzlocal()),
'LastAccessTime': datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzlocal()),
'Retention': 0,
'StorageDescriptor': {'Columns': "can be removed",
'Location': "can be removed",
'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
'Compressed': False,
'NumberOfBuckets': -1,
'SerdeInfo': {'SerializationLibrary': 'org.apache.hive.hcatalog.data.JsonSerDe',
'Parameters': {'serialization.format': '1'}},
'BucketColumns': [],
'SortColumns': [],
'Parameters': {},
'SkewedInfo': {'SkewedColumnNames': [],
'SkewedColumnValues': [],
'SkewedColumnValueLocationMaps': {}},
'StoredAsSubDirectories': False},
'PartitionKeys': 'can be removed',
'Parameters': {'EXTERNAL': 'TRUE', 'transient_lastDdlTime': '1583782304'},
'CreatedBy': 'can be removed',
'IsRegisteredWithLakeFormation': False}]
All my unit tests have TableType returned from boto3. I just want to make sure I am not missing anything.
Thanks for this. I am not sure why Glue isn't storing your TableTypes. It seems like the approach I have taken is the correct one. I will merge the branch onto master. If you find any other issues please raise a ticket. I am happy to help :)
To clarify this a bit: all the tables with this issue were produced automatically by an in-house data-streaming engine, so the issue is at the table-creation stage.
It's interesting that boto3 allows creating a table without a TableType :)
Thank you so much for the quick resolution. Are you planning to merge this into master and release it to CRAN?
Yeah, I didn't expect boto3 to return a different syntax. To be respectful to the CRAN volunteers I try to push dev work once a month, so the next release will be in 6 days.
If you need it pushed sooner, please let me know.
Version 1.8.0 has been pushed to CRAN.
Hi! I am experiencing an issue with the library when trying to assume a role during the connection process, but pure boto3 with the same credentials for the same user works fine.
0. Working example with boto3:
And three non-working examples:
1. Passing the access parameters within the connection:
2. Over the profile file:
3. Without an assume role:
Could you please help? Am I doing something wrong?