Open aoyh opened 2 years ago
Hi @aoyh, sounds like a good enhancement :) i will start working on an implementation tomorrow :)
Thanks a lot for your prompt reply! With that enhancement, I don't have to switch to library of other language.
Another 2 good reasons:
EndpointOverride
is usually a must for security reason in some projects.EndpointOverride
is supported in AthenaJDBCConnector
.Hi @aoyh,
I have done some initial work on this feature. Please have a go and let me know if it meets the requirements 😄
remotes::install_github("dyfanjones/rathena", ref="sdk-extra-parameters")
RAthena
should be able to support these extra parameters for boto3.Session
and client
.
Session
client
Note: endpoint_url
is used instead of endpoint_override
as boto3
uses that parameter name instead (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.client).
So you should be able to do the following:
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena",
work_group = 'mygroup',
endpoint_url = 'url.aws.com')
@DyfanJones So quick! Thanks. Will check and let you know.
Hmmm I think my initial implementation won't work as expected 🤔
I forgot each AWS service uses it's own endpoint. Currently RAthena
uses 3 services: Athena, Glue (for catalog) and S3 (for getting the data). So would we need to override each endpoint? 🤔
It looks like JDBC driver just overrides the Athena endpoint:
Default endpoint:
EndpointOverride=athena. [Region].amazonaws.com:443;
VPC endpoint:
EndpointOverride=[vpce-specific- url].athena.[Region].vpce.amazonaws.com:443;
This is a similar issue when using pythena
pandas cursor:
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor
cursor = connect(
profile_name = "default",
s3_staging_dir="s3://made-up",
endpoint_url = "https://athena.eu-west-1.amazonaws.com",
cursor_class=PandasCursor).cursor()
df = cursor.execute("select * from sampledb.elb_logs limit 10").as_pandas()
OperationalError: An error occurred () when calling the GetObject operation:
To resolve this we could give options for to user in which AWS service do they want to override for example
endpoints = list(Athena = "my.amazing.endpoint")
Possibly allowing strings that only affect Athena's endpoint 🤔
@aoyh Second attempt :P In this implementation you can override each aws service endpoint. To do so you will need to provide a named list of the services' endpoint you want to override. Also if you provide your endpoint only (as a character) then aws athena endpoint will be overridden. I think this gives alot of flexiblity to users when wanting to use custom endpoints :)
Please have a go and let me know if this meets requirement.
P.s. does the new documentation make sense? If not please let me know 😄
library(DBI)
con1 = dbConnect(
RAthena::athena(),
endpoint_override = "https://athena.eu-west-1.amazonaws.com"
)
dbGetInfo(con1)
#> $endpoint_override
#> $endpoint_override$athena
#> [1] "https://athena.eu-west-1.amazonaws.com"
#>
#>
#> $region_name
#> [1] "eu-west-1"
#>
#> $keyboard_interrupt
#> [1] TRUE
#>
#> $timezone
#> [1] "UTC"
#>
#> $expiration
#> NULL
#>
#> $kms_key
#> NULL
#>
#> $encryption_option
#> NULL
#>
#> $poll_interval
#> NULL
#>
#> $work_group
#> [1] "primary"
#>
#> $dbms.name
#> [1] "default"
#>
#> $s3_staging
#> [1] "s3://dummy"
#>
#> $profile_name
#> NULL
#>
#> $boto3
#> [1] "1.21.35"
#>
#> $RAthena
#> [1] "2.5.1.9000"
con2 = dbConnect(
RAthena::athena(),
endpoint_override = list(athena = "https://athena.eu-west-1.amazonaws.com")
)
dbGetInfo(con2)
#> $endpoint_override
#> $endpoint_override$athena
#> [1] "https://athena.eu-west-1.amazonaws.com"
#>
#>
#> $region_name
#> [1] "eu-west-1"
#>
#> $keyboard_interrupt
#> [1] TRUE
#>
#> $timezone
#> [1] "UTC"
#>
#> $expiration
#> NULL
#>
#> $kms_key
#> NULL
#>
#> $encryption_option
#> NULL
#>
#> $poll_interval
#> NULL
#>
#> $work_group
#> [1] "primary"
#>
#> $dbms.name
#> [1] "default"
#>
#> $s3_staging
#> [1] "s3://dummy"
#>
#> $profile_name
#> NULL
#>
#> $boto3
#> [1] "1.21.35"
#>
#> $RAthena
#> [1] "2.5.1.9000"
con3 = dbConnect(
RAthena::athena(),
endpoint_override = list(
Athena = "https://athena.eu-west-1.amazonaws.com",
s3 = "https://s3.eu-west-1.amazonaws.com"
)
)
dbGetInfo(con3)
#> $endpoint_override
#> $endpoint_override$athena
#> [1] "https://athena.eu-west-1.amazonaws.com"
#>
#> $endpoint_override$s3
#> [1] "https://s3.eu-west-1.amazonaws.com"
#>
#>
#> $region_name
#> [1] "eu-west-1"
#>
#> $keyboard_interrupt
#> [1] TRUE
#>
#> $timezone
#> [1] "UTC"
#>
#> $expiration
#> NULL
#>
#> $kms_key
#> NULL
#>
#> $encryption_option
#> NULL
#>
#> $poll_interval
#> NULL
#>
#> $work_group
#> [1] "primary"
#>
#> $dbms.name
#> [1] "default"
#>
#> $s3_staging
#> [1] "s3://dummy"
#>
#> $profile_name
#> NULL
#>
#> $boto3
#> [1] "1.21.35"
#>
#> $RAthena
#> [1] "2.5.1.9000"
Created on 2022-04-20 by the reprex package (v2.0.1)
@DyfanJones Thanks for the detailed explanation!
Since my endpoint looks like VPC endpoint: EndpointOverride=[vpce-specific- url].athena.[Region].vpce.amazonaws.com
that you mentioned, then I tried the method below:
con1 = dbConnect(
RAthena::athena(),
endpoint_override = "https://athena.eu-west-1.amazonaws.com"
)
dbListTables(con1)
works fine, but `dbGetQuery(con1, '.SELECT ..') encountered some error:
An error occurred (AccessDeniedException) when calling the StartQueryExecution operation: User: arn:aws:iam:: .... is not authorized to perform: athena:StartQueryExecution on resource:
Any clue?
dbListTables
calls AWS Glue service to get the Catalog quicker than AWS Athena :) as you haven't overwritten AWS Glue endpoint that should work as normal :)
It looks like your IAM role doesn't have permission to StartQueryExecution on AWS Athena using that endpoint. Can you double check your IAM roles to ensure you are able to :)
My guess is you don't have permission to run AWS Athena in the eu-west-1
region. If you change the region to the one you use then I should expect it to work :) However on the off chance it doesn't, here the AWS Athena command called by RAthena:
The S3 and Glue services should be ok as you haven't overridden their endpoints :)
Additional info:
When using the same set of credentials, S3 location, endpoint_override url in pyathenajdbc
library in Python, the query command works fine.
Is there something that I missed when using RAthena
?
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
aws_access_key_id='...',
aws_secret_access_key='...',
s3_staging_dir='s3://...',
region_name='...',
work_group = 'mygroup',
endpoint_override = 'url.aws.com')
dbGetQuery(con, '.SELECT ..')
I use R a lot more, thus don't mind spending more time figuring it out how to make RAthena work here.
Many thanks!
Sorry I might of missed somehting. Are you getting the error in R
when using this endpoint "https://athena.eu-west-1.amazonaws.com"? But it works fine in pyathenajdbc? Or are you using your vpc endpoint instead? :)
Yes, I get the error in R
when using this endpoint "https://athena.eu-west-1.amazonaws.com/". But it works fine in pyathenajdbc
.
Good chance that we can make it work~ Thanks!
Have you tried using pyathena
? pyathena
and RAthena
both use boto3
to make the connection to AWS Athena so it should be more comparable. Let me know your results :)
@aoyh did it work for pyathena? or did you get a similar error to RAthena?
@aoyh any update on this?
Hi @DyfanJones Sorry for my late reply. I spent a few days trying to get pyathena
in the same environment. Till now I was not able to do it.
Error:
pyathena.error.DatabaseError: An error occurred (AccessDeniedException) when calling the StartQueryExecution operation: User: arn:aws:iam::... is not authorized to perform: athena:StartQueryExecution on resource:
However, another python package pyathenajdbc seems to be working there.
from pyathenajdbc import connect
conn = connect(S3OutputLocation=self.S3OutputLocation,
Schema=self.Schema,
EndpointOverride=self.EndpointOverride,
AwsRegion=self.AwsRegion,
Catalog=self.Catalog
)
Note that the EndpointOverride
parameter from pyathenajdbc
works fine.
Thanks!
Thanks for doing that investigation @aoyh. From the looks of it pyathena
and RAthena
suffer the same issue regarding IAM role not having enough permissions for you, which is good in one way. RAthena
is passing the endpoint url correctly to boto3
.
As pyathenajdbc
is using the jdbc
driver. That is managed by the simba jdbc driver. Without going into the jdbc driver and finding out what is different I am not a 100% sure why boto3 isn't giving you the same results 🤔 It suggesting that your IAM role isn't suited which is confusing.
No worries @DyfanJones . I will use a workaround by starting with pyathenajdbc
python package via reticulate
R package and then do the rest of data wrangling in R.
Thank you all the same!
@aoyh there is an R package that uses the jdbc driver AWR.Athena
. that might be able to help you out, instead of having to use pyathenajdbc to do the same thing :)
Thanks @DyfanJones for timely tip. I was just pondering on the idea that a similar R package may exist. And you just lighten it!
Will try AWR.Athena
out!
I will do an initial release of the endpoint_override feature. In the meantime i will have to open up the athena jdbc to see what is the difference. From my understanding i am passing the endpoint correctly however I could be mistaken.
@aoyh I have done some tweaking to the implementation of the endpoint_override
, plus I have added some unit tests to check if the endpoint
is getting correctly passed to boto3.client
. Please have a go and see if I have cracked it :)
# Enable repository from dyfanjones
options(repos = c(
ropensci = 'https://dyfanjones.r-universe.dev',
CRAN = "https://cloud.r-project.org"))
# Download and install RAthena in R
install.packages('RAthena')
Many thanks for the testing you have done for me so far
General notes for completeness ....
It looks like awswrangler
passes endpoint_url
to boto3.client
, similar to the method RAthena
and pyathena
do. Which is reassuring that RAthena
is doing it correctly.
Here is the method awswrangle
uses when creating the boto3.clients
.
@apply_configs
def client(
service_name: str,
session: Optional[boto3.Session] = None,
botocore_config: Optional[botocore.config.Config] = None,
verify: Optional[Union[str, bool]] = None,
) -> boto3.client:
"""Create a valid boto3.client."""
endpoint_url: Optional[str] = _get_endpoint_url(service_name=service_name)
return ensure_session(session=session).client(
service_name=service_name,
endpoint_url=endpoint_url,
use_ssl=True,
verify=verify,
config=default_botocore_config() if botocore_config is None else botocore_config,
)
Note: The endpoint needs to be in the samge region i.e. region: eu-west-1, and endpoint: https://athena.eu-west-1.amazonaws.com/ . If the region doesn't match the region the you are at risk of the following aws error: Credential should be scoped to a valid region
.
RAthena v-2.6.0 has been released on the cran. Let me know if your still having the endpoint_override issue.
Create an attribute named
endpoint_override
in dbConnectExample: