DyfanJones / RAthena

Connect R to Athena using Boto3 SDK (DBI Interface)
https://dyfanjones.github.io/RAthena/
Other
35 stars 6 forks source link

Can we have `endpoint_override` as an attribute if not yet? #169

Open aoyh opened 2 years ago

aoyh commented 2 years ago

Create an attribute named endpoint_override in dbConnect

Example:

library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
                 profile_name = "rathena",
                 work_group = 'mygroup',
                 endpoint_override = 'url.aws.com')
DyfanJones commented 2 years ago

Hi @aoyh, sounds like a good enhancement :) i will start working on an implementation tomorrow :)

aoyh commented 2 years ago

Thanks a lot for your prompt reply! With that enhancement, I don't have to switch to library of other language.

Another 2 good reasons:

DyfanJones commented 2 years ago

Hi @aoyh,

I have done some initial work on this feature. Please have a go and let me know if it meets the requirements 😄

remotes::install_github("dyfanjones/rathena", ref="sdk-extra-parameters")

RAthena should be able to support these extra parameters for boto3.Session and client.

Session

client

Note: endpoint_url is used instead of endpoint_override as boto3 uses that parameter name instead (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.client).

So you should be able to do the following:

library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
               profile_name = "rathena",
               work_group = 'mygroup',
               endpoint_url = 'url.aws.com')
aoyh commented 2 years ago

@DyfanJones So quick! Thanks. Will check and let you know.

DyfanJones commented 2 years ago

Hmmm I think my initial implementation won't work as expected 🤔

I forgot each AWS service uses it's own endpoint. Currently RAthena uses 3 services: Athena, Glue (for catalog) and S3 (for getting the data). So would we need to override each endpoint? 🤔

It looks like JDBC driver just overrides the Athena endpoint:

Default endpoint:
EndpointOverride=athena. [Region].amazonaws.com:443;

VPC endpoint:
EndpointOverride=[vpce-specific- url].athena.[Region].vpce.amazonaws.com:443;

This is a similar issue when using pythena pandas cursor:

from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

cursor = connect(
    profile_name = "default",
    s3_staging_dir="s3://made-up",
    endpoint_url = "https://athena.eu-west-1.amazonaws.com",
    cursor_class=PandasCursor).cursor()

df = cursor.execute("select * from sampledb.elb_logs limit 10").as_pandas()
OperationalError: An error occurred () when calling the GetObject operation: 

To resolve this we could give options for to user in which AWS service do they want to override for example

endpoints = list(Athena = "my.amazing.endpoint")

Possibly allowing strings that only affect Athena's endpoint 🤔

DyfanJones commented 2 years ago

@aoyh Second attempt :P In this implementation you can override each aws service endpoint. To do so you will need to provide a named list of the services' endpoint you want to override. Also if you provide your endpoint only (as a character) then aws athena endpoint will be overridden. I think this gives alot of flexiblity to users when wanting to use custom endpoints :)

Please have a go and let me know if this meets requirement.

P.s. does the new documentation make sense? If not please let me know 😄

library(DBI)

con1 = dbConnect(
  RAthena::athena(), 
  endpoint_override = "https://athena.eu-west-1.amazonaws.com"
)

dbGetInfo(con1)
#> $endpoint_override
#> $endpoint_override$athena
#> [1] "https://athena.eu-west-1.amazonaws.com"
#> 
#> 
#> $region_name
#> [1] "eu-west-1"
#> 
#> $keyboard_interrupt
#> [1] TRUE
#> 
#> $timezone
#> [1] "UTC"
#> 
#> $expiration
#> NULL
#> 
#> $kms_key
#> NULL
#> 
#> $encryption_option
#> NULL
#> 
#> $poll_interval
#> NULL
#> 
#> $work_group
#> [1] "primary"
#> 
#> $dbms.name
#> [1] "default"
#> 
#> $s3_staging
#> [1] "s3://dummy"
#> 
#> $profile_name
#> NULL
#> 
#> $boto3
#> [1] "1.21.35"
#> 
#> $RAthena
#> [1] "2.5.1.9000"

con2 = dbConnect(
  RAthena::athena(), 
  endpoint_override = list(athena = "https://athena.eu-west-1.amazonaws.com")
)

dbGetInfo(con2)
#> $endpoint_override
#> $endpoint_override$athena
#> [1] "https://athena.eu-west-1.amazonaws.com"
#> 
#> 
#> $region_name
#> [1] "eu-west-1"
#> 
#> $keyboard_interrupt
#> [1] TRUE
#> 
#> $timezone
#> [1] "UTC"
#> 
#> $expiration
#> NULL
#> 
#> $kms_key
#> NULL
#> 
#> $encryption_option
#> NULL
#> 
#> $poll_interval
#> NULL
#> 
#> $work_group
#> [1] "primary"
#> 
#> $dbms.name
#> [1] "default"
#> 
#> $s3_staging
#> [1] "s3://dummy"
#> 
#> $profile_name
#> NULL
#> 
#> $boto3
#> [1] "1.21.35"
#> 
#> $RAthena
#> [1] "2.5.1.9000"

con3 = dbConnect(
  RAthena::athena(), 
  endpoint_override = list(
    Athena = "https://athena.eu-west-1.amazonaws.com",
    s3 = "https://s3.eu-west-1.amazonaws.com"
  )
)

dbGetInfo(con3)
#> $endpoint_override
#> $endpoint_override$athena
#> [1] "https://athena.eu-west-1.amazonaws.com"
#> 
#> $endpoint_override$s3
#> [1] "https://s3.eu-west-1.amazonaws.com"
#> 
#> 
#> $region_name
#> [1] "eu-west-1"
#> 
#> $keyboard_interrupt
#> [1] TRUE
#> 
#> $timezone
#> [1] "UTC"
#> 
#> $expiration
#> NULL
#> 
#> $kms_key
#> NULL
#> 
#> $encryption_option
#> NULL
#> 
#> $poll_interval
#> NULL
#> 
#> $work_group
#> [1] "primary"
#> 
#> $dbms.name
#> [1] "default"
#> 
#> $s3_staging
#> [1] "s3://dummy"
#> 
#> $profile_name
#> NULL
#> 
#> $boto3
#> [1] "1.21.35"
#> 
#> $RAthena
#> [1] "2.5.1.9000"

Created on 2022-04-20 by the reprex package (v2.0.1)

aoyh commented 2 years ago

@DyfanJones Thanks for the detailed explanation!

Since my endpoint looks like VPC endpoint: EndpointOverride=[vpce-specific- url].athena.[Region].vpce.amazonaws.com that you mentioned, then I tried the method below:

con1 = dbConnect(
  RAthena::athena(), 
  endpoint_override = "https://athena.eu-west-1.amazonaws.com"
)

dbListTables(con1) works fine, but `dbGetQuery(con1, '.SELECT ..') encountered some error:

An error occurred (AccessDeniedException) when calling the StartQueryExecution operation: User: arn:aws:iam:: .... is not authorized to perform: athena:StartQueryExecution on resource:

Any clue?

DyfanJones commented 2 years ago

dbListTables calls AWS Glue service to get the Catalog quicker than AWS Athena :) as you haven't overwritten AWS Glue endpoint that should work as normal :)

It looks like your IAM role doesn't have permission to StartQueryExecution on AWS Athena using that endpoint. Can you double check your IAM roles to ensure you are able to :)

My guess is you don't have permission to run AWS Athena in the eu-west-1 region. If you change the region to the one you use then I should expect it to work :) However on the off chance it doesn't, here the AWS Athena command called by RAthena:

The S3 and Glue services should be ok as you haven't overridden their endpoints :)

aoyh commented 2 years ago

Additional info: When using the same set of credentials, S3 location, endpoint_override url in pyathenajdbc library in Python, the query command works fine. Is there something that I missed when using RAthena?

library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
               aws_access_key_id='...', 
              aws_secret_access_key='...',
              s3_staging_dir='s3://...',
              region_name='...',
              work_group = 'mygroup',
              endpoint_override = 'url.aws.com')

dbGetQuery(con, '.SELECT ..') 

I use R a lot more, thus don't mind spending more time figuring it out how to make RAthena work here.

Many thanks!

DyfanJones commented 2 years ago

Sorry I might of missed somehting. Are you getting the error in R when using this endpoint "https://athena.eu-west-1.amazonaws.com"? But it works fine in pyathenajdbc? Or are you using your vpc endpoint instead? :)

aoyh commented 2 years ago

Yes, I get the error in R when using this endpoint "https://athena.eu-west-1.amazonaws.com/". But it works fine in pyathenajdbc. Good chance that we can make it work~ Thanks!

DyfanJones commented 2 years ago

Have you tried using pyathena? pyathena and RAthena both use boto3 to make the connection to AWS Athena so it should be more comparable. Let me know your results :)

DyfanJones commented 2 years ago

@aoyh did it work for pyathena? or did you get a similar error to RAthena?

DyfanJones commented 2 years ago

@aoyh any update on this?

aoyh commented 2 years ago

Hi @DyfanJones Sorry for my late reply. I spent a few days trying to get pyathena in the same environment. Till now I was not able to do it. Error:

pyathena.error.DatabaseError: An error occurred (AccessDeniedException) when calling the StartQueryExecution operation: User: arn:aws:iam::...  is not authorized to perform: athena:StartQueryExecution on resource:

However, another python package pyathenajdbc seems to be working there.

from pyathenajdbc import connect
conn = connect(S3OutputLocation=self.S3OutputLocation,
                       Schema=self.Schema,
                       EndpointOverride=self.EndpointOverride,
                       AwsRegion=self.AwsRegion,
                       Catalog=self.Catalog
                       )

Note that the EndpointOverride parameter from pyathenajdbc works fine.

Thanks!

DyfanJones commented 2 years ago

Thanks for doing that investigation @aoyh. From the looks of it pyathena and RAthena suffer the same issue regarding IAM role not having enough permissions for you, which is good in one way. RAthena is passing the endpoint url correctly to boto3.

As pyathenajdbc is using the jdbc driver. That is managed by the simba jdbc driver. Without going into the jdbc driver and finding out what is different I am not a 100% sure why boto3 isn't giving you the same results 🤔 It suggesting that your IAM role isn't suited which is confusing.

aoyh commented 2 years ago

No worries @DyfanJones . I will use a workaround by starting with pyathenajdbc python package via reticulate R package and then do the rest of data wrangling in R.

Thank you all the same!

DyfanJones commented 2 years ago

@aoyh there is an R package that uses the jdbc driver AWR.Athena. that might be able to help you out, instead of having to use pyathenajdbc to do the same thing :)

aoyh commented 2 years ago

Thanks @DyfanJones for timely tip. I was just pondering on the idea that a similar R package may exist. And you just lighten it! Will try AWR.Athena out!

DyfanJones commented 2 years ago

I will do an initial release of the endpoint_override feature. In the meantime i will have to open up the athena jdbc to see what is the difference. From my understanding i am passing the endpoint correctly however I could be mistaken.

DyfanJones commented 2 years ago

@aoyh I have done some tweaking to the implementation of the endpoint_override, plus I have added some unit tests to check if the endpoint is getting correctly passed to boto3.client. Please have a go and see if I have cracked it :)

# Enable repository from dyfanjones
options(repos = c(
    ropensci = 'https://dyfanjones.r-universe.dev',
    CRAN = "https://cloud.r-project.org"))

# Download and install RAthena in R
install.packages('RAthena')

Many thanks for the testing you have done for me so far

DyfanJones commented 2 years ago

General notes for completeness ....

It looks like awswrangler passes endpoint_url to boto3.client, similar to the method RAthena and pyathena do. Which is reassuring that RAthena is doing it correctly.

Here is the method awswrangle uses when creating the boto3.clients.

@apply_configs
def client(
    service_name: str,
    session: Optional[boto3.Session] = None,
    botocore_config: Optional[botocore.config.Config] = None,
    verify: Optional[Union[str, bool]] = None,
) -> boto3.client:
    """Create a valid boto3.client."""
    endpoint_url: Optional[str] = _get_endpoint_url(service_name=service_name)
    return ensure_session(session=session).client(
        service_name=service_name,
        endpoint_url=endpoint_url,
        use_ssl=True,
        verify=verify,
        config=default_botocore_config() if botocore_config is None else botocore_config,
    )

https://github.com/awslabs/aws-data-wrangler/blob/ef0f83ac3a2e85b0279f88259c4443520f6f3a6f/awswrangler/_utils.py#L105-L120

Note: The endpoint needs to be in the samge region i.e. region: eu-west-1, and endpoint: https://athena.eu-west-1.amazonaws.com/ . If the region doesn't match the region the you are at risk of the following aws error: Credential should be scoped to a valid region.

DyfanJones commented 2 years ago

RAthena v-2.6.0 has been released on the cran. Let me know if your still having the endpoint_override issue.