DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/

dbConnect and noctua::athena() only pulls in default profile #90

Closed: mraess closed this issue 4 years ago

mraess commented 4 years ago

Issue Description

noctua::athena() does connect to a database, BUT only to the database under the default profile in the AWS .config/.credentials files.

When I run this

con <- dbConnect(noctua::athena(),
                 profile_name = "default",
                 s3_staging_dir = "s3://aws-athena-query-results-for-default-profile-us-east-1") 

or

con <- dbConnect(noctua::athena(),
                 s3_staging_dir = "s3://aws-athena-query-results-for-default-profile-us-east-1") 

It does connect to the correct AWS database under the default profile in the .config/.credentials files, and I'm able to run queries against it.

Output from dbGetInfo(con):

> dbGetInfo(con)
$profile_name
[1] "default"

$s3_staging
[1] "s3://aws-athena-query-results-for-default-profile-us-east-1"

$dbms.name
[1] "default"

$work_group
[1] "primary"

$poll_interval
NULL

$encryption_option
NULL

$kms_key
NULL

$expiration
NULL

$region_name
[1] "us-east-1"

$paws
[1] "0.1.7"

$noctua
[1] "1.6.0"

However, when I run this,


con_2 <- dbConnect(noctua::athena(),
                   profile_name = "other-profile-that-is-not-default",
                   s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/", 
                   region = "us-east-1")

Output from dbGetInfo(con_2):

$profile_name
[1] "other-profile-that-is-not-default"

$s3_staging
[1] "s3://aws-athena-query-results-for-other-profile-us-east-1"

$dbms.name
[1] "default"

$work_group
[1] "primary"

$poll_interval
NULL

$encryption_option
NULL

$kms_key
NULL

$expiration
NULL

$region_name
[1] "us-east-1"

$paws
[1] "0.1.7"

$noctua
[1] "1.6.0"

then it still connects, but again to the default database. How can I successfully connect with a non-default profile? It looks like dbGetInfo(con_2) has the correct information, but for whatever reason it does not connect to the correct database, and I see the exact same tables as for con.

This is what the .config file looks like

[default]
region = us-east-1
output = text

[profile other-profile-that-is-not-default]
region = us-east-1
saml.session_duration = 43200
saml.idp_arn = arn:aws:iam::XXXX:saml-provider/g-suite
saml.role_arn = arn:aws:iam::XXXXXXX:role/other-profile_that-is-not-default

and this is what the .credentials file looks like

[default]
aws_access_key_id = XXXX
aws_secret_access_key = XXXXXXX

[other-profile-that-is-not-default]
aws_access_key_id = XXXXX
aws_secret_access_key = XXXX
aws_session_token = XXXXX
aws_security_token = XXXXXX

Any help would be greatly appreciated, especially since I have no idea how to set up the .config/.credentials files differently.

DyfanJones commented 4 years ago

Hi @mraess ,

This behaviour is correct. profile_name doesn't affect the schema/database you are connecting to. The profile_name basically passes your credentials safely to the SDK so you can make the connection. So, in effect, you will be connecting as a different user with each profile_name, depending on the credentials.

What I think you want to do is change the default schema when you make the connection. To do this you need to do the following:

library(DBI)
library(noctua)

# changing profile name but same schema
con1 <- dbConnect(athena())

con2 <- dbConnect(athena(), 
                  profile_name = "demo_profile")

# changing schema/database
con3 <- dbConnect(athena(),
                  schema_name = "temp")

con4 <- dbConnect(athena(), 
                  profile_name = "demo_profile",
                  schema_name = "temp")

dbGetInfo(con1)
# $profile_name
# [1] "default"
# 
# $s3_staging
# [1] "s3://mybucket/"
# 
# $dbms.name
# [1] "default"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $paws
# [1] "0.1.7"
# 
# $noctua
# [1] "1.6.0.9000"

dbGetInfo(con2)
# $profile_name
# [1] "demo_profile"
# 
# $s3_staging
# [1] "s3://mybucket/"
# 
# $dbms.name
# [1] "default"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $paws
# [1] "0.1.7"
# 
# $noctua
# [1] "1.6.0.9000"

dbGetInfo(con3)
# $profile_name
# [1] "default"
# 
# $s3_staging
# [1] "s3://mybucket/"
# 
# $dbms.name
# [1] "temp"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $paws
# [1] "0.1.7"
# 
# $noctua
# [1] "1.6.0.9000"

dbGetInfo(con4)
# $profile_name
# [1] "demo_profile"
# 
# $s3_staging
# [1] "s3://mybucket/"
# 
# $dbms.name
# [1] "temp"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $paws
# [1] "0.1.7"
# 
# $noctua
# [1] "1.6.0.9000"

I hope this makes sense :)

DyfanJones commented 4 years ago

When you make a connection and have multiple schemas in your Athena instance, you can still query them. For example:

library(DBI)
library(noctua)

# changing profile name but same schema
con1 <- dbConnect(athena())

dbGetQuery(con1, "select * from temp.iris_parquet")
# Info: (Data scanned: 666 Bytes)
# petal_length petal_width time_stamp
# 1:          1.4         0.2   20200421
# 2:          1.4         0.2   20200421
# 3:          1.3         0.2   20200421
# 4:          1.5         0.2   20200421
# 5:          1.4         0.2   20200421
# ---                                    
#   146:          5.2         2.3   20200421
# 147:          5.0         1.9   20200421
# 148:          5.2         2.0   20200421
# 149:          5.4         2.3   20200421
# 150:          5.1         1.8   20200421

The benefit of setting schema_name in the connection shows up when you use dplyr, for example:

library(dplyr)
con2 <- dbConnect(athena(),
                  schema_name = "temp")

tbl(con2, "iris_parquet")
# Info: (Data scanned: 666 Bytes)
# Source:   table<iris_parquet> [?? x 3]
# Database: Athena 0.1.7 [default@eu-west-1/temp]
# petal_length petal_width time_stamp
# <dbl>       <dbl> <chr>     
#   1          1.4         0.2 20200421  
# 2          1.4         0.2 20200421  
# 3          1.3         0.2 20200421  
# 4          1.5         0.2 20200421  
# 5          1.4         0.2 20200421  
# 6          1.7         0.4 20200421  
# 7          1.4         0.3 20200421  
# 8          1.5         0.2 20200421  
# 9          1.4         0.2 20200421  
# 10          1.5         0.1 20200421  
# … with more rows

However, this can still be achieved with dbplyr::in_schema():

tbl(con1, dbplyr::in_schema("temp","iris_parquet"))
# Info: (Data scanned: 666 Bytes)
# Source:   table<temp.iris_parquet> [?? x 3]
# Database: Athena 0.1.7 [default@eu-west-1/default]
# petal_length petal_width time_stamp
# <dbl>       <dbl> <chr>     
#   1          1.4         0.2 20200421  
# 2          1.4         0.2 20200421  
# 3          1.3         0.2 20200421  
# 4          1.5         0.2 20200421  
# 5          1.4         0.2 20200421  
# 6          1.7         0.4 20200421  
# 7          1.4         0.3 20200421  
# 8          1.5         0.2 20200421  
# 9          1.4         0.2 20200421  
# 10          1.5         0.1 20200421  
# … with more rows

I hope this has helped with your initial query. I will leave this open just in case I misunderstood your initial question.

mraess commented 4 years ago

Hey, thanks for the quick answer - I'll certainly give this a shot. One question though: wouldn't the different S3 staging directories determine which database it connects to? And how does it know which database is the standard one I'm connecting to right now?

I guess what I'm asking is: IF I have all my tables in AWS Athena, but some are under dev-a (which the default credentials are tied to) and others under a second schema/database, prod-a (which the non-default profile is tied to), would my problem be solved by the schema setting you mentioned? I'm asking because of the above, since I was under the impression that the staging directory is tied to the specific database, i.e. dev-a or prod-a.

Let me know if this makes sense :)

DyfanJones commented 4 years ago

s3_staging_dir is the S3 bucket that AWS Athena outputs its results into. What noctua does is get those results and return them to R. By default, noctua cleans up after each query to keep your S3 buckets tidy. However, you can turn that off by activating the caching feature: AWS Athena Query Caching.
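
For reference, a minimal sketch of turning the cache on (based on the query-caching vignette; assumes noctua_options() with a cache_size argument as documented there):

library(DBI)
library(noctua)

con <- dbConnect(athena())

# cache the last 10 queries: noctua reuses their results from
# s3_staging_dir instead of deleting and re-running them
noctua_options(cache_size = 10)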

By default, schema_name = "default"; this is simply hard-coded. So by changing schema_name to the correct database, it should work. In the backend, noctua passes your connection's database to the Athena API call functions. This is what I am guessing is the problem you are running into.

I am not sure how to get the schema_name from your profile using the AWS SDKs. If you know how, I am happy to receive a pull request, or you can share the code in this thread and I can add it to the package :)

mraess commented 4 years ago

Haha, sounds good - I just might have to hard-code the schema_name. To clarify, this should work even if dev-a and prod-a are different databases with their own schemas/tables, correct? E.g. for ingest.table-a in dev-a, would ingest be the schema name?

DyfanJones commented 4 years ago

It should be able to query across schemas, for example:

library(DBI)
library(noctua)

con <- dbConnect(athena(),
                  schema_name = "default")

# querying default schema
dbGetQuery(con, "select * from iris")
# Info: (Data scanned: 860 Bytes)
# sepal_length sepal_width petal_length petal_width   species
# 1:          5.1         3.5          1.4         0.2    setosa
# 2:          4.9         3.0          1.4         0.2    setosa
# 3:          4.7         3.2          1.3         0.2    setosa
# 4:          4.6         3.1          1.5         0.2    setosa
# 5:          5.0         3.6          1.4         0.2    setosa
# ---                                                            
# 146:          6.7         3.0          5.2         2.3 virginica
# 147:          6.3         2.5          5.0         1.9 virginica
# 148:          6.5         3.0          5.2         2.0 virginica
# 149:          6.2         3.4          5.4         2.3 virginica
# 150:          5.9         3.0          5.1         1.8 virginica

# querying temp schema
dbGetQuery(con, "select * from temp.iris_parquet")
# Info: (Data scanned: 666 Bytes)
# petal_length petal_width time_stamp
# 1:          1.4         0.2   20200421
# 2:          1.4         0.2   20200421
# 3:          1.3         0.2   20200421
# 4:          1.5         0.2   20200421
# 5:          1.4         0.2   20200421
# ---                                    
# 146:          5.2         2.3   20200421
# 147:          5.0         1.9   20200421
# 148:          5.2         2.0   20200421
# 149:          5.4         2.3   20200421
# 150:          5.1         1.8   20200421

mraess commented 4 years ago

Thanks for the quick replies!! I'll check this out for sure. Thanks also for keeping the issue open for now!

mraess commented 4 years ago

Hey, so I implemented your suggestions - unfortunately to no avail. I have been able to narrow the problem down some more though.

A little more context: both dev-a and prod-a are in two completely different AWS accounts.

When I replace the key/secret for dev-a in the default profile in .credentials with a key/secret pair for prod-a, it does connect to prod-a, even though everything in the actual R code is left the same, e.g.

con_2 <- dbConnect(noctua::athena(),
                   profile_name = "other-profile-that-is-not-default",
                   s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/", 
                   region = "us-east-1")

pointing to a non-default profile.

This leads me to believe that, for some reason, only the default profile is being picked up, because what changed wasn't the R code but the underlying credentials in the default profile, while the function was still pointing to a non-default profile.

Another thing I tried was to put SAML credentials into the default profile in the .credentials file to see if that would work.

[default]
aws_access_key_id = XXXXX
aws_secret_access_key = XXXX
aws_session_token = XXXXX
aws_security_token = XXXXXX

In the .config file I put

[default]
region = us-east-1
saml.session_duration = 43200
saml.idp_arn = arn:aws:iam::XXXX:saml-provider/g-suite
saml.role_arn = arn:aws:iam::XXXXXXX:role/other-profile_that-is-not-default

This however results in an error:

Error: UnrecognizedClientException (HTTP 400). The security token included in the request is invalid.

Anyway, these might be two different problems altogether, but it seems like the function is only picking up the default profile from the .config file.

Again, any insights and help are greatly appreciated! I'm really hoping to figure this out :)

DyfanJones commented 4 years ago

Will have a little look tomorrow :)

mraess commented 4 years ago

Awesome, thanks!!

DyfanJones commented 4 years ago

I am just a little confused about what is going on, sorry. Do you mind running through it with me again so I have it straight in my head :)

Please correct me if I have misunderstood.

In your initial comment you were able to make the connection to both AWS accounts, but not to the right databases: you got connection objects back from dbConnect() in both cases and could view the metadata with dbGetInfo().

With your initial .credentials/.config files, did you try the following?

con_2 <- dbConnect(noctua::athena(),
                   profile_name = "other-profile-that-is-not-default",
                   schema_name = "prod-a",
                   s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/")

mraess commented 4 years ago

Hey, thanks for following up. I'm happy to provide some more details. Yes, I tried the code at the bottom, but that did not work.

For context, assume that dev-a and prod-a are completely different AWS accounts (which they are), like from two different customers. Under each of these accounts there are several schemas, each with one or more tables.

However, I tried to narrow the use case down even more to make potential debugging easier :)

I wrote a short test script, which I then ran from the terminal and which prints out a list of tables under the connection. This allowed me to narrow down the problems further:

library(noctua)
library(DBI)
con_3 <- dbConnect(noctua::athena(),
                   profile_name = "other-profile-that-is-not-default",
                   s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/",
                   region = "us-east-1")

dbListTables(con_3)

Observations:

1) In the .credentials file, I removed the default profile. When you run the AWS CLI tools with the other-profile-that-is-not-default profile active, everything works fine, i.e. in the terminal it lists all the right tables, etc. This shows that the credentials in the profile are valid and work. However, when running the test script above, it says it can't find any credentials, while the AWS CLI finds them fine.

2) IF there is no AWS environment variable, it does not pick up a named profile even when one is passed as a parameter to the function. This happens even when a static key/secret pair is put into the .credentials file for other-profile-that-is-not-default:

Error in f() : No credentials provided
Calls: dbListTables ... sign_with_body -> get_credentials -> call_with_args -> f
Execution halted

3) Even when the AWS profile is passed in via the environment, whether you pass profile_name or not, you get the following error:

Error: UnrecognizedClientException (HTTP 400). The security token included in the request is invalid.
Execution halted

This seems to indicate 1) that profile_name is definitely being ignored and the connection only works with a static key/secret under the [default] profile, and 2) that STS credentials, which have a session token and a security token, are not working at all, even when put in the default profile.

In conclusion, this test hopefully proved 1) that the profile_name parameter is being ignored, but that the package can talk to different profiles/accounts when their static key/secret is put under [default]; and 2) that STS credentials don't work at all, whether they are in [default] or not.
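
For anyone following along, this is the kind of check I mean - a diagnostic sketch for confirming which identity each profile actually resolves to (it uses paws' documented per-service credentials config; the profile names are the ones from my files above):

library(paws)

# one STS client per profile, to compare the resolved identities
sts_default <- sts(config = list(credentials = list(profile = "default")))
sts_other   <- sts(config = list(credentials = list(profile = "other-profile-that-is-not-default")))

# if both calls return the same Account/Arn, the named profile is being ignored
sts_default$get_caller_identity()
sts_other$get_caller_identity()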

Please let me know if this makes sense. I'm happy to provide further insights/clarification.

Thanks again for your awesome responsiveness here!!! Happy Friday!

DyfanJones commented 4 years ago

Ah ok cool, I just had a little look, and the profile not being passed correctly looks like a known issue with the paws SDK that has been fixed in the dev version of paws.common:

https://github.com/paws-r/paws/pull/283

remotes::install_github("paws-r/paws/paws.common")

From a quick test this fixes part of your issue, but I don't know if it will fix the second part.

For the STS, what code are you using? I might need to add more explicit support for SAML access methods.

pmarkert commented 4 years ago

For the STS use-case, we are using a SAML provider in conjunction with sts assume-role-with-saml in order to establish the temporary STS session. STS generates temporary AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SECURITY_TOKEN values that can be stored in the ~/.aws/credentials file and then used with standard SDK tools to call AWS services. Once the session is established, it works like any other STS session, so there is no SAML involved beyond that point. This is similar to what you would get if you used normal keys and called aws sts get-session-token to generate temporary session keys for short-term expiration or a multi-factor upgrade.
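
For illustration, the analogous get-session-token flow in R with paws (a sketch only; sts()$get_session_token() is the standard paws STS call, and the returned fields map onto the ~/.aws/credentials keys):

library(paws)

# generate temporary session credentials from long-term keys
creds <- sts()$get_session_token(DurationSeconds = 3600)$Credentials

# these map onto the ~/.aws/credentials entries:
#   aws_access_key_id     = creds$AccessKeyId
#   aws_secret_access_key = creds$SecretAccessKey
#   aws_session_token     = creds$SessionToken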

DyfanJones commented 4 years ago

@pmarkert Thanks for the extra information. I will need to create a branch that gives more explicit support for SAML connection methods, similar to assume_role(). I will need to do a little extra reading so I know how to set up my own AWS account with a SAML connection method, so I can add some unit tests around this :)

mraess commented 4 years ago

@DyfanJones thanks a lot!!

DyfanJones commented 4 years ago

@mraess I will create an initial pull request; is it possible for you to test it? I currently don't know how to set up my AWS account for SAML connections. This would be a massive help :)

pmarkert commented 4 years ago

Actually, it looks like this second issue is also related to that underlying driver library. https://github.com/paws-r/paws/issues/280 describes exactly the problem. From your code, you are already passing the aws_session_token through any time the access_key is referenced, which would be all you need to do to ensure proper support, so I don't know that you need to set up any SAML or role integrations.
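
As a possible stopgap while the driver fix lands, the temporary credentials can also be handed straight to the connection - a sketch using noctua's documented aws_access_key_id / aws_secret_access_key / aws_session_token arguments (placeholder values):

library(DBI)
library(noctua)

# bypass profile resolution entirely by passing the STS credentials directly
con <- dbConnect(athena(),
                 aws_access_key_id     = "XXXXX",
                 aws_secret_access_key = "XXXX",
                 aws_session_token     = "XXXXX",
                 s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/")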

DyfanJones commented 4 years ago

@pmarkert Thanks for finding that out. @mraess are you able to try this out with the dev version of paws.common?

remotes::install_github("paws-r/paws/paws.common")

mraess commented 4 years ago

@DyfanJones yes, I'll get on that later today or early next week and let you know how it goes :)

DyfanJones commented 4 years ago

I am getting the package ready for the next release, which comes with retry functionality, a dbFetch that can return data in chunks greater than 1000 when n != -1, and a simple wrapper for converting backend S3 files. If you come across any more issues, I would like to capture them in this release so that it all goes live on CRAN :)
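
For context, a minimal DBI sketch of what the chunked fetch will enable (illustrative only; the n > 1000 behaviour is the upcoming release feature described above):

library(DBI)
library(noctua)

con <- dbConnect(athena())
res <- dbSendQuery(con, "select * from temp.iris_parquet")

# with the next release, n can exceed Athena's 1000-row page limit
chunk <- dbFetch(res, n = 5000)
dbClearResult(res)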

mraess commented 4 years ago

@DyfanJones Sounds great!

pmarkert commented 4 years ago

I was able to test with the updated paws library and it looks like it is working for me - both parts, reading from a named profile and picking up the STS session. Thanks!

DyfanJones commented 4 years ago

Thanks @pmarkert for testing. I will close this ticket. If this issue crops up again, please re-open it, or open another ticket :)