gabfl / bigquery_fdw

BigQuery Foreign Data Wrapper for PostgreSQL
MIT License

Assigned service roles in GCP instead of service account's key? #18

Closed Clausewitz45 closed 3 years ago

Clausewitz45 commented 4 years ago

Hi,

I would like to implement your solution (which is great!), but because of CIS Benchmark GCP 1.4 (no user-managed service account keys allowed, so I basically cannot create keys for service accounts), I can only assign the required roles to the Compute Engine virtual machine. Is this FDW able to pick up the roles from the VM itself without supplying any key to it?

Thank you for your response in advance.

gabfl commented 4 years ago

Hi,

How do other programs pick up the roles from the VM? Would you have an example in Python to show me?

thanks

Clausewitz45 commented 4 years ago

Hi,

many thanks for the response. I talked with one of our developers, and he came back to me with these two links:

Since I'm only an engineer, I cannot judge if this is enough to start, but I will try to add more examples later.

Thanks

preston-hf commented 4 years ago

Came here interested in this. When deployed in GCP, the metadata server is interrogated for authentication tokens. Instead of explicitly calling the auth function here, you should check to see if the user specified a key path first, and if they did not, just create the client like this:

self.client = bigquery.Client()

This will use Application Default Credentials when running on a developer's machine, and the metadata server when running in GCP. The key option is still useful for scenarios other than this, like running in AWS or something self-hosted. The implicit setup also checks the GOOGLE_APPLICATION_CREDENTIALS environment variable for a key path.
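The fallback described above could be sketched roughly like this. The `make_client` helper name and the injectable factory arguments are hypothetical (the factories exist only so the dispatch can be exercised without GCP credentials); `bigquery.Client()` and `bigquery.Client.from_service_account_json()` are the real google-cloud-bigquery constructors.

```python
from typing import Callable, Optional


def make_client(fdw_key: Optional[str] = None,
                key_factory: Optional[Callable] = None,
                default_factory: Optional[Callable] = None):
    """Build a BigQuery client, preferring an explicit key file.

    When no key path is given, fall back to Application Default
    Credentials: the GOOGLE_APPLICATION_CREDENTIALS env var, the
    gcloud ADC file, or the GCE metadata server.
    """
    if key_factory is None or default_factory is None:
        # Real constructors; requires `pip install google-cloud-bigquery`
        from google.cloud import bigquery
        key_factory = key_factory or bigquery.Client.from_service_account_json
        default_factory = default_factory or bigquery.Client

    if fdw_key:
        # Explicit service account key takes precedence
        return key_factory(fdw_key)
    # No key given: let the library resolve credentials implicitly
    return default_factory()
```

With injected stand-in factories, the key path wins when present and the implicit client is used otherwise.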

gabfl commented 4 years ago

@preston-hf would you be able to create a pull request with this change? I do not use GCP so it's hard for me to test this use case.

thanks

shadiramadan commented 3 years ago

I was just about to start working on this, but then I dove into the code and realized @gabfl that you support different credentials on a per-client basis, which makes this way more complicated.

While it is cool, I really don't think you should allow or support that flow, because it runs counter to how Google aims to manage credentials.

The GOOGLE_APPLICATION_CREDENTIALS environment variable that @preston-hf brings up, which points to a service account key, is really the way to go. If you want to connect to BigQuery tables across multiple projects, then you need to give the service account you create BQ access in those other projects.

You can read more about credentials here: https://cloud.google.com/docs/authentication/getting-started

Another reason to update this is that for my use case I'm running a pg container in a GKE cluster and we are using a feature called workload identity to manage access: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#overview

This only works if you rely on google's default authentication mechanism (i.e. not explicitly providing a service account key).

How would you feel about such changes?

Essentially- we would get rid of the fdw_key option and users would need to make sure the GOOGLE_APPLICATION_CREDENTIALS variable is set and pointing to a service account access key file readable by postgres.

gabfl commented 3 years ago

@shadiramadan

Apologies for the delay in responding.

I tend to agree that people using different values for fdw_key is a very edge-case scenario and might not exist in practice. I think the suggested change makes sense.

How would you proceed to have Postgres/Python be able to read that env variable?

Would you be willing to work on a pull request?

preston-hf commented 3 years ago

Usage of bigquery_fdw should be possible without creating a key at all. Keys have a bunch of security downsides and should be avoided where possible. I don't think you need an environment variable either; I believe the client automatically looks for the relevant variables.

gabfl commented 3 years ago

Indeed it looks like the client does look for env variables if nothing is set on the application side: https://cloud.google.com/bigquery/docs/authentication/getting-started

I don't use GCP on a daily basis but is the env variable available by default on some GCP instances?

preston-hf commented 3 years ago

It doesn't actually use environment variables on GCP. The way it works is each VM/cloud function/etc has a link-local "metadata server" available at metadata.google.internal which resolves to a link-local address. The client libraries make a request to this service to generate auth tokens and discover other info about the runtime environment.
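The token endpoint described above can be sketched as follows. The URL and the required Metadata-Flavor header are the documented metadata-server interface; the request itself only succeeds from inside GCP, so this sketch just constructs it (the `build_token_request` helper name is mine).

```python
import urllib.request

# Documented GCE metadata-server endpoint for the default service
# account's access token; metadata.google.internal is link-local
# and only reachable from inside GCP.
TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
             "instance/service-accounts/default/token")


def build_token_request() -> urllib.request.Request:
    # The Metadata-Flavor header is mandatory; without it the
    # metadata server rejects the request.
    return urllib.request.Request(
        TOKEN_URL, headers={"Metadata-Flavor": "Google"})
```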

In addition, for developers, you typically set up Application Default Credentials using the SDK, and the client libraries check a "well-known" path for the ADC token. Basically, for most environments, calling bigquery.Client() should just work; if you need to override the defaults to work in a non-GCP prod environment, you can set the environment variables, which should be picked up. Unfortunately I'm not using this library at the moment so can't test it out.
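The resolution order being described (env var, then the gcloud ADC file, then the metadata server) can be illustrated with a simplified pure-Python sketch. The real logic lives inside google.auth's default credential discovery; the boolean flags here are stand-ins for the filesystem and network checks it performs, and the function name is hypothetical.

```python
import os


def adc_source(env=None, adc_file_exists=False, on_gce=False):
    """Illustrate the order in which Application Default
    Credentials are resolved (simplified)."""
    env = os.environ if env is None else env

    # 1. Explicit key file pointed to by the env var wins
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "env var key file"
    # 2. The "well-known" ADC file written by `gcloud auth
    #    application-default login`
    if adc_file_exists:
        return "gcloud ADC file"
    # 3. On GCP, the link-local metadata server
    if on_gce:
        return "metadata server"
    raise RuntimeError("no credentials found")
```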

gabfl commented 3 years ago

I wrote a draft/untested PR here https://github.com/gabfl/bigquery_fdw/pull/20, I will stage it and test it when I get a chance.

If any of you would like to contribute to the testing/finalizing, it would be really helpful.

gabfl commented 3 years ago

@preston-hf @Clausewitz45 @shadiramadan The PR is ready to be merged, I will finalize some testing over the weekend and merge it. Please let me know if you have some feedback on the changes in the meantime.

Changes to the authentication process are documented here: https://github.com/gabfl/bigquery_fdw/tree/native-creds#authentication

gabfl commented 3 years ago

Merged, and version 1.8 has been released.