airbytehq / airbyte


BigQuery impersonate service account #15726

Open joshk0 opened 2 years ago

joshk0 commented 2 years ago

Tell us about the problem you're trying to solve

In the BigQuery connector, I would like to use Application Default Credentials (I think this is already supported) but then use those credentials to impersonate a different service account.

Example: https://github.com/salrashid123/gcp_impersonated_credentials/blob/main/java/src/main/java/com/test/TestApp.java#L28-L30

Describe the solution you’d like

The BigQuery connector should continue to accept JSON credentials for authentication, but fall back to ADC when none are provided. Then, if the configuration field for setting an "Account to impersonate" is set, it should attempt to impersonate that account before attempting to access the table.

If this feature works, the ADC credentials will not need any access to the target BigQuery table; only the impersonated account will.
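
The resolution order requested above can be sketched roughly as follows. This is only an illustrative model, not Airbyte's connector code: the class `CredentialResolution`, its `resolve` method, and the `Source` and `Resolved` types are hypothetical names invented for this sketch, and it only records which credentials would be chosen rather than contacting Google.

```java
import java.util.Optional;

// Hypothetical model of the proposed resolution order; not Airbyte's actual code.
final class CredentialResolution {

    /** Where the base (source) credentials come from. */
    enum Source { JSON_KEY, APPLICATION_DEFAULT }

    /** Outcome of resolution: the source credentials plus an optional impersonation target. */
    static final class Resolved {
        final Source source;
        final Optional<String> impersonatedAccount;

        Resolved(Source source, Optional<String> impersonatedAccount) {
            this.source = source;
            this.impersonatedAccount = impersonatedAccount;
        }
    }

    private CredentialResolution() {}

    static Resolved resolve(String credentialsJson, String accountToImpersonate) {
        // Step 1: explicit JSON credentials win; otherwise fall back to ADC.
        Source source = isBlank(credentialsJson) ? Source.APPLICATION_DEFAULT : Source.JSON_KEY;
        // Step 2: impersonate only when the optional field is set.
        Optional<String> target = isBlank(accountToImpersonate)
                ? Optional.empty()
                : Optional.of(accountToImpersonate.trim());
        return new Resolved(source, target);
    }

    private static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }
}
```

In a real connector, the second step would presumably wrap the source credentials with something like `ImpersonatedCredentials` from the Google auth library for Java, as in the linked example, so that only the impersonated account needs access to the table.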

This is a crucial component for using Airbyte in a multi-tenant customer environment.

Describe the alternative you’ve considered or used

We have no alternative; our security posture as a business is strengthened by using as few hardcoded credentials as possible. Airbyte's BigQuery connector is not usable for us until we can use our existing service accounts in a credentialless fashion.

Additional context

https://cloud.google.com/iam/docs/impersonating-service-accounts

Are you willing to submit a PR?

It'll probably be a very long time before I'd have time to, but theoretically I am willing to.

marcelopio commented 2 years ago

Just to link, I added ADC, but not as a fallback, because that would require the worker to have access to the host environment. So the way to get ADC now is to pass the JSON generated by gcloud auth application-default login.

https://github.com/airbytehq/airbyte/pull/14784

marcelopio commented 2 years ago

I think adding impersonation would just mean adding a field for the account to be impersonated, and that is basically it, if you are OK with using the application-default login JSON. I can make a PR with that idea.

marcosmarxm commented 2 years ago

Thanks @marcelopio, the team will review your proposal this week.

joshk0 commented 2 years ago

I would rather not have JSON anywhere in my authentication pipeline.

The ability for ADC to be attached implicitly by the provider to workloads, cloud function invocations, or compute instances, rather than supplied through a data file that can escape the system, is crucial to my company's security posture.

marcelopio commented 2 years ago

ADC will work on compute engines without any JSON provided, just not on developers' machines.

joshk0 commented 2 years ago

Sure, that's fine, and JSON should be a valid option for dev machines. But I'd like to use better practices in prod.

marcelopio commented 2 years ago

Nice, then the draft solution should be fine!

joshk0 commented 2 years ago

Got it - I misunderstood your sentence back there then. Thanks!

marcelopio commented 2 years ago

My fault, I didn't explain the whole problem with ADC.

Google's ADC implementation has several steps, and what everyone generally assumes is the use of GOOGLE_APPLICATION_CREDENTIALS. Unfortunately, that won't work with Airbyte, but if you are on a Google environment that exposes the credentials via API, it should work.

This service account might be a default service account provided by Compute Engine, Google Kubernetes Engine, App Engine, Cloud Run, or Cloud Functions.

It will also work on Cloud Shell.

grishick commented 1 year ago

@marcelopio I took your PR, added a test for impersonation, and pushed it to this PR: https://github.com/airbytehq/airbyte/pull/20788