mdibaiee commented 2 years ago

How do we integrate OAuth into our connectors?

mdibaiee commented 2 years ago

Terms and Concepts:

User: The end user of Flow, the person sitting behind the computer
OAuth Providers: Google, Facebook, GitHub, etc. which provide us with an OAuth server through which we get access to a user's data on their platform.
Developer Application: An application that we have to register with OAuth providers. For example, we will need an Estuary developer application on Google, GitHub, Facebook, etc. which we will use to request access to a user's account. Our Developer Application is identified with a client_id and protected by a client_secret.
Scopes: Each OAuth Provider supports a list of scopes, which are basically categories of access, e.g. repo:read and repo:write on GitHub allow for read and write access to a user's repositories, respectively.
Redirect URL: Our Developer Application needs to register a Redirect URL, which, once the user has finished filling in the authorization form for the target provider, they will be redirected to. This allows us to receive data from the provider about whether the OAuth request was successful / failed and which scopes are available.
State Parameter: We can pass a state in our request for authorization to the OAuth Provider, and the provider will return the same state parameter back to us once the authorization process is done. This can be used for multiple purposes such as redirecting to dynamic URLs, and protecting against CSRF attacks by checking the integrity of the OAuth session.

OAuth Flow (See the diagram)

Send the URL to the OAuth provider along with our Developer Application's client_id, the State Parameter, and the Scopes we would like to have access to. Example:

https://github.com/login/oauth/authorize?response_type=code&client_id=$CLIENT_ID&redirect_uri=https://dashboard.estuary.dev/oauth&scope=repo:read&state=$STATE

The user will be presented with an authorization page like the one below:
Once the user approves the authorization, they are then redirected to our requested Redirect URL, along with a code and state parameter representing the authorization code and the State Parameter. Example:

https://dashboard.estuary.dev/oauth/?code=0c245f0a0545bb4d3db8&state=$STATE

We can now send a request in the backend to ask for an Access Token from the OAuth Provider, using the provided code:

curl -XPOST https://github.com/login/oauth/access_token -d grant_type=authorization_code -d client_id=$CLIENT_ID -d client_secret=$CLIENT_SECRET -d code=0c245f0a0545bb4d3db8

In response, we will receive an access token, along with the scopes of this access token:

access_token=gho_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&scope=&token_type=bearer

The response may additionally include a refresh_token along with an expiry duration for the access_token, which indicates that the access token will expire at some point:

"access_token": "AYjcyMzY3ZDhiNmJkNTY",
"refresh_token": "RjY2NjM5NzA2OWJjuE7c",
"token_type": "bearer",
"expires": 3600

The refresh_token allows us to request a new access token in the backend without the code parameter, i.e.:

curl -XPOST https://github.com/login/oauth/access_token -d grant_type=refresh_token -d client_id=$CLIENT_ID -d client_secret=$CLIENT_SECRET -d refresh_token=$REFRESH_TOKEN

The response will again include yet another refresh_token which must be used to get a new token for the next time.

Challenge: in order to refresh the token, the client_secret is necessary, but we should ideally avoid having our client_secret in the connector (a malicious connector can extract our client_secret and use it). Airbyte sends the client_secret and client_id to the connector, but I think that's not a safe approach for us.

This means we need a way of automatically renewing all the OAuth access tokens that are about to expire, and giving the new access tokens to the connectors. A basic way would be to have a cronjob that renews access tokens, updates catalogs with the new access token, and triggers a restart of the connector.

Currently the Airbyte connectors require client_id and client_secret and refresh_token provided to them so they can renew the access token themselves. My understanding of reading the code here is that the connector starts by asking for a refreshed token, which is not very helpful for us.

We can change the code so that we do not refresh the token, and only use the access token provided for OAuth. The expectation from the connector will then be to have a new token provided by Flow before the expiry of the current token, to avoid token expiry errors.

mdibaiee commented 2 years ago

Things we need to support OAuth for connectors:

[ ] Developer Accounts on all providers that we want to authenticate with (Google, GitHub, Facebook, ...)
[ ] Credentials (client_id and client_secret) of our Developer Accounts stored somewhere
[ ] Introduce OAuth button to certain connectors in the UI. How we do this is an open question, but probably using the JSONSchema spec of the connector is the path of least resistance.
[ ] Backend pieces in Control Plane for:
1. [ ] Creating the initial authorization URL to send the user to when they click the OAuth button. This must be done in the backend because we need to generate a State Parameter that is known to the server and will be used to verify the integrity of the OAuth session when we get a response from OAuth provider.
2. [ ] Receive the OAuth response from the provider, fetch an access token and refresh token, and store those somewhere. Update the catalog of the corresponding capture and add the access token to it. (I am not 100% sure about this but I suppose we will already have a draft which we can update to add the access token to the config file of).
[ ] A service for automatically renewing access tokens when they are near expiry, updating the catalogs to use the new access token, and communicating this change to connectors (probably just a restart).

mdibaiee commented 2 years ago

Question raised by @dyaffe:

The challenge you called out is interesting but does that imply more intelligence around how to refresh tokens from different specific systems in the runtime?

Answer:

The “basic” approach I’ve suggested assumes that the service that refreshes tokens knows how where a token has come from (e.g. Google, Facebook, GitHub or somewhere else).

All this service needs to know is the URL to which to send the token request to. This URL can be stored alongside the initial refresh token when the user first presses the OAuth button. This URL can be provided by the connector config spec, as a custom JSONSchema field for example. e.g.:

title: OAuth
x-token-url: https://accounts.google.com/o/oauth2/v2/auth
type: object
required:
  - access_token
properties:
   access_token:
      title: Access Token
      type: string
      description: The token for accessing the service
      secret: true

jgraettinger commented 2 years ago

Thanks, this was a super helpful writeup. A few things I've learned from it:

The single-use nature of refresh tokens means the agent must be responsible for refreshes, and each refresh must change the spec's endpoint configuration.

The upside of this is it keeps connectors from needing to implement refresh. It also provides a simple answer to "how do running connectors get updated tokens?": they get it through the existing mechanisms for publishing an updated spec with a new refresh and access token.

Something will need to receive these oauth callbacks from providers.

The config encryption service seems the obvious way to slice up service boundaries. It's a public service with a stable, known endpoint that already is responsible for receiving / encrypting secure credentials.

We can embed a token within the state callback parameter that lets the config service pass the response to the initiating agent through some pub/sub channel (Postgres LISTEN / NOTIFY? Add to append-only table?).

BUT we also want to keep this service as least-privileged as possible: it's able to encrypt but not decrypt credentials via sops. It also must not have privileged access to the database.

We want to treat access and refresh tokens as secret credentials protected via sops using our existing mechanism.

This one is obvious, but these are secrets and we don't ever want them in the DB unencrypted.

Side note: I encountered https://github.com/markbates/goth in my own prototyping of the (now junked) authentication service.

It wraps up a whole lot of details of dealing with specific providers, such as their particular URLs, callbacks, and warts of their implementation, and may be helpful in this design.

psFried commented 2 years ago

Thanks for the write up.

Airbyte sends the client_secret and client_id to the connector, but I think that's not a safe approach for us.

IMO this is worth questioning. I'll try to express my ambivalence:

It seems reasonable to me to require that we trust the connector if it's going to use OAuth with our client_id. Additionally, I could imagine the client_id and client_secret being tied to each specific connector (e.g. as columns on connector_tags). In other words, I think the question we're sort of dancing around is whether or not to support running untrusted connectors that use OAuth. I think it'd be reasonable to say no, at least for now. But if we did want to support that, then it would also seem reasonable to say that whoever provides the connector must also provide their own client id and secret.

The issue of refresh token rotation seems a little more motivating. If we pushed down the refresh_token => access_token exchange into the connector, then what happens when a provider rotates the refresh token on every exchange? The connector doesn't really have a way to securely store the new refresh token, and so maybe storing it in the sops-encrypted endpoint config really is the best option. Technically, access tokens could be very short-lived, and it seems unfortunate to require a catalog re-build every time we need to refresh it. But I could also see some sense in exercising those code paths frequently and ensuring that they stay "well oiled".

I think it would help to have a better understanding of a few of these things:

What is the minimum expiration duration for access_tokens in the real world? In other words, how frequently do we need to support updating these things?
How common is refresh token rotation in the real world? Is it ubiquitous? Is there a subset subset of oauth providers that don't use it?
What happens if we successfully exchange a refresh token for an access token and a new refresh token, and then fail to persist the new refresh token? Would we need to have the user go through the whole oauth flow again, or is it possible to still re-use an old refresh token for a time?

This is maybe going off on a bit of a tangent, but why don't connectors have a way to securely store state? If we came up with a way for connectors (or tasks, generally) to encrypt all or part of their state, it could seemingly enable us to push down the refresh token exchange into the connectors. Assuming that were possible, would there still be reasons to have the agent handle the exchange?

jgraettinger commented 2 years ago

client_secret is only ever used in the server-side call to retrieve an access + refresh token, correct?
You give an example of retrieving tokens using client_id and client_secret, plus a code received via callback. In your example you also have a redirect_uri parameter:

curl -XPOST https://github.com/login/oauth/access_token -d grant_type=authorization_code -d client_id=$CLIENT_ID -d client_secret=$CLIENT_SECRET -d redirect_uri=https://dashboard.estuary.dev/capture/123 -d code=0c245f0a0545bb4d3db8

Is this actually correct, is a redirect used? I was under the impression this is a server-side call which will directly return the requested tokens.

mdibaiee commented 2 years ago

client_secret is only ever used in the server-side call to retrieve an access + refresh token, correct?

Yes

Is this actually correct, is a redirect used? I was under the impression this is a server-side call which will directly return the requested tokens.

It is necessary to pass the redirect_uri in that request, too. The response does not actually redirect (it's a POST request with a 200 response and a body), but the redirection URI must stay consistent throughout the chain of OAuth calls. I will have to dig a bit deeper to understand why but I think it's a form of check against hijacks. See here for an example.

jgraettinger commented 2 years ago

Okay, thanks. That cuts down on the blast radius of where the client_secret must exist, and redirect_uri as a checked but not an effective parameter also makes sense.

Big picture, this seems the biggest question to run down:

The response will again include yet another refresh_token which must be used to get a new token for the next time.

Is that ... really true? Across how many APIs? A response may include a refresh_token, but it's surprising to me that any vendor would treat a refresh token as single use (since the oauth provider is able to revoke a refresh token at any time).

Google docs, for example, convey that a token is used repeatedly.

Are there vendors which would preclude us from simply storing and re-using a refresh token forever?

Edit to add: Likely yes. See this article on refresh token rotation and reuse detection.

Edit again to add: Here's GitHub's position. Refresh tokens are valid for 6 months and responses include their expiration time. 6 months is a while, but isn't so long that we don't need to think about automatic refresh.

Edit again to add: Salesforce's implementation doesn't appear to even offer refresh token rotation.

psFried commented 2 years ago

Refresh tokens are valid for 6 months and responses include their expiration time. 6 months is a while, but isn't so long that we don't need to think about automatic refresh.

This seems to answer the question I had about failure scenarios in handling the refresh token exchange. It sounds like we'd be fine to just re-use the old token in the case that we fail to persist the new one after an exchange. That seems good and sane to me.

mdibaiee commented 2 years ago

It seems reasonable to me to require that we trust the connector if it's going to use OAuth with our client_id. Additionally, I could imagine the client_id and client_secret being tied to each specific connector (e.g. as columns on connector_tags). In other words, I think the question we're sort of dancing around is whether or not to support running untrusted connectors that use OAuth. I think it'd be reasonable to say no, at least for now. But if we did want to support that, then it would also seem reasonable to say that whoever provides the connector must also provide their own client id and secret.

I think your general line of thinking makes sense, however, in practice, do we have any safeguards against someone updating their flow catalog using flowctl to point to their own connector image and running it? In general, this distinction between trusted and untrusted connectors itself is something we need to implement and make sure the distinction and the security safeguards are held at all points of contact, which can be difficult.

What is the minimum expiration duration for access_tokens in the real world? In other words, how frequently do we need to support updating these things?

My understanding from some readings is that it is usually "long enough", i.e. usually longer than 30 minutes. My reference are these docs:

The RFC uses 1 hour as the example value for expiry of access_token: https://www.rfc-editor.org/rfc/rfc6749#section-4.2.2
Google's docs for Apigee consider 30 minutes to be a "reasonable" expiration time: https://cloud.google.com/apigee/docs/api-platform/antipatterns/oauth-long-expiration
The default for Auth0 seems to be 24 hours: https://auth0.com/docs/secure/tokens/access-tokens#custom-api-token-lifetime

Note that refresh_token expiry is usually longer than access_token expiry (by an order of magnitude) to allow refreshing of the access_token for a considerable amount of time after it has expired.

How common is refresh token rotation in the real world? Is it ubiquitous? Is there a subset subset of oauth providers that don't use it?

My understanding is that access tokens are not meant to last very long, and are meant to be used for short flows in a UI, the method of using refresh_token is also known as the offline access method, which is what we want: we want to keep having access to the user's account even when they are offline. I think for us, we can consider that this will be the de facto method.

jgraettinger commented 2 years ago

Just to throw in a counter-point, Facebook apparently doesn't offer refresh tokens and their access tokens last 60 days :shrug:

mdibaiee commented 2 years ago

@jgraettinger that's probably why Airbyte doesn't have OAuth for Facebook... but we would like to be able to have OAuth with Facebook to go around the ratelimits... so that's challenging

mdibaiee commented 2 years ago

For some reason Cmd+Enter is "Close with comment" now for me 🙄

dyaffe commented 2 years ago

@mdibaiee I can verify that they do offer it on their cloud product

mdibaiee commented 2 years ago

@dyaffe ah okay, I checked the spec in airbyte/airbyte and couldn't find OAuth there. I'll look to see how can we refresh tokens for Facebook

mdibaiee commented 2 years ago

Next steps:

[x] Check Facebook refresh mechanism
[x] Check if we can do pop-up OAuth always (what about mobile phones, etc.?)
[x] Simple CLI for basic operations of OAuth (potentially use goth) - see https://github.com/estuary/oauth

mdibaiee commented 2 years ago

Facebook:

The manual process of OAuth is described here: this lends us a short-lived token
The process of getting a long-lived token from a short-lived token is described here
From this doc ^ my understanding is that the token is automatically refreshed whenever a request is sent:

These tokens are refreshed once per day, when the person using your app makes a request to Facebook's servers. If no requests are made, the token will expire after about 60 days and the person will have to go through the login flow again to get a new token.
I dug up the old Facebook JS SDK, and it seems to refresh the token it just sends a login_status request every 20 minutes. Again my understanding is that it just wants to send a no-op request to refresh the token: 1, 2, 3

So my conclusion is that Facebook allows us to refresh the token, we just have to send a request with the given access token. The login_status request seems like a sensible no-op request to send for this.

mdibaiee commented 2 years ago

OAuth with pop-up window is described here and here, however I have evidence that in some platforms this may not be supported, e.g.:

https://stackoverflow.com/questions/27870370/authwithoauthpopup-doesnt-work-on-mobile

In such cases we need to have a browser redirect solution in place. So my take is that the browser redirect is the most primitive way of supporting OAuth2, but to improve UX on platforms that support it, we can use window popups. In our case, I'm almost confident all desktop browsers will be fine with popup windows for OAuth, so we can start with that, and then implement browser redirect. Depends on how our frontend people prefer it.

mdibaiee commented 2 years ago

After two discussions in the past couple of days, here is an outline of our plan for OAuth:

Technical Flow

Connectors can, using their SpecResponse to signal that they support OAuth, and they can provide us with templates for creating requests to:
1. Generate the initial auth URL which the user is redirected to (e.g. the login page of the provider)
2. Request an access token given the response of step i
3. Refresh an access token given the response of step ii The precise form of this SpecResponse is yet to be determined, but just for the sake of intuition, here is an example below. Also see Airbyte's specification:

x-oauth:
  provider: google
  auth-url: "https://accounts.google.com/o/oauth2/auth?access_type=offline&client_id={{ .ClientId }}&redirect_uri={{ .RedirectURI }}&response_type=code&scope=email&state={{ .State }}"
  access-token-url: "https://oauth2.googleapis.com/token"
  access-token-body:
    grant_type: authorization_code
    client_id: "{{ .ClientId }}"
    client_secret: "{{ .ClientSecret }}"
    redirect_uri: "{{ .RedirectURI }}"
    code: "{{ .Code }}"
  refresh-token-url: "https://oauth2.googleapis.com/token"
  refresh-token-body:
    grant_type: refresh_token
    client_id: "{{ .ClientId }}"
    client_secret: "{{ .ClientSecret }}"
    refresh_token: "{{ .RefreshToken }}"

The UI will show the OAuth provider identifier to show the right button. Clicking this button will open the auth-url of the OAuth provider, and the flow begins.
The Redirect URI itself can also be a page of the dashboard, in which case, our JavaScript code can verify the state and send the code parameter back to a backend endpoint to generate the first access_token and refresh_token if available.
Once we have the first set of tokens, we then update the endpointSpec of the connector so that it includes access_token, refresh_token and expires_at.
The client_id and client_secret parameters are associated with connector images, which means we will be able to use Estuary's OAuth clients for our own trusted connectors (i.e. ghcr.io/estuary/*). These credentials are provided to the connector by the runtime, this means that our users will not have access to such credentials, and only trusted connectors will have access to these credentials.
This allows 3rd-parties to bring their own connectors along with their own client_id and client_secret to allow their connector to be run on our platform while keeping their credentials secure.
Connectors will be responsible to use client_id, client_secret, access_token, refresh_token and expires_at to keep their access by refreshing the tokens if necessary.
Still, the refresh_token itself needs to be rotated. This usually only needs to happen over long periods of time (e.g. 30 days and more). This will be the responsibility of control plane / agent to renew refresh tokens and update the endpointSpec of the corresponding connector.

Implementation Steps

[ ] Add client_id and client_secret to the connectors table in control plane. Make sure the secret column is not accessible by default and only specific priviledged services (e.g. the service for generating the tokens) have access to this column: https://github.com/estuary/animated-carnival/pull/32
[x] Disallow 3rd people to create docker images on ghcr.io/estuary without merging to upstream (e.g. pull-requests from 3rd people should not allow them to build and publish images on our repository): pull-requests from forks can't run our actions because they do not have the secrets (e.g. ghcr credentials), and so their workflows fail: https://github.com/estuary/connectors/pull/280 -- We must also make sure we do not allow Workflow runs for PRs that change workflows: https://docs.github.com/en/actions/managing-workflow-runs/approving-workflow-runs-from-public-forks
[ ] Implement a backend service which accepts the result of the authorization flow (i.e. code) for a specific connector and requests an access_token and refresh_token from the provider, and returns these tokens.
[ ] Implement a recurring job that goes over all refresh_tokens and renews them, updating the endpointSpec of connectors.
[ ] Implement the UI for rendering the button according to JSONSchema of connectors
[ ] Implement the page for handling the response of authorization flow and sending requests with code to get access_token
[ ] Implement logic for injecting client_id and client_secret of connectors before running them
[ ] Implement OAuth for one connector to test this flow

psFried commented 2 years ago

After two discussions in the past couple of days, here is an outline of our plan for OAuth:

@mdibaiee this seems like a really solid plan. Thanks for putting it all together and making it so clear.

Implement the UI for rendering the button according to JSONSchema of connectors

I'm curious about this point. How would a schema represent the desire to show a "Login with OAuth" button? One consideration is that some connectors may want to support other authentication options in addition to OAuth, like an api key. So I'm thinking it should be able to be rendered inside of a oneOf. Also, the Flow agent needs to not only understand how to extract things like the access_token from the response bodies, but it also needs to understand how to extract and set those fields within the endpoint config.

I don't love doing this type of thing by convention (matching well-known field names) because it feels very opaque and magical to developers who don't already know the convention. But I do like that convention enforces consistency in the field names. An alternative that comes to mind would be some sort of JSON schema annotation like x-oauth-property: "refresh_token" (:wave:) that ties a location in the schema to a given property. I'm ambivalent about which I like better, but curious to get others' thoughts.

edit to add: Another detail is that the UI will need some way of knowing that it should not render normal inputs for these fields (access_token, etc.).

mdibaiee commented 2 years ago

So I'm thinking it should be able to be rendered inside of a oneOf.

I agree, and this is how Airbyte connectors are working currently, e.g.: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-google-sheets/google_sheets_source/spec.yaml#L26-L52

the Flow agent needs to not only understand how to extract things like the access_token from the response bodies, but it also needs to understand how to extract and set those fields within the endpoint config. An alternative that comes to mind would be some sort of JSON schema annotation like x-oauth-property: "refresh_token" (👋) that ties a location in the schema to a given property. I'm ambivalent about which I like better, but curious to get others' thoughts.

I personally prefer using the conventions of OAuth for the field names, mostly because OAuth itself is an established convention, but your idea of being able to specify different locations also works, and it could follow the convention by default, if the user has not specified any different locations for OAuth fields.

jgraettinger commented 2 years ago

How would a schema represent the desire to show a "Login with OAuth" button?

Today, we invoke connectors to fetch specs in the agent, which is charged with updating the connector_tags table. It maps the response into state of that table, such as the protocol, the endpoint config, the resource config, the documentation url, etc.

It seems rather natural to me that it would extract authentication metadata from the SpecResponse and store it in this table where the UI is able to inspect it. While we can further abuse the endpoint_config schema for this purpose, I think we should really try not to...

mdibaiee commented 2 years ago

Today, we invoke connectors to fetch specs in the agent, which is charged with updating the connector_tags table. It maps the response into state of that table, such as the protocol, the endpoint config, the resource config, the documentation url, etc. It seems rather natural to me that it would extract authentication metadata from the SpecResponse and store it in this table where the UI is able to inspect it. While we can further abuse the endpoint_config schema for this purpose, I think we should really try not to...

Okay, that helps a bit. We will still end up having some configuration in the endpoint_spec_schema in order to signify where in the spec we want the OAuth credentials to be put, just like the Airbyte connectors do with their specs.

In that case we would have some of the OAuth configuration in a separate part of SpecResponse, e.g. the OAuth provider templates, and some others which are more directly relevant to the spec itself in endpoint_spec_schema, such as where the resulting tokens have to live in.

mdibaiee commented 2 years ago

Did a quick search into what would a CSRF attack look like and how does the state parameter prevent it (also this answer):

I initiate an OAuth login request for Google on Estuary
I am redirected to Google’s login page, e.g.: https://accounts.google.com/o/oauth2/auth?access_type=offline&client_id=ESTUARY_CLIENT_ID&redirect_uri=dashboard.estuary.dev/oauth&response_type=code&scope=email&state=SOME_STATE
I do not fill the login form, instead I send this to a friend of mine, and they open it up and login, and they are redirected to dashboard.estuary.dev/oauth with an access token.
If there is no proper state verification, then my Google connector will have access to my friend’s account.

But if we do have state verification, my friend’s state verification will fail, because they have a different session than mine, and dashboard.estuary.dev/oauth will compare the SOME_STATE with something like session['state'] and it will fail.

mdibaiee commented 2 years ago

I was looking at Omniauth, another (more popular and mature) library for OAuth, and this is the abstraction they provide for OAuth2 providers:

https://github.com/omniauth/omniauth-oauth2/blob/master/lib/omniauth/strategies/oauth2.rb

option :client_id, nil
option :client_secret, nil
option :client_options, {
  :site => 'https://api.github.com',
  :authorize_url => 'https://github.com/login/oauth/authorize',
  :token_url => 'https://github.com/login/oauth/access_token'
}
option :authorize_params, {}
option :authorize_options, %i[scope state]
option :token_params, {}
option :token_options, []

The _options parameters just list what options are available to the developer who is calling OAuth2 provider calls in code, and is not really relevant for us if we use a template.

Basically they let you specify the URL and the parameters for each step of the flow (client: general options for all requests, authorize: step 1, the initial redirection to authorization dialog, token: fetching the access token)

It seems like given that the developer can specify any parameters they want, and be able to extract the response, we will be fine. We can basically let the developer specify which parameters they want to avail of from the response, and we add all of those to the endpointSpec.

mdibaiee commented 2 years ago

Edge Functions

I took a look at Supabase Edge Functions and there are some concerns about them:

Apparently they can't run in parallel, which means they don't scale at the moment (this is a bug and not intended behaviour): https://github.com/supabase/edge-runtime/issues/121
They are currently experimental, up until 1 August 2022, and there are some known limitations and potentially some breaking changes to come. The most important of these limitations for us is that when doing local development, only a single function can be served by supabase (this is not the case when deploying the functions). We can probably work around this by running the functions locally manually using deno.
We have to consider issues unknown at the moment, by the nature of this product being new and immature.

Edge functions have some nice values that they bring:

Supabase handles authorization to call the function itself (e.g. you need to be authorized to call the function in the first place)
It provides us automatically with a service role and our supabase URL which makes it a bit easier to get started

Alternatives

We could instead have a Cloud Run function, and provide it with a service role, and in local env we could set it up manually. We can be sure that Cloud Run functions will run in parallel and we probably won't face some unknown unknowns of immaturity of Supabase Edge Functions. We will need to handle authorizing the request that comes to Cloud Run, which we can probably do by sending an auth request to Supabase and checking permissions, etc.

mdibaiee commented 2 years ago

Multiple OAuth providers for one connector

This question was raised during the discussion yesterday: do we need, and do we want to support multiple OAuth providers for the same connector? We don't have a great example of this at the moment, the closest thing we have is Firebolt, which requires credentials from Firebolt, but also from AWS for the S3 bucket. It's a very hypothetical example beacuse Firebolt doesn't even provide OAuth, but one could imagine the case where these two did, and we might want to provide two OAuth.

One point I have against it is: I would expect most services, that if they want access to some other resource, they handle the authorization to the third service, not us. e.g.:

However, in the case of Firebolt it makes sense, since they expect you to have your data somewhere in some bucket, and you are free to load data from multiple buckets into one database, so they can't really associate the database with a single bucket and do authorization.

If we were to allow multiple OAuth2 providers, we could use the provider as a key for mapping the OAuth2 of a connector for a specific connector between the SpecResponse and the EndpointSpecSchema.

mdibaiee commented 2 years ago

OAuth2 spec in endpoint_spec_schema

We have the separate SpecResponse field for OAuth2 specification which will be used by backend to generate the Auth URL, fetch access token and refresh token. However, for the UI to know to show an OAuth2 field in the connector config, it needs to know where to show it, and what fields need to be filled with the response of the OAuth2 flow. We can use annotations in the endpoint_spec_schema, specifically for the UI to render the form and fill it in. e.g.:

credentials:
      type: object
      title: Authentication
      description: >-
        Google API Credentials for connecting to Google Sheets and Google Drive
        APIs
      oneOf:
        - x-oauth2-provider: google
          title: Authenticate via Google (OAuth)
          type: object
          properties:
            client_id:
              title: Client ID
              type: string
              secret: true
            client_secret:
              title: Client Secret
              type: string
              secret: true
            access_token:
              title: Access Token
              type: string
              secret: true
            refresh_token:
              title: Refresh Token
              type: string
              secret: true

In this case the UI could know given the x-oauth2-provider to render the google OAuth2 button. The rest of the fields will be hidden, and once the response of the OAuth2 flow is received by the UI, the hidden fields will be filled with the values. This allows the connector to select which fields it needs, too. Rather than the fields being filled in magically, the connector developers will know where and which values will be filled in.

dyaffe commented 2 years ago

My only comment on this is that I think it's probably safe for us to assume we will only use one OAuth provider for connectors in the foreseeable future.

jgraettinger commented 2 years ago

We can use annotations in the endpoint_spec_schema, specifically for the UI to render the form and fill it in.

Playing devil's advocate, it's not clear we totally need this, since the UI could also keep a mapping of connector images (e.x. ghcr.io/estuary/source-google-sheets) and their provider-specific button behaviors. The new authorization section of the SpecResponse could also provide JSON pointers of portions of the schema form which aren't directly user-entered, correct?

We hope not to have to create most of these connectors ourselves, so if we can avoid a new requirement on their specifications that seems it would be helpful.

mdibaiee commented 2 years ago

@jgraettinger I think it's more confusing for a connector developer to have some fields magically filled in somewhere, and it's unclear to me how do we decide "where" in the form we show this OAuth? in case of most connectors it's a oneOf between Service Account and OAuth, how does UI realise that? In some other cases it might be the only option, etc.

mdibaiee commented 2 years ago

OAuth is released ✅

estuary / flow

OAuth2 #568

Technical Flow

Implementation Steps

Edge Functions

Alternatives

Multiple OAuth providers for one connector

OAuth2 spec in endpoint_spec_schema