awslabs / sbt-aws

SaaS Builder Toolkit for AWS is a developer toolkit to implement SaaS best practices and increase developer velocity.
Apache License 2.0
90 stars 16 forks source link

feat: Interface updates for billing user Read and Update operations #12

Closed twosdai closed 2 months ago

twosdai commented 3 months ago

Reason for this change

Based on prior discussions we wanted to discuss the billing interface further. The Billing interface as is, cannot handle changing a customer's enrollment in an offering, without the full deletion and then creation of a new customer. Ideally, a SaaS customer should be able to upgrade their offering seemlessly, and the billing integration should have options to handle this.

Additionally, if a SaaS customer needs to update their Billing Address, Business Name etc... without an update function this is not a very seamless operation and would require a lot of back channel work.

Finally, there are cases where the SaaS business will want to read the state of the SaaS customer, and then have application level changes. such as displaying a banner, stopping the creation of additional resources, or pausing the use of currently running resources. Without a read function, understanding current state of the SaaS customer is difficult to determine for the SaaS business.

Description of changes

I have added two new IFunctions to the IBilling interface. readUserFunction and updateUserFunction.

Description of how you validated changes

These are added interface changes, there are no tests inside the repo currently testing the billing interface, so I did not add any.

Checklist


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

twosdai commented 3 months ago

@tobuck-aws @madcommit I wanted to provide this PR as a starting point for some of the interface updates that we need to make, let me know your thoughts when you get a second.

I believe the addition of the read and update functionality, is cross cutting enough for any standard billing provider, for instance stripe has full CRUD functionality in its API and system for any customer action.

In a seperate PR, or space we may want to discuss if you're thinking of handing a standard response structure for the reads, which is not included in this PR since I figured that would be a more significant conversation.

suhussai commented 3 months ago

@twosdai , thanks for submitting this PR!

I think the addition of the updateUserFunction to the interface works well with how we are currently using the IBilling interface. For example, here's the line where we hook up the createUserFunction to the DetailType.PROVISION_SUCCESS event, so that when a tenant's provisioning is completed, we go ahead and create a billing user/customer for that tenant.

https://github.com/awslabs/sbt-aws/blob/76ec9a1598e5f395a37c75c9863a64c52ab596c3/src/control-plane/billing/billing-provider.ts#L43-L46

I can see us doing something similar for updateUserFunction, but the use-case with readUserFunction is slightly different. It's not a simple async, fire-and-forget kind of a use-case. There needs to be someone listening to the response. We'll need to do some brainstorming to see how we can support that. There might be an internal API GW of some sort that will be available only to other components to provide that kind of synchronous functionality.

twosdai commented 3 months ago

I think the addition of the updateUserFunction to the interface works well with how we are currently using the IBilling interface.

Would we also want to update the DetailType to take in a new type of PROVISION_UPDATE https://github.com/awslabs/sbt-aws/blob/76ec9a1598e5f395a37c75c9863a64c52ab596c3/src/utils/event-manager.ts#L19

I can add in the implementation of it in this PR if so. Let me know your thoughts on this @suhussai


but the use-case with readUserFunction is slightly different. It's not a simple async, fire-and-forget kind of a use-case. There needs to be someone listening to the response. We'll need to do some brainstorming to see how we can support that.

Some internal API functionality would be nice for decoupling the SaaS from the SBT framework.

Initially I was thinking that the readUserFunction would be primarily used directly in the application code of the SaaS. They would have ability to instantiate a billing client which conforms to this interface and then simply call the readUserFunction and then process the response inside of their application. The main drawback being that whatever programming language the client is implemented would have to be worked around potentially by the SaaS.

But the upside being that we could decouple the potentially lengthy internal API discussion from the addition of the read feature in this interface.

suhussai commented 3 months ago

Yes, we'll need to add a new DetailType to the list of events we currently have and then update the BillingProvider to hook up the new event to the updateUserFunction function. Regarding the event, I'm not sure that PROVISION_UPDATE will work. In my mind, the PROVISION_* set of events are for the first instantiation of a tenant's resources/infrastructure. We'll need to add another event (such as TENANT_UPDATED) that the updateUserFunction can listen for and action on.

Regarding readUserFunction, where would you suggest we surface that functionality? Also, what's a typical use-case this will solve? Just trying to get more info so we can have a more fruitful discussion.

suhussai commented 3 months ago

I was just re-reading the first comment, and I think we can support a lot of the use-cases you mentioned via webhooks. For example:

Based on prior discussions we wanted to discuss the billing interface further. The Billing interface as is, cannot handle changing a customer's enrollment in an offering, without the full deletion and then creation of a new customer. Ideally, a SaaS customer should be able to upgrade their offering seemlessly, and the billing integration should have options to handle this.

We've often advocated for the billing service to be the single source of truth for these kinds of tenant attributes (such as their plan, or whether or not they've paid their bills). So, it makes sense for this information (like tier changes) to flow from the billing provider down to the app, which can pick up the changes and react as necessary via a pre-built webhook.

Additionally, if a SaaS customer needs to update their Billing Address, Business Name etc... without an update function this is not a very seamless operation and would require a lot of back channel work.

I would argue that this should also be done on the billing provider side, and a webhook should be triggered to let the app know that a customer's billing-related fields have been modified. (See customer.updated Stripe event)

Finally, there are cases where the SaaS business will want to read the state of the SaaS customer, and then have application level changes. such as displaying a banner, stopping the creation of additional resources, or pausing the use of currently running resources. Without a read function, understanding current state of the SaaS customer is difficult to determine for the SaaS business.

Information relevant to the operating of the app (such as tenant-tier) should be placed in the user JWT (as a custom-claim, for example) for quick, latency-free lookup. Pausing access due to non-payment should be serviced via a webhook with that information being "pushed" from the billing provider to the app instead of the app "pulling" that information from the billing provider.

@twosdai , does that make sense or am I off base here?

(@tobuck-aws , please chime in here if I said something that doesn't jibe with our long-term vision.)

twosdai commented 3 months ago

I understand your argument and you have some good points and generally I agree that a well designed system is primarily going to use the webhook interface to understand important billing related events.

We've often advocated for the billing service to be the single source of truth for these kinds of tenant attributes (such as their plan, or whether or not they've paid their bills). So, it makes sense for this information (like tier changes) to flow from the billing provider down to the app, which can pick up the changes and react as necessary via a pre-built webhook.

In the instance where all updates to tenants are provided via webhooks, the SaaS app will need to go outside of the framework to provide the user onboarding/plan change page information. In paigo we have a No-Code option for this which a business could use, Stripe additionally has a hosted change plan page option as well.

However now when the end-user (tenant) clicks upgrade/downgrade, there is an async process which would occur for the webhook to eventually be received by the SaaS and any provisioning changes or application state changes due to an enrollment change are a background process. Which is not the usual experience for many users of modern SaaS applications, typically when a user clicks upgrade / downgrade in a SaaS application the appropriate features are revoked/added immediately.

Mabye more simply put, I believe there exists some instances and cases within a SaaS application where the app may really want a process to be synchronous / strongly consistent so as to prevent abuse of the platform. By relying on webhooks solely for communication of enrollment state changes in particular it places more burden on a SaaS app logic to prevent abuse.

Information relevant to the operating of the app (such as tenant-tier) should be placed in the user JWT (as a custom-claim, for example) for quick, latency-free lookup.

This implementation usually isn't done since the JWT is stateless and typically most systems do not revoke JWTs (or handle revoking gracefully) if they are to be canceled. So by supplying billing information that would be used to determine features in a JWT, the end user could have access to their tier for longer than expected, since the JWT expire times is going to be different than any application state change. Additionally I don't know of any billing provider who currently communicates client relevant billing information via a JWT.

Pausing access due to non-payment should be serviced via a webhook with that information being "pushed" from the billing provider to the app instead of the app "pulling" that information from the billing provider.

This makes sense for a best practice design. However I am still left wondering if the existence of webhooks really precludes the interface from enabling a SaaS business at any time to get a customer's billing information. Its a well used pattern that most of our customer's and ourselves follow today.

suhussai commented 3 months ago

Mabye more simply put, I believe there exists some instances and cases within a SaaS application where the app may really want a process to be synchronous / strongly consistent so as to prevent abuse of the platform. By relying on webhooks solely for communication of enrollment state changes in particular it places more burden on a SaaS app logic to prevent abuse.

I agree with what you're saying here.

This makes sense for a best practice design. However I am still left wondering if the existence of webhooks really precludes the interface from enabling a SaaS business at any time to get a customer's billing information. Its a well used pattern that most of our customer's and ourselves follow today.

I think I understand what you're saying, but just to be clear, we want to include the option of having webhooks (for use-cases like tenant non-payment) and synchronous updates from the app-plane to the billing-provider (for use-cases like tenant changing tiers). Assuming that's correct, I'm onboard with the idea.

With regards to implementing this change, I'll lay out what's currently happening, what we're missing and then throw out an idea.

Right now we have an external API GW that handles our sync use-cases (like creating a new tenant or user) and we have an eventbridge that handles internal, event-driven use-cases (like kick-starting the provisioning process for a newly created tenant).

What we're missing is a mechanism that allows us to service sync use-cases for internal services.

The simplest solution I can think of is an internal API GW that is authorized via IAM and is accessed by internal components. An endpoint here can serve requests such as GET /tenants/<tenantId>/billing or something similar but for users. We'll need to figure out details like how to share the endpoint with all of the internal components, but that's what I'm thinking.

Any thoughts on any of this?

twosdai commented 3 months ago

I think I understand what you're saying, but just to be clear, we want to include the option of having webhooks (for use-cases like tenant non-payment) and synchronous updates from the app-plane to the billing-provider (for use-cases like tenant changing tiers). Assuming that's correct, I'm onboard with the idea.

Yes this is what I am thinking as well, you're correct.

The simplest solution I can think of is an internal API GW that is authorized via IAM and is accessed by internal components. An endpoint here can serve requests such as GET /tenants//billing or something similar but for users. We'll need to figure out details like how to share the endpoint with all of the internal components, but that's what I'm thinking.

The SaaS could just use the "client" that implements an internal component. so for example inside a SaaS app they would have some logic that might look something like

import billingClient from "@paigo/sbt-billing"

// Insert any relevant other logic inside their application

// At some point create the client, with relevant secrets, and make any sync calls they want.
const instanceOfClient = new billingClient({accessKey: "1234", "secret": "abc"})
const res = await instanceOfClient.readUserFunction("myUserId123");
// Do something with the response

The RestAPI for the SaaS is great for decoupling the SaaS from the language implementation of the SBT framework, but for a first step, just having the SaaS access the "client" might be the absolute easiest thing to do.

If we want to go down having the internal api-gw and have the SaaS use the REST api for all interactions, we'd additionally, just need to have some sort of internal service discovery like you're suggesting that internal components would report what endpoints exist, expected response structure, and methods.

For the record, I'm not against having an internal api-gw, and service discovery like you suggested, but I think its a separate conversation from just agreeing on the interface for this component.

@suhussai Let me know your thoughts.

tobuck-aws commented 3 months ago

@twosdai first off, thanks for the PR. For the record, I think we're all aligned with the need for the interface you're describing, we're challenged with how to actually implement it. I just got off a meeting with our core team of folks and we still don't quite have the answer, but we're working on it. For some background, when we set out to design SBT, we gravitated to using an event-based paradigm because it scales well, and let's us plug in new capabilities decoupled from one another, such that new feature could be deployed without potentially adding awareness of that feature to existing code that could care less.

The billing stuff is a perfect example. Currently, when the control plane onboards a new tenant, the first thing it does is sends a message to EventBridge indicating this intent. The application plane tunes into this message, because it has a part to play in onboarding (specifically provisioning new infra, if needed). If the application uses billing, the control plane also needs to listen to this message, because it has to ensure all of the appropriate billing plumbing is hooked up when a tenant comes online. As this was done in an event-based model, the onboarding process didn't have any imperative code that we added to, and redeployed. The new billing code could just drop into the same environment, and tune into the existing flow of messages, and do its thing based the events that were raised.

That said, we ourselves (and as detailed in this thread) have run into situations where it feels like we need "synchronous-like" behavior. Case in point:

Mabye more simply put, I believe there exists some instances and cases within a SaaS application where the app may really want a process to be synchronous / strongly consistent so as to prevent abuse of the platform. By relying on webhooks solely for communication of enrollment state changes in particular it places more burden on a SaaS app logic to prevent abuse.

In essence, we have a couple of problems we're wrestling with at this moment:

  1. There is functionality in the control plane that's expected to be called by internal audiences, external audiences, and perhaps both
    1. Take, for example, some imaginary UpdateTenant code. This code might back a REST API, perhaps PUT /tenants/tenant123, but also might be the target of an EventBridge message received by the control plane in response to PROVISION_SUCCESS (maybe we have to update the tenant status from Provisioning to Provisioned, for example). The first one is an external caller, the second one is entirely internal. Ideally we'd just invoke the same code, irrespective of what door it came through, but given the nature of caller authentication we're challenged here. The EventBridge message might use IAM Auth, whereas the API-gw uses a custom authorizer (maybe decoding a JWT), but we can't mix and match. An API-gw can have only one authorizer, but not both
  2. As I detailed above, we sometimes need a synchronous-like interface, to a feature that's implemented asynchronously.
    1. It's not as simple as just throwing up an API Gateway in front of it, and serializing the otherwise parallel functionality of the asynchronous code for the reasons described in the first point.
  3. The IFunction interface may be too course-grained. It's effectively an abstraction over a Lambda (that is, just code in the sky), it says nothing about the expectations of that code.

I guess the tl;dr of my comment is we're hesitant to just merge the interface without having some understanding of how we'll map this back to the reality of the SBT control plane, when we solve the problems cited above. We have another meeting on the books and will hopefully have some ideas afterwards. I'll update this thread with what comes of that meeting.

twosdai commented 3 months ago

Ok great @tobuck-aws Thanks for letting me know where this is heading. The basic end state Paigo needs in order for us to justify working on the implementation is to have the basic CRUD functionality defined in the interface (or interfaces whatever the architectural pattern is), along with the major business cases thought through for the system.

The biggest being from a billing perspective:

  1. New user onboarding
  2. User Upgrades
  3. User Downgrades
  4. User leaves
  5. User reading current billing information a. Usage (How much have they consumed this billing cycle) b. Current plan information, and past invoices
  6. Validation of User State (Payment, Past due invoices etc...)

    While going through any internal discussions, I think if these are kept in mind and have fairly straight forward answers for a business who is using SBT, then billing for the main cases will be covered.


The biggest case for a strongly consistent flow, in my perspective, would be within a new user onboarding, and the validation of the user state. Specifically, preventing provisioning of resources due to a validation issue, and preventing the provisioning of resources due to a missed step with onboarding.

I'm not trying to be obstinate with this PR, we just need to have some clear answers for this base business cases before I can justify implementation. I'm mainly worried about having to do a bunch of rework if we get started without a clear picture of how these will function.

tobuck-aws commented 3 months ago

@twosdai, we're on the same page. Thanks for the list, and apologies for the delay here. We'll be back in touch shortly on the path forward. I really appreciate your patience!

tobuck-aws commented 2 months ago

Closing this for now, as we have a new PR coming that addresses the UpdateCustomer and GetCustomer workflows detailed by @twosdai.