ConsumerDataStandardsAustralia / standards-maintenance

This repository houses the interactions, consultations and work management to support the maintenance of baselined components of the Consumer Data Right API Standards and Information Security profile.

Raising of Traffic Threshold NFRs specified in the CDS #541

Closed jimbasiq closed 2 years ago

jimbasiq commented 2 years ago

Description

Basiq would like to raise concerns and propose a review and uplift of the currently specified Traffic Thresholds NFRs for CDR Data Holders. We believe the current limits are too low to support a data recipient serving the Australian consumer.

Can I please propose this topic as a priority for Maintenance Iteration 13, starting in a couple of weeks?

Area Affected

To provide some detail and hopefully validate this as a worthwhile topic: our primary concern is the rate limit on refreshing data for all consumers of a given institution (Data Holder) for a given software product (Data Recipient):

There is a specified limit for unattended traffic per software product ID, but it applies only outside business hours; during business hours only "best effort" is expected, which means rates could be even lower. A Data Holder we have been working with has already confirmed we are hitting their limit for private endpoints. That limit is 50 TPS, and they will continue to apply it unless the CDS specifies a higher throughput.

To illustrate the rate limit and its current limitations, here are all the requests we send within one data refresh job for one consumer:

- GET access token (sent once; we are not 100% sure whether this counts towards the rate limit)
- GET the list of accounts
- GET the list of balances
- GET account details, targeted only if the account.detail scope is present (number of requests equals the number of accounts)
- GET transactions, targeted only if the transaction.detail scope is present (number of requests is greater than or equal to the number of accounts; to simplify, let's say it is equal)
- GET customer details (sent once)

This means that if we have all required scopes, we are sending at least 2 × (n + 2) requests, where n is the number of accounts.

Now let's imagine a perfect scenario where we send 50 requests per second, every second of the day, and see how many jobs we could complete depending on the average number of accounts per job:

n=3; 86,400 × 50 / (2 × (3 + 2)) = 432,000 jobs in total during a day

n=5; 86,400 × 50 / (2 × (5 + 2)) ≈ 308,571 jobs in total during a day
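
For anyone wanting to play with these assumptions, the arithmetic above can be expressed as a short script. This is a sketch only: it assumes the quoted 50 TPS limit, that every call (including the access token) counts towards it, and the 2 × (n + 2) requests-per-job model described above.

```python
# Sketch of the job-capacity arithmetic above. Assumes a flat 50 TPS,
# that all calls (including the access token) count towards the limit,
# and 2 * (n + 2) requests per refresh job.
SECONDS_PER_DAY = 86_400
ASSUMED_TPS = 50

def requests_per_job(accounts: int) -> int:
    # 1 token + 1 accounts + 1 balances + n details + n transactions + 1 customer
    return 2 * (accounts + 2)

def jobs_per_day(accounts: int, tps: int = ASSUMED_TPS) -> int:
    return SECONDS_PER_DAY * tps // requests_per_job(accounts)

for n in (3, 5):
    print(f"n={n}: {jobs_per_day(n):,} jobs per day")
# n=3: 432,000 jobs per day
# n=5: 308,571 jobs per day
```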

300-430k connections per day for one software product at any of the big four banks is not enough, and that is under a perfect scenario. Realistically the achievable rate could be 5-10 times lower, and we see this as a serious limitation for several of our Partners.

Change Proposed

The largest Data Holder in Australia has just under 18m customers. It is quite feasible that a successful Australian fintech could attract half of Australian consumers, which would mean around 9m consumers.

Rounding up to 10m (roughly 33 times the ~300k jobs per day achievable under the current 50 TPS limit) to allow for some growth, I have two proposals:

  1. All major banks to adhere to a Traffic NFR for unattended traffic per software product ID of 1,650 TPS (= 33 × 50). Non-major banks to have a lower TPS, to be negotiated.
  2. The TPS provided by a Data Holder is calculated as a ratio of their customer base, e.g. 15 TPS per 100k customers (illustrated in the sketch below).
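
A minimal illustration of how proposal 2 would scale, assuming the 15 TPS per 100k ratio; the customer counts below are round illustrative figures, not numbers taken from any particular Data Holder.

```python
# Illustration of proposal 2: unattended TPS scaled to customer base,
# assuming a ratio of 15 TPS per 100k customers. Customer counts are
# illustrative round numbers only.
RATIO_TPS = 15
RATIO_CUSTOMERS = 100_000

def unattended_tps(customer_count: int) -> float:
    return RATIO_TPS * customer_count / RATIO_CUSTOMERS

for label, customers in [("major bank (~18m)", 18_000_000),
                         ("mid-tier bank (~2m)", 2_000_000),
                         ("small mutual (~100k)", 100_000)]:
    print(f"{label}: {unattended_tps(customers):,.0f} TPS")
# major bank (~18m): 2,700 TPS
# mid-tier bank (~2m): 300 TPS
# small mutual (~100k): 15 TPS
```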

Hopefully the issue is clear, please let me know if not and I can elaborate further.

dpostnikov commented 2 years ago

@jimbasiq What's the actual customer use case requiring frequent refresh with complete re-load?

jimbasiq commented 2 years ago

Hi @dpostnikov, we have many partners/customers on our platform serving Australian consumers with services such as PFM (Personal Financial Management) or wealth/investment round-ups. Both are must-have use cases that depend on the consumer data being up to date.

jimbasiq commented 2 years ago

It is also worth mentioning that the current Web Scraping Connections we provide to our Partners/Customers are able to support hundreds of thousands of refreshes in a 24-hour period. It is hard to encourage our Partners to move over to CDR Open Banking connections if doing so means a severe degradation of service capacity.

ShaneDoolanAdatree commented 2 years ago

Adatree supports this request. We've detailed a similar experience in #534. Asynchronous collection of data is an often used pattern with obvious benefits.

As CDR grows, more users mean more requests. Competing priorities will emerge if all refreshes must occur only during a customer-present session. Consumer-facing apps typically have high-traffic periods, so ADRs and data holders can expect huge spikes in traffic during those periods if customer present is the only real option (which is the case right now). Asynchronous collection avoids this by spreading load across a sensible timeframe. Real-time collection is not required in all cases.

It also allows for a cached fallback when a data holder is unavailable during a customer-present session. If the ADH is not available for a real-time call, the latest data presented to the consumer is not stale to the point of being unusable, i.e. the balance or transaction list might have been fetched an hour ago as opposed to 24 hours ago.
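
As a rough illustration of the fallback pattern described here (not Adatree's actual implementation; the class, function names and cache shape below are invented for the sketch):

```python
# Sketch of a cached fallback for customer-present calls. If the live
# Data Holder call fails, serve the last collected snapshot along with
# its age so the UI can show "as at ...". Names are illustrative only.
import time
from typing import Callable, Dict, Tuple

class BalancesWithFallback:
    def __init__(self, fetch_live: Callable[[str], dict]):
        self.fetch_live = fetch_live                      # live DH API call
        self.cache: Dict[str, Tuple[float, dict]] = {}    # account_id -> (timestamp, balances)

    def get(self, account_id: str) -> Tuple[dict, float]:
        try:
            balances = self.fetch_live(account_id)
            self.cache[account_id] = (time.time(), balances)
            return balances, 0.0                          # fresh data, zero age
        except Exception:
            # fall back to the last asynchronously collected snapshot
            # (raises KeyError if nothing has been collected yet)
            fetched_at, balances = self.cache[account_id]
            return balances, time.time() - fetched_at     # age in seconds, e.g. ~3,600 not ~86,400
```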

All of this results in better consumer outcomes regardless of use case by providing consumers with a more resilient CDR ecosystem.

perlboy commented 2 years ago

There are a number of issues raised in this thread that are worth breaking out.

NFR Suitability

The core focus of the original thread is "raising" the NFR thresholds. While this may seem like the right approach, the reality is that it likely isn't: the method of describing the threshold seems inappropriate and penalises small data holders and successful ADRs alike.

Biza.io raised this in DP208 and instead suggested that the NFR be bound to the number of active arrangements at a particular holder. Biza.io also requested usage data to enable an evidence-based decision. That is to say, Holders would gain the benefit of being able to correlate real usage with the requirement, could integrate it into their capacity management planning, and could therefore design solutions that scale 1:1.

This would, by and large, resolve the upper bound problem because the upper bound would be relative to arrangement count: an ADR would get guaranteed throughput per arrangement, and even if the per-arrangement TPS were lower, overall parallelisation could compensate. Additionally, Biza.io outlined a number of implementation patterns we had observed Holders adopting, to give the DSB and broader industry insight into the challenges Holders face when weighing cost against capability. As a nascent ecosystem the CDR has very low utilisation, which makes it quite difficult to justify huge capital expenditure at the smaller end of town.

Despite these suggestions, and in the face of considerable opposition, with the participation of a number of ADRs (RAB, Xero, Intuit) but not those involved in this thread, the DSB bound the NFRs "as-is" with immediate effect. It would appear that the ADRs involved in this thread are now encountering the same challenges others on both the Holder and Recipient side identified.

As a result of this decision, organisations have now made architectural decisions on this basis, and consequently any alteration of the defined NFRs is now likely to carry a long-dated FDO - it would be inappropriate to do otherwise.

Implementation Suitability

There is a reference in the original thread to a "data refresh job". This seems to imply a batch process which essentially reloads a complete data set on a daily basis; in essence, a synchronous interface (a non-batch API) is being used to complete asynchronous activity. This is not only architecturally unsuitable, possibly as a hangover of applying existing collection approaches (i.e. screen scraping) to the CDR, I would also question its appropriateness with respect to recipients' data minimisation obligations. Put another way, why is a full batch run being done against all endpoints rather than requesting (and keeping hot) only data which has been requested by the Consumer themselves?

Nonetheless, assuming there are justified reasons for obtaining all of the data, it seems inappropriate to be doing this even daily. I believe this is the context for the question @dpostnikov posed. Additionally, the scenario used for comparisons was described as "perfect", which seems like a stretch.

Taking the use case given, and assuming unattended behaviour (i.e. the Consumer isn't waiting around):

Calculations

I'll stick with n being the number of accounts. I'll also stick with the AT being part of the threshold, although personally I don't think it should be: there's no reason why an authorisation server can't produce many ATs, and penalising the ADR because a Holder chose a low AT lifespan isn't appropriate. In Biza.io's case we don't include what we consider administrative actions in our traffic thresholds; we only apply thresholds on APIs attached to source systems, which we believe is what the NFR upper bounds are intended to protect.

First Run

This is the absolute worst-case scenario because it involves a completely new Consumer coming on board with zero prior data and retrieving every detail.

- 1 x Access Token
- 1 x List of Accounts
- 1 x List of Balances
- n x Account Details
- n x Transactions (assuming 1,000 tx is enough; in our observations two years' worth of history is less than 10,000 tx, and that would be a very "busy" account. I've followed the OP's idea of 1:1 with accounts)
- 1 x Customer Details

Result: 4 + 2n

Taking the OP's 50 TPS limit, there is a daily total of 4,320,000 API calls available (86,400 seconds × 50 TPS).

n=3: 4,320,000 / (4 + 6) = 432,000 sessions per day

n=5: 4,320,000 / (4 + 10) = 308,571 sessions per day

🥳 Huzzah, the numbers align with the OP, but what's important here is that this represents the absolute worst case of doing a full load of all data in the background every day. I disagree with the statement that the "real scenario could be 5-10 times worse", because separate partners should have separate software products, but maybe I'm not following something.

Incremental Detail Calculation

Let's now assume we want to maintain the same level of detail but optimise, and that we have all detail scopes. We don't need to call list accounts because list balances will give us account identifiers, and the account details endpoint provides the same information.

- 1 x Access Token
- 1 x List of Balances
- n x Account Details
- n x Transactions (very likely to be acceptable, and quite possibly high performance, if checkpointing is used)
- 1 x Customer Details

Result: 3 + 2n

n=3: 4,320,000 / (3 + 6) = 480,000 sessions per day

n=5: 4,320,000 / (3 + 10) = 332,308 sessions per day

No Detail Calculation

Let's assume that after the first run, or because we haven't been granted detail scopes, we have no detail at all. This appears to be most aligned with a pure PFM use case, especially if the Recipient has aligned its use of the PRD data with productName and Holders are aligning it too, because much of the account-specific detail can then be derived.

I've still left the list of accounts in here, but this could be stripped further or called less than once a day, as the list of balances contains the accountId needed for the transactions call anyway.

- 1 x Access Token
- 1 x List of Balances
- 1 x List of Accounts
- n x Transactions (very likely to be acceptable if checkpointing is used)

Result: 3 + n

n=3: 4,320,000 / (3 + 3) = 720,000 sessions per day

n=5: 4,320,000 / (3 + 5) = 540,000 sessions per day

Eventually Consistent Detail

The reality is that, of the total sample set, very few Consumers will actively engage with an app every day. If they do, it's because of a prompt driven by the value proposition, and possibly this can be enabled by the CDR in a different way (i.e. a shared signal of account changes, etc.). On this basis, being eventually consistent, especially in an unattended scenario, seems appropriate.

On this basis I'll hypothesise, after initial load, the following:

- 1 x Access Token
- 1 x List of Balances
- 0.2n x Account Details
- n x Transactions
- 0.3 x Customer Details

Result: 3 + 1.5n

n=3: 4,320,000 / (3 + 4.5) = 576,000 sessions per day

n=5: 4,320,000 / (3 + 7.5) = 411,428 sessions per day

Eventually Consistent No Detail

Same concept as above, but this time we don't need detail updated continuously. Realistically, updating detail could occur as a Consumer-present call.

On this basis I'll hypothesise, after initial load, the following:

- 1 x Access Token
- 1 x List of Balances
- 0.3 x List of Accounts
- 0.66n x Transactions

Result: 2.3 + 0.66n

n=3: 4,320,000 / (2.3 + 1.98) = 1,009,345 sessions per day

n=5: 4,320,000 / (2.3 + 3.3) = 771,428 sessions per day
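
For anyone who wants to reproduce or tweak the scenarios above, here is the whole set as one small script. A sketch only: it assumes the 50 TPS daily budget and the per-session request mixes hypothesised above, and its output matches the hand calculations to within rounding.

```python
# Self-contained recap of the scenarios above: a daily budget of
# 86,400 s * 50 TPS, divided by the hypothesised per-session request mix.
DAILY_CALL_BUDGET = 86_400 * 50  # 4,320,000 calls

def sessions_per_day(requests_per_session: float) -> int:
    return int(DAILY_CALL_BUDGET / requests_per_session)

# requests per session as a function of n (number of accounts)
scenarios = {
    "First run":                        lambda n: 4 + 2 * n,
    "Incremental detail":               lambda n: 3 + 2 * n,
    "No detail":                        lambda n: 3 + n,
    "Eventually consistent detail":     lambda n: 3 + 1.5 * n,
    "Eventually consistent, no detail": lambda n: 2.3 + 0.66 * n,
}

for name, per_session in scenarios.items():
    for n in (3, 5):
        print(f"{name}, n={n}: {sessions_per_day(per_session(n)):,} sessions per day")
```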

Suitability

Without real usage data from the ecosystem it is difficult to assess what is "not enough", but suffice to say some basic optimisation appears to double the upper bound. In 2020 Frollo had 100,000 customers and represented 90% of the utilisation. It's unclear whether demand has 10x'd in two years, hence the desire for usage data to inform the decision.

Alternatives

To me the NFR discussion seems to be more symptomatic of a broader set of problems including:

  1. Recipient implementations designed for historical batch retrieval being used in a live API context
  2. Lack of a batch job lodgement and retrieval capability. This seems likely to become much more relevant in the Energy context, because some C&I customers have literally thousands of accounts, impacting the overall scalability of a poll-based system.
  3. Lack of an event signalling mechanism. The DSB has mooted the use of SSE for this, but it's a long way off (a hypothetical sketch of what this might look like follows below).
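
Purely to illustrate what an event signalling mechanism could eventually enable; the endpoint, event shape and helper names below are entirely hypothetical, as no such CDR feed exists today.

```python
# Hypothetical sketch only: consuming a Server-Sent Events change feed so
# an ADR refreshes just the accounts that changed, instead of polling all
# of them daily. The URL and event payload are invented for illustration.
import json
import requests

def watch_account_changes(feed_url: str, access_token: str):
    response = requests.get(
        feed_url,
        stream=True,
        headers={"Authorization": f"Bearer {access_token}",
                 "Accept": "text/event-stream"},
    )
    for line in response.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            # e.g. {"accountId": "abc123", "change": "transactions"}
            yield json.loads(line[len("data:"):].strip())

# Usage (hypothetical):
# for event in watch_account_changes("https://dh.example/events", token):
#     refresh_account(event["accountId"])   # targeted refresh instead of a full batch
```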

Overall, my concern with simply increasing the NFRs is that it patches over features of the CDR that aren't yet present. This combines with the need for recipients, many of which have come from a batch-based bank feed or cache-based screen scraping environment, to change mindset and build solutions which align with best practices in a CDR context rather than wedging the CDR into existing approaches.

Put another way, it seems a higher power-to-weight ratio to focus on feature capability to resolve the underlying problem, versus forcing ever higher performance requirements that will simply be revisited over and over again.

dpostnikov commented 2 years ago

> Nonetheless, assuming there are justified reasons for obtaining all of the data, it seems inappropriate to be doing this even daily. I believe this is the context for the question @dpostnikov posed. Additionally, the scenario used for comparisons was described as "perfect", which seems like a stretch.

Exactly, @perlboy, you get me.

"PFM" can be designed in so many ways, more efficient or less efficient ways.

Unnecessary calls aside, I agree there is definitely a scalability issue with the current design (both the CDR framework and, as a result, data recipient designs).

The way to solve this problem is not to get a bigger hammer or build a bigger pipe (e.g. replicating a batch design via APIs or increasing thresholds). A secure event notification mechanism is missing and should probably be prioritised to solve for these use cases.

ShaneDoolanAdatree commented 2 years ago

@perlboy absolutely fair point on our lack of participation on this topic before now. An arrangement-based approach makes sense, so as not to require all implementers to provision excess capacity "just in case" when the practical reality is that throughput thresholds will only really be tested with the majors. A reasonable FDO is also not something we'd complain about given this feedback comes after the fact, but it is feedback based on metrics rather than theory, so I would hope it is considered valuable even at this stage.

RobHale-Truelayer commented 2 years ago

Just to chime in with another dimension: that of the DH customer profile. Not all DHs are equal, even within a single industry or industry vertical. Some banks focus on lending rather than transactional accounts. Loan accounts have low transaction volumes - typically a monthly interest charge and perhaps one or two monthly payments; sometimes there might be redraws or deposits, but these aren't typical. A transaction account, by contrast, might have 50 or more times this volume. Profiling DHs ahead of imposing NFRs might be beneficial. The activity around non-bank lender participation is a case in point: personal loans would fit into the low-volatility category. Combined with comparatively low customer volumes, it seems inappropriate to impose the same NFR thresholds on both categories of DH.

jimbasiq commented 2 years ago

Hi All,

It is great to see we seem to have a general consensus on the current traffic rates being inadequate.

I look forward to discussing with you all on Wednesday what would be adequate, and how the rate NFRs could differ depending on industry vertical, DH size (members, loan book, other) and other factors. With my ADR hat on I'd like to see a fair usage rate; with my DH hat on I'd like not to cripple the little guys with unfair obligations.

Let's not forget the lack of penalties and the caveat of "best efforts". Both could damage businesses IMO.

CDR-API-Stream commented 2 years ago

A Decision Proposal is required. #92 DSB Item - Reassess Non Functional Requirements has been added to the DSB's future-plan backlog.

CDR-API-Stream commented 2 years ago

Closing as this issue will be considered as a Decision Proposal, see comment above.