ConsumerDataStandardsAustralia / standards-maintenance

This repository houses the interactions, consultations and work management to support the maintenance of baselined components of the Consumer Data Right API Standards and Information Security profile.

1.13.0 appears to have broken pseudonymity of Pairwise Identifiers #480

Open perlboy opened 2 years ago

perlboy commented 2 years ago

Description

In 1.13.0 ID Permanence statements were modified converting "data recipients" to "Data Recipient Software Products":

IDs MUST be immutable across sessions and consents but MUST NOT be transferable across Data Recipient Software Products. For example, Data Recipient Software Product "A" obtaining an account ID would get a different result from Data Recipient Software Product "B" obtaining the ID for the same account even if the same customer authorised the access. Under this constraint IDs cannot be usefully transferred between client organisations or data holders.

In addition a statement was added, without the commit message being added to any type of published changelog (https://github.com/ConsumerDataStandardsAustralia/standards/commit/1dbba1b3e3d896ca57baff319d6f79ca60be819b) regarding sector_identifier_uri to Identifiers and Subject Types:

The Data Holder MUST support the sector_identifier_uri in PPID generation according to [OIDC] if this field was supplied by the client during registration.

The referenced OIDC statement includes the following:

When a sector_identifier_uri is provided, the host component of that URL is used as the Sector Identifier for the pairwise identifier calculation.

This fact had been previously outlined in a knowledge base article: https://cdr-support.zendesk.com/hc/en-us/articles/360004324176-Data-Holder-sector-identifier-uri-Support

The net effect is that if software products use a sector_identifier_uri containing a common hostname (for instance, a CDR provider's CDN serving multiple software products), the Pairwise Identifier would be the same across those products. As there is no explicit requirement that the sector_identifier_uri hostname MUST be different per Software Product, it can be impossible to simultaneously comply with the statements in ID Permanence and those in Identifiers and Subject Types.
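As a rough illustration of why a shared hostname collapses the identifiers, the sketch below uses one of the example calculations given in OIDC Core section 8.1, sub = SHA-256(sector_identifier || local_account_id || salt). The function names, URLs, customer ID and salt are hypothetical, and a Data Holder may well use a different algorithm.

```python
import hashlib
from urllib.parse import urlsplit


def sector_identifier(sector_identifier_uri: str) -> str:
    """Return the host component of the URI, which OIDC uses as the Sector Identifier."""
    host = urlsplit(sector_identifier_uri).hostname
    if host is None:
        raise ValueError("sector_identifier_uri has no host component")
    return host


def pairwise_sub(sector_identifier_uri: str, local_subject_id: str, salt: bytes) -> str:
    """Compute SHA-256(sector_identifier || local_subject_id || salt), hex encoded."""
    digest = hashlib.sha256()
    digest.update(sector_identifier(sector_identifier_uri).encode("utf-8"))
    digest.update(local_subject_id.encode("utf-8"))
    digest.update(salt)
    return digest.hexdigest()


SALT = b"data-holder-secret-salt"  # hypothetical value held by the Data Holder

# Two software products whose sector_identifier_uri values differ only in path.
product_a = "https://cdn.provider.example/product-a/redirects.json"
product_b = "https://cdn.provider.example/product-b/redirects.json"

sub_a = pairwise_sub(product_a, "customer-123", SALT)
sub_b = pairwise_sub(product_b, "customer-123", SALT)
print(sub_a == sub_b)  # True: only the host feeds the calculation
```

Because the path is discarded, the two products receive the same sub for the same customer, which is the conflict with the ID Permanence wording described above.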

This is potentially the first "likely unintended technical consequences" of the ACCC decision to ignore the entity model for Sponsored Data Recipients (https://github.com/ConsumerDataStandardsAustralia/standards-maintenance/issues/427) as there can now be two data recipients (masquerading as two software products) receiving the same pairwise identifiers which would be entirely OIDC compliant and seemingly Standards compliant too.

Area Affected

Change Proposed

1) Mandate that the sector_identifier_uri hostname be different per software product with immediate effect, and enforce this restriction at the Register level; or
2) Specify that the PPID is per software product and remove the sector_identifier_uri logic, reversing previous guidance and changes forced by CTS (i.e. CTS 3.2 validates based on sector_identifier_uri, so all installations will be impacted); or
3) Do nothing and leave the ambiguity up to implementers to decide.
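To make the first option concrete, a Register-level check might look something like the sketch below. It is only an assumption about how such a rule could be enforced; the function name and the shape of the registration data are hypothetical and do not reflect any actual Register API.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlsplit


def find_shared_sector_hosts(registrations: Dict[str, str]) -> Dict[str, List[str]]:
    """Return each sector_identifier_uri host used by more than one software product.

    `registrations` maps a software product ID to its sector_identifier_uri.
    """
    products_by_host: Dict[str, List[str]] = defaultdict(list)
    for software_product_id, uri in registrations.items():
        host = urlsplit(uri).hostname  # hostname is returned lower-cased
        products_by_host[host].append(software_product_id)
    return {
        host: sorted(ids)
        for host, ids in products_by_host.items()
        if len(ids) > 1
    }


registrations = {
    "product-a": "https://cdn.provider.example/a/redirects.json",
    "product-b": "https://CDN.provider.example/b/redirects.json",
    "product-c": "https://product-c.example/redirects.json",
}
violations = find_shared_sector_hosts(registrations)
print(violations)  # {'cdn.provider.example': ['product-a', 'product-b']}
```

A registration would be rejected (or flagged) whenever its host already appears against another software product.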

CDR-API-Stream commented 2 years ago

This issue has been flagged as a candidate for MI 11

CDR-API-Stream commented 2 years ago

This issue has raised a variety of distinct issues that clearly need resolution.  After internal discussion, we would like to propose the following questions and commentary to help discussion towards changes to the standards that would resolve the situation.

Firstly, it is probably wise to articulate some of the intent behind aspects of the standards related to this topic.

It has always been the intent of the DSB that the sub value, or PPID, for the same customer would be different for different software products.  This intent arises from the desire to avoid a situation where a single organisation with many software products (for instance, an ADR sponsor) would be able to passively map the behaviour of a single individual across many software products.

The introduction of the sector_identifier_uri was to allow for the redirect_uri for a software product to change over time without breaking all of the existing PPIDs.  This is aligned with the purpose of it being introduced into the normative standards.
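The mechanism can be sketched as follows, based on the OIDC rule that the Sector Identifier falls back to the host of the registered redirect_uri when no sector_identifier_uri is supplied. The function name and URLs are hypothetical.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit


def effective_sector_identifier(
    redirect_uris: Iterable[str],
    sector_identifier_uri: Optional[str] = None,
) -> str:
    """Pick the Sector Identifier that feeds the pairwise identifier calculation."""
    if sector_identifier_uri is not None:
        # When supplied, only the host of sector_identifier_uri matters.
        return urlsplit(sector_identifier_uri).hostname
    hosts = {urlsplit(uri).hostname for uri in redirect_uris}
    if len(hosts) != 1:
        # OIDC requires a sector_identifier_uri when redirect hosts differ.
        raise ValueError("multiple redirect hosts require a sector_identifier_uri")
    return hosts.pop()


# Without sector_identifier_uri, moving the redirect host changes the sector.
before = effective_sector_identifier(["https://app.example/callback"])
after = effective_sector_identifier(["https://new-app.example/callback"])
print(before, after)  # app.example new-app.example

# With sector_identifier_uri, the sector is stable across the same move.
SECTOR_URI = "https://ids.example/redirects.json"
stable_before = effective_sector_identifier(["https://app.example/callback"], SECTOR_URI)
stable_after = effective_sector_identifier(["https://new-app.example/callback"], SECTOR_URI)
print(stable_before == stable_after)  # True
```

Since the PPID is derived from the sector, a stable sector means existing PPIDs survive a change of redirect_uri.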

It is always the preference of the DSB to lean on recognised, normative, standards wherever possible so that standard vendor implementations can be utilised as much as possible by participants.

In this context we are therefore framing our response to this issue according to the following questions:

1. Should the PPID be subject to the ID Permanence rules outlined in the standards?

The standards never state that the PPID should align with the ID permanence rules. Only the normative standards are referred to. It is understandable that people reading the standards could infer their application, however. In reality, the ID permanence rules were developed to govern IDs included in data payloads at a time when the infosec profile was published separately to the API standards.

One potential option to avoid further confusion would be to explicitly call out that the PPID does not need to comply with the ID permanence rules and that only the specific statements in the infosec profile and in the normative standards apply.

Feedback on whether this would be a good idea or would introduce other side effects would be helpful.

2. Should we restrict the sector_identifier_uri hostname to be unique per software product?

This restriction is suggested by the CR and would address the issue of the same PPID being used across multiple software products.

As an alternative, the restriction could be applied only to the entire value of sector_identifier_uri.  This would address the issue of allowing a unique set of redirect_uris per software product but would not result in unique PPIDs per software product.

It should be noted that adopting this change would possibly impact the consents of existing ADRs that would need to change their existing sector_identifier_uri values to achieve this.  This would be a negative outcome and we would appreciate feedback on potential ways to avoid this outcome.

3. Should we state that redirect URIs should not be shared across multiple software products whether in direct metadata or via the contents of the file pointed to by the sector_identifier_uri?

It has been noted that some ADRs are using the exact same sector_identifier_uri, resulting in shared redirect URI lists across multiple software products. Effectively, redirect URIs are being presented in a single master list across multiple software products. This is definitely not seen as a desirable approach and it is certainly not an endorsed approach.

We would welcome feedback as to whether we should explicitly restrict this model in the standards or advise against it via guidance.

perlboy commented 2 years ago

Firstly, it is probably wise to articulate some of the intent behind aspects of the standards related to this topic. It has always been the intent of the DSB that the sub value, or PPID, for the same customer would be different for different software products.

Intent is nothing more than a bureaucratic excuse to poorly define something. The Standards bind themselves at Line 8 to RFC2119 and in the absence of a technical adjudication body (that doesn't exist unless we actually want to believe the Regulator is capable) this is all that matters.

This intent arises from the desire to avoid a situation where a single organisation with many software products (for instance, an ADR sponsor) would be able to passively map the behaviour of a single individual across many software products.

A Sponsoring ADR does not implicitly have many software products because it is an ADR sponsor. It sponsors ADRs which have their own software products. The only reason software products belonging to Sponsored ADRs technically live under Unrestricted ADRs is because of Register constraints that were previously highlighted and hitching the wagon to a poor technical execution by the Regulator is fraught with danger.

Legally the software product belongs to the Sponsored ADR with some indemnification provided by the Unrestricted ADR.

The converse here is an Unrestricted ADR with multiple software products of its own. Given that a data environment accreditation is across the entire CDR environment and organisation, it seems logical to expect that CDR activities would be compartmentalised into a single CDR data store. Allowing an organisation to share infrastructure between its software products makes a lot of sense, and this applies equally to the inbound base URLs (sector_identifier_uri) and the identifiers inside data stores. This technical reasoning matters because many of the ID Permanence identifiers are computed, sometimes resulting in very long identifiers (256-bit+). Over billions of transactions this has a meaningful impact on data recipient systems and scale.

With appropriate controls (and proper Register separation) of Sponsored ADRs I do not see the reasoning for preventing an ADR with multiple software products from sharing identifiers uniformly, because the legal construct in which they are operating covers the entire technology environment.

The introduction of the sector_identifier_uri was to allow for the redirect_uri for a software product to change over time without breaking all of the existing PPIDs.  This is aligned with the purpose of it being introduced into the normative standards.

This statement is in direct conflict with the "intent" stated above.

It is always the preference of the DSB to lean on recognised, normative, standards wherever possible so that standard vendor implementations can be utilised as much as possible by participants.

Except where the DSB "intended" to retrospectively change that preference.

In this context we are therefore framing our response to this issue according to the following questions: 1. Should the PPID be subject to the ID Permanence rules outlined in the standards? The standards never state that the PPID should align with the ID permanence rules.  Only the normative standards are referred to.  It is understandable that people reading the standards could infer their application, however.  In reality, the ID permanence rules were developed to govern IDs included in data payloads at a time when the infosec profile was published separately to the API standards.

Until 1.12.0 the term "data recipient" was used interchangeably to represent a Data Recipient and a Software Product. In fact, in the CDR Federation section alone it was swapped multiple times:

(ADR Context) A Data Recipient MUST be accredited in order to participate in the CDR Federation. Accreditation rules for Data Recipients are beyond the scope of this artifact.
(Software Product Context) A Data Recipient assumes the role of an [OIDC] Relying Party (Client).
(ADR then Software Product Context) For the purposes of this standard a single accredited organisation may be represented via the Register as multiple separate Data Recipients to support multiple applications or services.

Additionally in the current Standards the:

While the clarity provided by consistently using "Data Recipient Software Product" has long term benefits, it was only announced in a changelog and not consulted on. Had it been consulted on, participants could have raised the discrepancy. Again the DSB's change process continues to demonstrate its deficiencies and there is a wide variety of ambiguity in the terms used.

One potential option to avoid further confusion would be to explicitly call out that the PPID does not need to comply with the ID permanence rules and that only the specific statements in the infosec profile and in the normative standards apply.

The current Standards state the following:

IDs SHOULD be unique but that uniqueness may be within a clearly bounded context. For example, a beneficiary ID may be unique but only in the context of a specific account. The bounds of uniqueness should be clearly described in the standards definition for the end point. IDs MUST be immutable across sessions and consents but MUST NOT be transferable across Data Recipient Software Products.

Considered in the context of the ID Permanence section not being the same as PPIDs:

  1. The Standards don't clearly describe the identifiers for the end point, rather nearly all identifiers provide a circular reference back to this statement
  2. Consents are not Arrangements (in fact the rules talk about authorisations). This is confused in the Standards in various ways for example:

    (Consent separate from Arrangement) Data Holder MUST revoke any existing tokens related to the arrangement once the new consent is successfully established
    (Consent separate from Arrangement) Data Recipient Software Products MAY provide an existing cdr_arrangement_id claim in an authorisation request object to establish a new consent under an existing arrangement
    (Consent equal to Arrangement) CX Guidelines Amending authorisation: Authorisation: Amending consent

Strictly speaking it would appear that currently accountId can change per arrangement established which would make life a whole lot easier for Data Holders but problematic for Data Recipients.

Feedback on whether this would be a good idea or would introduce other side effects would be helpful.

I think the key take away here is that the Standards are poorly defining critical terms (or not defining them at all).

From my perspective at least the generation of identifiers should be consistent across the whole Standard. Ie. sub and accountId etc. I don't really mind if it's based on sector_identifier_uri, software_product_id or potentially legal_entity_id (I like this one if things are going to change) but consistency of behaviour would benefit everyone.
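To illustrate what "consistent" could mean in practice, here is a minimal sketch in which sub and accountId are both derived by the same keyed function, with the scoping key chosen once. This is an assumption for discussion only: HMAC-SHA-256 is swapped in as a convenient keyed derivation, and the function name, secret and scope values are hypothetical rather than anything the Standards prescribe.

```python
import hashlib
import hmac


def derive_id(secret: bytes, scope_key: str, id_type: str, internal_id: str) -> str:
    """Derive an external identifier bound to a scope (e.g. sector host or software_product_id).

    The unit separator avoids ambiguity between concatenated fields.
    """
    message = "\x1f".join((scope_key, id_type, internal_id)).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


SECRET = b"data-holder-secret"  # hypothetical key held by the Data Holder

# The same scope key is used for every identifier type issued to a client.
scope = "software-product-1"  # could equally be a sector host or legal_entity_id
sub = derive_id(SECRET, scope, "sub", "customer-123")
account_id = derive_id(SECRET, scope, "accountId", "account-987")

# A different scope yields unrelated identifiers for the same internal records.
other_account_id = derive_id(SECRET, "software-product-2", "accountId", "account-987")
print(account_id != other_account_id)  # True
```

Whichever scoping key is chosen, applying it uniformly means sub and payload identifiers change (or stay stable) together.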

2. Should we restrict the sector_identifier_uri hostname to be unique per software product? This restriction is suggested by the CR and would address the issue of the same PPID being used across multiple software products.

For situations where an Unrestricted ADR is Sponsoring I think it probably should be a separate hostname. This would still allow for an Unrestricted ADR with multiple software products of its own to cut across software products for technical optimisation reasons.

As an alternative, the restriction could be applied only to the entire value of sector_identifier_uri.  This would address the issue of allowing a unique set of redirect_uris per software product but would not result in unique PPIDs per software product.

This would violate OIDC, at which point it begs the question of "why bother aligning". Either the DSB uses the upstream Standard without modifications, or it abandons it for its own definition whereby everything is based on software_product_id or legal_entity_id and sector_identifier_uri is used only to purvey the acceptable redirect URIs. Effectively this would result in an explicit exclusion of OpenID Connect Core 8.1 (Pairwise Identifier Algorithm), with sector_identifier_uri being retained only according to its definition in section 2 (Client Metadata) of OpenID Connect Dynamic Client Registration.
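For reference, the redirect-URI-only role of sector_identifier_uri amounts to the check sketched below: the URL points to a JSON array of redirect_uri values, and the registered redirect_uris must all appear in it. The function name and sample document are hypothetical, and the document content is passed in directly rather than fetched.

```python
import json
from typing import Iterable


def redirect_uris_allowed(registered_redirect_uris: Iterable[str], sector_document: str) -> bool:
    """Check that every registered redirect_uri is listed in the sector identifier document."""
    allowed = json.loads(sector_document)
    if not isinstance(allowed, list):
        raise ValueError("sector identifier document must be a JSON array")
    return set(registered_redirect_uris) <= set(allowed)


# Hypothetical content served at the sector_identifier_uri.
document = json.dumps([
    "https://app.example/callback",
    "https://app.example.net/callback",
])

ok = redirect_uris_allowed(["https://app.example/callback"], document)
rejected = redirect_uris_allowed(["https://evil.example/callback"], document)
print(ok, rejected)  # True False
```

Under this reading the document's hostname would play no part in identifier generation at all.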

It should be noted that adopting this change would possibly impact the consents of existing ADRs that would need to change their existing sector_identifier_uri values to achieve this.  This would be a negative outcome and we would appreciate feedback on potential ways to avoid this outcome.

Potential ways to avoid this in the first place were already presented to the government prior to it having an impact and were dismissed by the Regulator because some timeline had been set (I guess Minister Hume was happy at the time). The pain now seems to be of the (former) Government's creation and maybe next time political expediency won't override technical logic.

Regardless of the resultant change it will impact ADRs. Holders are implementing this space very differently across implementations with very different interpretations and it is now very cloudy because of changes made in the past 3-4 releases of the Standards.

3. Should we state that redirect URIs should not be shared across multiple software products whether in direct metadata or via the contents of the file pointed to by the sector_identifier_uri? It has been noted that some ADRs are using the exact same sector_identifier_uri, resulting in shared redirect URI lists across multiple software products. Effectively, redirect URIs are being presented in a single master list across multiple software products. This is definitely not seen as a desirable approach and it is certainly not an endorsed approach.

This fundamentally comes down to whether a data recipient is permitted to share data stores between software products. From my understanding this is not only permitted but, for cost reasons, desirable. If the PPID/ID Permanence rules were aligned to software_product_id/legal_entity_id the sector_identifier_uri side of things would be purely a technical mechanism for multiple hostname redirect uris but it would mean ADRs with multiple software products of their own would end up with duplicate data in shared data stores - maybe this is ok maybe it isn't 🤷 .

We would welcome feedback as to whether we should explicitly restrict this model in the standards or advise against it via guidance.

It's really difficult to provide guidance when there are so many ambiguous terms in the space and the DSB talking about intent and clarifications.

spikejump commented 2 years ago

This issue is getting long and difficult to track. Below we will outline our thoughts on PPID and sector_identifier_uri based on the original posting.

As an ADR that supports long-lived consent arrangements for our customers, we have designed our solutions to support our use cases leveraging ID permanence and the ability to use the same sector_identifier_uri across different Software Products.

Scenario 1

To start with, it is possible that an ADR can use one or more Outsourced Providers (OSPs) to provide connections to DHs, with all OSPs using the same sector_identifier_uri. Let's use typical product comparison software as an example and call it UberComparator. For the market, there is a single software product called UberComparator. This software does comparison across finance, telco and energy markets. The ADR decides to connect all financial DHs using OSP1, all telco DHs using OSP2 and all energy DHs using OSP3. Why does the business do this? There can be many reasons, e.g. leveraging OSP expertise, giving business to existing partners, driving competition between OSPs, etc.

In this scenario, the ADR registers 3 distinct Software Products with ACCC Registry for the same consumer facing software. One Software Product is named UberComparator Finance Connection, another named UberComparator Telco Connection, and another named UberComparator Energy Connection. Each Software Product has the common sector_identifier_uri. This ensures all PPIDs from all 3 connectors are generated for the same UberComparator application.

This approach allows the ADR to control who to use as an OSP. If the ADR decides that OSP1 is no longer suitable, the ADR can replace OSP1 with OSP4 while still using the same sector_identifier_uri. The outcome of such a change is minimal disruption for the ADR's customers. That is, PPIDs coming from OSP4 will be the same as those collected by OSP1, thereby allowing existing customer data to continue without disruption. Granted, customers will most likely need to provide consent again, but at least their existing data is maintained with continuity.

Another possibility is that there's so much traffic going through OSP1 that OSP1 is not able to cope. So the ADR decides to introduce OSP5 with a new Software Product registered. Again, OSP5 uses the same sector_identifier_uri. The ADR then diverts some traffic from OSP1 to OSP5 while still obtaining all the same IDs.

Scenario 2

Let's say ADR has two software products, one UberComparator and another UberAccounting. ADR registers 2 Software Products with ACCC Registry. Again, same sector_identifier_uri is used.

In this case, one of the benefits for the ADR as well as the DH is less performance overhead. This happens because when the ADR needs to get banking transaction data from the DH, only a single call happens (for both software products). When sector_identifier_uri is different between the 2 Software Products, the ADR can't tell they're the same account and so ends up calling the DH twice. But the data returned will be identical. So the DH gets overloaded with unnecessary API calls and the ADR needs redundant capacity to make the calls.

Another benefit for the ADR is that data for the same PPID needs to be stored once only. This may not be an issue for a simple, one-time use case. But imagine an accounting solution that deals with tax where the data needs to be kept for a minimum of 5 years, and in some cases more than 5. This data adds up.
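The effect described in Scenario 2 can be sketched with a small cache keyed by Data Holder and account ID. Everything here is hypothetical (the class, the stand-in fetch function and the ID values), and the sketch deliberately ignores consent scoping and data expiry, which a real ADR would need to handle.

```python
from typing import Callable, Dict, List, Tuple


class AccountDataCache:
    """Keeps one copy of account data per (data holder, account ID) pair."""

    def __init__(self, fetch: Callable[[str, str], dict]) -> None:
        self._fetch = fetch
        self._store: Dict[Tuple[str, str], dict] = {}

    def get(self, data_holder: str, account_id: str) -> dict:
        key = (data_holder, account_id)
        if key not in self._store:
            # Only reach out to the Data Holder for IDs not seen before.
            self._store[key] = self._fetch(data_holder, account_id)
        return self._store[key]


calls: List[Tuple[str, str]] = []


def fake_fetch(data_holder: str, account_id: str) -> dict:
    """Stand-in for a Data Holder API call; records each invocation."""
    calls.append((data_holder, account_id))
    return {"accountId": account_id, "transactions": []}


# Shared sector identifier: both products see the same account ID.
shared = AccountDataCache(fake_fetch)
shared.get("bank-x", "acc-shared")  # on behalf of UberComparator
shared.get("bank-x", "acc-shared")  # on behalf of UberAccounting
shared_calls = len(calls)

# Distinct sector identifiers: the same account appears under two IDs.
calls.clear()
distinct = AccountDataCache(fake_fetch)
distinct.get("bank-x", "acc-for-comparator")
distinct.get("bank-x", "acc-for-accounting")
distinct_calls = len(calls)

print(shared_calls, distinct_calls)  # 1 2
```

With matching IDs the second product is served from the stored copy; with differing IDs the ADR has no way to recognise the duplicate and both the call and the storage are doubled.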

Final Thoughts

Finally, it is important to note that without PPID, ADRs can't properly match data from one consent arrangement to another. Any slightly sophisticated application using this data will be impacted. Consumers will be the ones ultimately paying the price of no ID permanence.

We are fully supportive of PPID being what it is today - following ID Permanence rules. We are also fully supportive of DHs generating PPIDs based on sector_identifier_uri when it is supplied. The latter is critical in our opinion.

We do not wish to see further control or limitation on this front. In our opinion, further control or limitation is overreaching for the ecosystem and will only stymie innovation.

CDR-API-Stream commented 2 years ago

To help the analysis of this issue, a rules clarification has been provided:


There is no strict prohibition on an ADR sharing CDR data across a range of goods or services they offer.

Sharing between the ADR’s apps would be considered a ‘use’ of the CDR data. It is important to note however, that the ADR’s use of the data must be consistent with the consumer’s consent, and must not be beyond what is reasonably needed in order to provide the requested goods or services. A relevant requirement is the Data Minimisation Principle (DMP) in CDR Rule 1.8, which applies when an ADR is seeking to collect CDR data and when an ADR is using CDR data to provide a good or service to the CDR consumer.

The above does not apply to sharing of data between sponsors and affiliates, and principals and CDR representatives. These relationships are made up of separate legal entities and therefore sharing of data between them involves a disclosure.


The grouping of software products around shared sector identifiers, to obtain consistent identifiers, is not a requirement to facilitate valid data sharing scenarios. Conceivably, data can be shared between multiple software products if the data recipient maintains mappings of identifiers per subject.
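One way to picture such a mapping is the sketch below, where the data recipient links each (software product, PPID) pair to its own internal subject record. The class and identifiers are hypothetical, and it assumes the recipient has some independent way (such as its own customer login) to know that two PPIDs belong to the same person.

```python
from typing import Dict, Optional, Tuple


class SubjectIndex:
    """Maps (software_product_id, ppid) pairs to the recipient's internal subject ID."""

    def __init__(self) -> None:
        self._internal_by_ppid: Dict[Tuple[str, str], str] = {}

    def link(self, software_product_id: str, ppid: str, internal_subject_id: str) -> None:
        """Record that this PPID, as issued to this product, is a known subject."""
        self._internal_by_ppid[(software_product_id, ppid)] = internal_subject_id

    def resolve(self, software_product_id: str, ppid: str) -> Optional[str]:
        """Return the internal subject ID, or None if the pair is unknown."""
        return self._internal_by_ppid.get((software_product_id, ppid))


index = SubjectIndex()
# The same customer holds different PPIDs under each software product.
index.link("product-s1", "ppid-aaa", "customer-42")
index.link("product-s2", "ppid-bbb", "customer-42")

same_subject = index.resolve("product-s1", "ppid-aaa") == index.resolve("product-s2", "ppid-bbb")
print(same_subject)  # True
```

The trade-off is that the recipient carries the mapping itself rather than relying on identifiers lining up across software products.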

The above prompts the following questions:

  1. Can we assume that the consent to disclose data between an ADR's software products is not reflected on the data holder dashboard?
  2. What technical or operational mechanisms, if any, should be considered to ensure sponsors and affiliates, as well as principals and CDR representatives, are not sharing data between software products owned by different legal entities?
  3. Is further guidance required to help ADRs structure their software products for data collection?

spikejump commented 2 years ago

@CDR-API-Stream

Can we assume that data disclosed between an ADR's software products are not reflected on the data holder dashboard?

Not sure what this question is really asking. It reads as if the question is assuming that ADRs will get authorised consent once for one software product and disclose the data to all other software products the ADR owns. This is not what we're saying.

What we were referring to as the "sharing" of data between software products does not involve this kind of disclosure. Customers have to individually provide consent to data sharing with the DH for each software product. The DH will record, for example, two consents from customer X for software products S1 and S2. The optimisations the ADR can make are:

  1. Keep a single copy of the accounts data and disclose the single copy to S1 and S2.
  2. ADR makes one Get Accounts call to the DH for the relevant authorised accounts instead of two (due to the 2 separate consents) during daily refresh.
  3. (Maybe something else?)

Please note that customer X will still see two distinct consents on the ADR's CDR dashboard - not one. The two consents are what customer will see on DH's dashboard as well.

What technical or operational mechanisms, if any, should be considered to ensure sponsors and affiliates, as well as principals and CDR representatives, are not sharing data between software products owned by different legal entities?

We can't comment as we have not looked into this operational model.

(The term "different legal entities" is a bit confusing. DH? Maybe this can be clarified.)

Is further guidance required to help ADRs structure their software products for data collection?

As long as there are no rules violation, we feel how ADRs structure their software products is a private affair.

CDR-API-Stream commented 2 years ago

Further understanding and collaboration is required to understand the problem space and the associated participant behaviour. It can then be determined whether technical or operational mechanisms, if any, are required.   Therefore, this issue is deferred to a future maintenance iteration.

biza-io commented 2 years ago

We note this issue has been added to the intended issues for discussion on MI12. Due to resource constraints associated with multiple future dated obligations and Energy sector activation Biza is unable to provide the further comment, evaluation or elaboration we feel will be necessary to appropriately resolve this issue.

As a consequence we request this issue is deferred until a later maintenance iteration.

CDR-API-Stream commented 2 years ago

At Biza's request this issue has been removed from MI12.

biza-io commented 1 year ago

Our initial thoughts on this problem still stand, and over time we have come to a firm position to recommend Option 2. However, we believe that the CTS currently expects Option 1 while the Register does not enforce the uniqueness constraint.

biza-io commented 1 year ago

As per last guidance at MI, please consider this for a Decision Proposal. Existing implementations appear to either be technically non-compliant or potentially violating Rules expectations.