Data Flow Inference Assumptions

As previously discussed in issue #40, it is not always possible to unambiguously determine data flows, given a set of 'Process-uses-Process' and 'Process-handles-Data' relationships. Users can assist the inference logic by inserting 'Process-serves-Data' or 'Process-relays-Data' relationships, but this is not always sufficient especially where Process-uses-Process relationships form cycles.

Four test cases illustrate this:

DataFlow-Test-09s: A service S1 serves data D1, and used by an interactive client C1 and also by a service S4 which accesses S1 via intermediaries S3 and S2. C1 also uses S4.
DataFlow-Test-09s-PlusS: Same as DataFlow-Test-09s, but S2-serves-D1 and S3-serves-D1, making it clear that data can flow to S4 via S2 and S3.
DataFlow-Test-09s-Modified: Same as DataFlow-Test-09s, except D1 is created not consumed by client C1.
DataFlow-Test-09s-PlusS-Modified: Same as DataFlow-Test-09s-PlusS, except D1 is created not consumed by client C1.

The first test case presents several possible options for the flow of D1 from S1 to C1 (direct from S1 to C1, or from S1 via S2, S3 and S4 to C1), and from S1 to S4 (via S2 and S3, or via C1). The basic data flow inference rules assume data takes the shortest path, so data would flow S1-C1 and S1-C1-S4. The second test case adds relationships indicating that data can flow via S2 and S3, and since the paths S1-S2, S2-S3 and S3-S4 are all shorter than S1-C1-S4, the inference rules should deduce that D1 flows S1-C1, and S1-S2-S3-S4 but not S1-C1-S4.

The third test cases poses a different problem. Data must flow from C1 to S1 for storage, but there are two possible paths (direct from C1 to S1, or from C1 via S4, S3 and S2 to S1). S4 can get data from S1 via two routes as before (via S2 and S3, or via C1), but S4 can also get data direct from C1. The last test case adds relationships indicating that data can flow via S2 and S3, which should rule out the path S1-C1-S4, but still leaves two options: C1-S4 (new values created by C1 are sent direct to S4 as well as S1), and S1-S2-S3-S4.

In the first pair of cases, it seems unlikely that data would be forwarded from S1 to S4 via C1 because C1 is an interactive client. An interactive user wouldn't normally be involved in forwarding data between two services. This is built into the construction sequence as a rule, forcing data to go to S4 via S2 and S3.

In the second pair of cases, it seems unlikely that data would be sent direct from C1 to S4, because that would mean the user has to send the same data to both services. However, with the current rules, D1 is inferred to flow direct from C1 to S1 and from C1 to S4, unlikely though this may be. In DataFlow-Test-09s-PlusS-Modified, the relationships S2-serves-D1 and S3-serves-D1 mean data must get to those processes somehow, leading to construction of data flows S4-S3 and S3-S2. Since S2 is acting as a relay, having data D1 reaching but not leaving S2 is interpreted as a modelling error.

The data flow construction sequence does not include rules to inhibit data flows in which users are sending the same data more than once to different destinations. Such rules were not included because, while this would be unusual, it is not unknown. Usually, it is bad design to require data to be entered twice by a user, but most users have experienced systems where this is the case.

Solving this issue is not simple, because there is an ambiguity that cannot be resolved by the model: is the system design so poor that a user must upload the same data to two separate services, or should the data flow take a different, possible (but longer) path?

It seems there are several ways this could be addressed:

Option 1: accept the limitations of the existing construction patterns, and suppress the modelling error for a data relay that doesn't forward the data so it doesn't cause confusion. This means we accept that data flows will not be very realistic in some not uncommon situations, and the user will get no warning of this.

Option 2: same as option 1, but introducing new modelling errors when data flows from a producer to more than one nearest neighbour. The modelling error should ask users to clarify whether the data really is to be sent multiple times, by removing the direct 'uses' relationship with those neighbouring processes, and introducing a separate, (non-interactive) collocated fan-out process between the producer and each neighbour, where the producer 'uses' the fan-out, the fan-out 'uses' each neighbour, and the fan-out 'relays' the data.

The modelling errors could be made sensitive to whether the producer:

is a client of the neighbouring processes (meaning the client must connect multiple times)
does or does not have an interactive user (meaning a human may need to repeat some actions)
does or does not 'enableUserInput' of the data (meaning the user does repeat some actions)

Some situations may then not need a modelling error, e.g., where a non-interactive service produces data and sends it to clients 'on demand' (in response to their requests) - a quite reasonable and commonly used pattern.

The user could obviously disable an inappropriate data flow, and it may be appropriate to allow this as a modelling error control strategy. However, this would not mean alternate data flows will be inferred, so it should only be used if other data flows exist whereby the data can reach all relevant destinations. Disablement triggers a side effect threat which causes loss of availability in the disabled data flow, so it should be evident if this is a problem.

A better solution would be to make the fan-out process specific to the data, so if the process produces distinct outputs, it will need multiple fan-out processes. Then one could leave out 'uses' relationships to some neighbouring processes, thereby preventing data being sent to them via the fan-out. This would force the data flow inference rules to find alternate data flow paths to all necessary destinations.

Option 3: alter the data flow inference sequence, based on new assumptions about which data flows should not be considered. The existing constraint could be retained that interactive processes do not forward data, but other assumptions added just by changing the sequence. For example, we might look for data flows in the following order:

flows from producer processes to data stores, but not through data stores, i.e., producer sends data once to a storage service (unless to multiple, equidistant stores)
flows from those data stores to other stores not yet reached, i.e., changes are synchronised between data stores
flows to consumer processes not yet reached from data stores
flows to consumer processes direct from a producer

This could be done using the existing data exchange model, whereby data can flow in either direction along any application client-service channel, but not through an interactive process. The only change would be in the 'activation' patterns that mark data channels that should carry data flows, so a less realistic channel is not used where a more realistic channel exists.

This change would mean data is fetched from a store by a consumer process if there is a stored copy it can reach, and only from a producer where that is not the case.

Option 4: alter the data flow inference sequence, using new assumptions about where data could flow as well as where it does not flow. This would allow changes to make the presence of data flows sensitive to whether the data is flowing from a client or a service, as well as whether the data is coming from an interactive process.

With options 3 and 4, the risk is that we end up encoding a different set of assumptions that lead to better outcomes in test cases that currently don't work well, but may lead to worse outcomes in other situations. Since we can't test for all possibilities, it may be some time before we know where the new patterns don't work very well. However, for that reason it makes no sense to use option 2 now, and then switch to options 3 or 4 later onIf we need to learn from experience how best to use a new construction sequence, we should make the changes so we can gain that experience sooner rather than later.

Spyderisk / domain-network

Data Flow Inference Assumptions #131