Refine model of data relays

mike1813 commented 9 months ago

A DataRelay (subclass of DataAccess) represents access to data by a Process that forwards the data without processing it. This is specified in a system model by asserting that a Process serves Data that is not stored locally. The goal (in system-modeller usage terms) is to 'pin' the flow of data so it goes through the relay and not via some shorter path that may exist between communicating processes.

If the relaying Process is of type DB, data requests and data flows are formulated using a query language that the DB processes. In that case, when the Process receives a request, it is assumed to process the incoming query to determine what data or queries should be sent on to the data storage service (which may be another DB). Similarly, when a response (containing data) is received, the DB may need to organise or filter it appropriately as a response to the inbound request query.

What this means is that when a DB serves data that is not stored locally, it acts as a relay between two flows containing the same type of data (i.e., with the same system role - related to the same system Data asset subclass). The two data flows are nevertheless separate, but related by being inbound and outbound flows to the same Process.

If a relay is not of type DB, then it simply forwards data. This makes it equivalent to other (implicit) data intermediaries. The data flow goes through the process - there isn't really a separate data flow on either side.

However, at present, a DataRelay asset is created even for this situation. This is necessary because it is used to infer the path taken by the data, and specifically to ensure the path goes via the relay. This works because the path is split at the relay and generated as two shorter paths, each of which is subsequently converted into a data flow.
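To illustrate the splitting described above, here is a minimal Python sketch. It is not the domain model's actual construction pattern; the function name `split_path_at_relays` and the process names are hypothetical, and a path is assumed to be a simple ordered list of process names.

```python
def split_path_at_relays(path, relays):
    """Split an ordered path of process names into shorter paths at each relay.

    Hypothetical helper for illustration only: 'path' is a list of process
    names from requester to data store, 'relays' is a set of relay names.
    """
    segments = []
    current = [path[0]]
    last = len(path) - 1
    for i, node in enumerate(path[1:], start=1):
        current.append(node)
        # An intermediate relay ends one segment and starts the next.
        if i != last and node in relays:
            segments.append(current)
            current = [node]
    segments.append(current)
    return segments


path = ["Client", "Proxy", "Storage"]
segments = split_path_at_relays(path, relays={"Proxy"})
print(segments)  # [['Client', 'Proxy'], ['Proxy', 'Storage']]
```

Each resulting segment would then become its own data flow, which is why the relay ends up with a separate inbound and outbound flow.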

This means it is then necessary to generate threats representing the propagation of compromised data via the relay. This increases the number of system threats and the length and complexity of threat paths, adding to the size of the fully analysed model and degrading performance for validation, risk calculation, and any system model load/store operation.

If the same Data Flow were both inbound and outbound at the relay Process, these propagation threats would not be needed. Construction patterns and sequences should be modified so that the inbound and outbound data flows are joined together into one end-to-end flow.
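The intended joining can be sketched as follows. This is only an illustration under the assumption that a flow is an ordered list of process names; `join_flows` is a hypothetical name, not part of the domain model.

```python
def join_flows(inbound, outbound):
    """Join two flows that meet at the same relay into one end-to-end flow.

    Hypothetical helper: the last process of 'inbound' must be the first
    process of 'outbound' (the relay), which then appears only once.
    """
    if not inbound or not outbound or inbound[-1] != outbound[0]:
        raise ValueError("flows do not meet at a common relay process")
    return inbound + outbound[1:]


end_to_end = join_flows(["Client", "Proxy"], ["Proxy", "Storage"])
print(end_to_end)  # ['Client', 'Proxy', 'Storage']
```

With a single end-to-end flow there is no boundary at the relay across which compromised data would need a separate propagation threat.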

mike1813 commented 9 months ago

First step: fix the issue with DataPath construction (issue #91).

mike1813 commented 9 months ago

DataPath issue now fixed, including fixes for the DataPath generation and the associated DataChannel generation using DataPaths.

Next step is to join together the data flows either side of a transparent data relay.

mike1813 commented 9 months ago

Looking in more detail, there are some situations where a non-DB process may need to act as an 'active' relay, specifically where the encryption method or key used to protect the data changes as the data flows through the process. This may be needed at an organisational boundary, for example (encrypting data that was sent unencrypted up to the boundary). In such cases, there should be two separate data flows leading to and from the relaying process.

In such cases, the 'serves' relationship is used to create a relay, but this should not be a transparent relay. The proposed changes would make this impossible unless the relaying process was a DB process, which should not be necessary.

The best solution is to create a new Process-Data relationship 'relays' to indicate that the process acts as a transparent relay for the data. This can be used when the Process is not a DB process, or potentially even when it is (to indicate that queries are forwarded without processing).

The only problem is that 'relays' is already used as a construction state in the data flow sequence, so to make this work a different URI must be assigned to that construction state. This doesn't create a backward compatibility issue because the existing 'relays' relationship is inferred and then deleted at the end of the construction phase.

It will mean existing system models will continue to function as they do now, with 'serves' indicating a non-transparent relay. But there will be an option to change this to 'relays' if the process is forwarding data without touching it in any way.
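A rough Python sketch of the proposed semantics, under stated assumptions: `flows_for_path` and the `relation` mapping are hypothetical, a path is an ordered list of process names, and only intermediate processes appear in `relation`. A process asserted with 'serves' splits the path into separate flows, while one asserted with 'relays' is passed through transparently.

```python
def flows_for_path(path, relation):
    """Derive data flows from a path, given each intermediate's relationship.

    Hypothetical illustration: relation maps an intermediate process name
    to 'serves' (non-transparent relay) or 'relays' (transparent relay).
    """
    flows = []
    current = [path[0]]
    last = len(path) - 1
    for i, node in enumerate(path[1:], start=1):
        current.append(node)
        # Only a non-transparent ('serves') intermediate splits the flow.
        if i != last and relation.get(node) == "serves":
            flows.append(current)
            current = [node]
    flows.append(current)
    return flows


path = ["Client", "Gateway", "Proxy", "Storage"]
relation = {"Gateway": "serves", "Proxy": "relays"}
flows = flows_for_path(path, relation)
print(flows)  # [['Client', 'Gateway'], ['Gateway', 'Proxy', 'Storage']]
```

Here the hypothetical Gateway (for example, one that re-encrypts at an organisational boundary) keeps separate inbound and outbound flows, while the Proxy sits inside a single flow.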

mike1813 commented 9 months ago

Now checks out in unfiltered mode, and the changes have been added to branch 65. Still need to test in filtered mode, which will cause some assets and relationships to be purged at the end of the construction phase.

mike1813 commented 9 months ago

Tests in filtered and unfiltered mode match.

mike1813 commented 9 months ago

Regression tests that check corner cases found some residual issues.

mike1813 commented 9 months ago

Fixed in a further update on branch 65.

Spyderisk / domain-network

Refine model of data relays #89