Data Flow inference bug

mike1813 commented 7 months ago

There is a slight issue with the current data flow inference sequence (in release v6a3-2-2). It works as follows:

Find all possible paths Data could take between communicating Processes, starting from Processes that create or serve the Data, and reaching any Process with a chain of Process-Process communication links to a starting Process
Find all possible channels = paths connecting Processes that create or serve the Data to Processes that consume the Data
Find which of channels are consistent with the data lifecycle by tagging them as 'active' if they start at a Process that creates or consumes Data, or follow on from such a channel.

The problem seems to be in step 3, which leads to some channels being activated that are not strictly needed. Attached is a test case for this, which if validated using release v6a3-2-2 acquires some data flows between C1-S4 and S3-S4 that appear to be redundant.

This is not a huge problem because no valid threat will be overlooked. It just means extra (possibly unrealistic) threats are created that affect these extra data flows.

This may cause some problems if we add data Fields to our models (e.g., to address #39). At that point, it will be necessary to work out which Fields are present in Data Flows, which will have to be done using dependencies between Data Flow and stored Data Copy assets. Having some extra (possibly unrealistic) Data Flows may cause extra dependencies, such that Fields are assumed to be present in some other (realistic) Data Flows when this is not really the case.

In principle, this should also lead to extra (possibly unrealistic) threats being created, rather than valid threats being missed. However, it may still cause a problem if a Data asset has multiple related fields, and the number of extra threats then becomes too large.

This should be addressed if/when we can, either by improving Data Flow inference patterns, or by integrating the analysis of which Fields are included so it doesn't depend on any spurious additional Data Flows.

DataFlow-Test-09s-PlusS-Modified - asserted.nq.gz

mike1813 commented 5 months ago

Various fixes made that seem to solve the problem in more cases than previously. This is definitely an improvement on what we had before, but still not a complete solution. It looks like there is no way to achieve that without some extra user assertions.

Here are the test cases used at this point:

DataFlow-Test-09s-PlusS-Modified - asserted.nq.gz: the original test that led to this issue being created.
DataFlow-Test-10s - Asserted.nq.gz: a more extreme example with two viable paths, the shorter one being used by default.
DataFlow-Test-10s-PlusS - Asserted.nq.gz: uses multiple relays to direct the flow via the longer path, which led to backward flows between the two intermediate relays.
DataFlow-Test-10s-PlusR - Asserted.nq.gz: the same case again, but this time with 'transparent' relays using the enhancements described in issue #89.
Check Collocated Process Comms 2 asserted.nq.gz: an example in which a process updates locally stored data, but with a remote interactive user.
Check Collocated Process Comms 1 asserted.nq.gz: the same case with a naughty attempt to make remote access go via a reverse proxy.

There are limits to what can be achieved. Branch 105 seems to give the right outcomes for all these except the last case. That runs into the problem that the path from the remote terminal to the data goes via the website, not the text editor. This then produces a modelling error, because of course the remote terminal shouldn't give interactive access to a non-interactive service.

However, the data flow patterns could not cope with this - two clients that in principle use separate paths start by going through a common process (here a reverse proxy). Some improvements in remote access inference could fix that, but the data flow would still be wrong - the shortest path from the proxy would still be favoured for both clients. There is no way (even by asserting relays) that the clients could use different paths in this case, even in principle.

mike1813 commented 2 months ago

There are still some ongoing issues with the data lifecycle construction sequence. In the attached model, we have:

a Sensor that senses Data (i.e., the onboard SensorProcess creates the Data),
an ApplicationProcess that polls the Sensor (i.e., the ApplicationProcess receives the Data),
a DataService that serves the Data, collocated with and used by the ApplicationProcess.

The data lifecycle in this case should be as follows:

the SensorProcess creates the Data (represented by a DataOutput asset),
the ApplicationProcess receives the Data (represented by a DataInput asset),
the DataService serves the Data (represented by a StoredDataPool asset),
the Data flows from the SensorProcess to the ApplicationProcess where it may be used as input,
the Data is forwarded by the ApplicationProcess to the DataService where it is stored,
the Data flows from the DataService to the ApplicationProcess if previously stored data is needed as input.

Because the SensorProcess 'creates' rather than 'appends' the Data, one might surmise that previously stored data is overwritten, making the last step unnecessary. This begs a separate question, as discussed in #118 - in practice the data would not normally be overwritten, so the ApplicationProcess may or may not need to access it.

The reverse flow of previously stored data would give rise to some threats, so on the principle that inference assumptions should overestimate rather than underestimate risks, this flow should be assumed to occur. (Users could then use a 'disablement' control at this data flow to prevent those threats where the ApplicationProcess only stores data and never requests stored data again).

With the current domain model (v6a5-1-1), the flow of data from the SensorProcess to the Data Service via the ApplicationProcess is not generated. As a result, the reverse flow to the ApplicationProcess is also not generated. The construction sequence then fails to the find the flow from the SensorProcess to the ApplicationProcess. The reason this happens is because the construction pattern assumes data should flow via the stored copy, given that the ApplicationProcess does use the DataService.

Note that if the ApplicationProcess did not use the DataService, the construction sequence should conclude that the stored copy of the data is written to (and possibly also read by) the ApplicationProcess. The DataService then only serves other processes that use the data. However, the existence of the DataService still causes problems for the construction pattern that should detect the flow of data between the SensorProcess and ApplicationProcess.

mike1813 commented 2 months ago

This problem was investigated using two further test cases covering the scenarios described above:

Testing_SensorStoredData - Asserted.nq.gz: a Sensor senses asserted data that is polled by a remote application process and stored on its host using a collocated data service.
Testing_SensorStoredData_2 - Asserted.nq.gz: as above, but the polling process does not use the data service and instead saves the data directly on its host.

The original problem from the first test case is a failure to infer the data flows from the sensor process to the application process and from the application process to the data service. This caused by construction patterns RA-fDCSDD+ch, DO-fDCSDD+ch and DADSDCSDD-f+c ,which deduce which data channels carry data flows from data producer processes to a data consumer or data storage service, The first two patterns find data channels connecting the a producer process to an immediate but not necessarily ultimate destination (which in this test case is the data service that stores the produced data). The third pattern is iterative, and finds data channels from a previously reached process (not the data producer) to a further destination along the same path. The channels are tagged as carrying data by an extra constructed relationship, which is used in a subsequent pattern that adds the data flow along the data channel.

The error in these patterns is the use of an 'isPool' relationship to determine whether the far end of a channel is a destination process. This detects channels to a data relay, a data service, or a data processor that acts like a data service. However, in the first test case the immediate destination is a data processor that acts like a relay rather than a service, even though the data is stored on its host. It lacks the 'isPool' relationship, so the channel is missed by DO-fDCSDD+ch. The subsequent channel from there to the data service is then also missed by DADSDCSDD-f+c. In these patterns 'isPool' should be 'isDest', which marks the process as any destination (not necessarily an ultimate destination).

This is not enough to detect data flows correctly in the second test case. There, we should get a flow from the sensor process to the application process, but because neither communicates with the data service, there is no onward flow to the data service. In this test case, the situation is more complex, because channels to a consumer from a data source should only carry data if they connect (directly or indirectly) to an ultimate data source. Patterns RA-fDCSDS+ch, DI-fDCSDS+ch and DD-fDCSDS+ch only tag data channels as potentially needed, and this is upgraded by a further iterative pattern DSDAfDD-fS+fS where channels reach an ultimate data source.

This sequence doesn't work properly because:

'isPool' rather than 'isSource' is used in construction patterns RA-fDCSDS+ch, DI-fDCSDS+ch and DD-fDCSDS+ch, the second of which then misses the channel from the sensor process to the application process,
when data flows go backwards along a data channel, there is no mechanism for adding a self-referential link to the ultimate data source, causing them to be missed by the upgrading pattern DSDAfDD-fS+fS, and
the ultimate data source asset having not been created as a 'pool' asset, the serialization/deserialization relationships between the application process and the stored data would not be generated correctly by pattern HPDFDI-fDS-DP+f.

It isn't possible for to change construction patterns such that the sensor process here would be created as an OutputPool since the criteria for detecting this is appropriate are not accessible until much later in the construction sequence. The best solution is to replace 'isPool' by 'isSource' in patterns RA-fDCSDS+ch, DI-fDCSDS+ch and DD-fDCSDS+ch, then add a separate pattern to add a self-referential link to an ultimate data source once that is known. Finally, pattern HPDFDI-fDS-DP+f was replaced with a new pattern that doesn't depend on whether the application process is classed as an output pool.

mike1813 commented 2 months ago

Further investigation revealed that many problems are caused when two data flow paths (chains of sequence of process-process communications that could carry a data flow) overlap. These paths are created by starting at the end, by finding each process that may provide access to a stored copy of data, or produce data as an output. Originally, each path end was specific to a particular data asset but now they depend only on the target process, being used for any data that could be accessed via the process.

Each longer path is then created by taking an existing path where the last process communicates with another process such that they may exchange data (i.e., the communication must be for application purposes not for authentication/authorization only), and the new process is not on the existing path. The new path is created by pre-pending the new process.

The existence of a data path does not mean data flows along that path - it just means that the processes could send data in either direction along that path. To find out which paths are actually used, one creates data channels (each specific to one data asset) by starting at producer/consumer processes and iteratively following the data paths until one find a data destination/source. Channels that link sources to destination are then 'activated', again using an iterative process, ensuring if data may be forwarded from a first destination to subsequent destinations (including by processes whose only role is to relay the data), these paths will be activated.

Where this goes wrong is when a data path exists to one process, and this process is pre-pended to a path to some other process that may be used by the same data. The second path is then further extended, so that subpaths of the first path are also subpaths of the second path. The problem is that when data channels are created and activated, these subpath relationships are used to test which channels correspond to flows via one destination to another. Being a subpath of two paths at once causes this test to misfire and channels get activated that should not be activated.

This problem has now been fixed.

It is still possible to create system models in which inappropriate data flows may be created. However, it is now less likely, and there are no cases where it is known that appropriate data flows are also missed. That being the case, one can use controls to disable the unwanted data flows, if they can't be eliminated by adding 'relays' relationships between intermediary processes and the data.

On that basis, we now will be able to close this issue when branch 40 is merged into 6a.

mike1813 commented 2 months ago

There are still some known problems with data lifecycle inference, but these are covered in separate issues, e.g., #119, #109, #107.

mike1813 commented 1 month ago

Issue #109 has now also been addressed.

This leaves #119 (which is only exercised in a few quite rare cases) and #107 (which is of more importance, but requires changes that will affect backward compatibility with system models). Updates are worth having even with these issues still pending, so the pull request has now been raised to bring branch 40 into branch 6a.

mike1813 commented 1 month ago

A further issue #128 was found during review of the pull request, which must also be addressed before branch 40 can be merged.

mike1813 commented 3 weeks ago

Issues #128 including #129 and #122 now addressed in branch 40. Residual issues with data flow inference ambiguity now moved to a separate issue #131.

Spyderisk / domain-network

Data Flow inference bug #40