Bug in data lifecycle control inference

mike1813 commented 4 months ago

If a Process serves a locally stored copy of a Data asset (i.e., a Data Copy asset), we get a Stored Data Pool asset associated with the Data and Process, and a Process-enablesAccess-StoredDataPool relationship, meaning the Process controls access to the data. See construction pattern PsDSH-DP+DP.

If other Processes access the same Data, we get DataInput or DataOutput assets associated with their data access. The enablesAccess from the serving Process is then propagated (possibly via other DataAccess assets and communication intermediaries) to any DataInput and DataOutput assets by construction patterns DUDA-eS+eA and DADU-eS+eA.
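As a rough illustration of that propagation (a toy sketch, not the actual construction patterns), the serving Process's enablesAccess can be thought of as spreading from the Stored Data Pool across linked assets until it reaches every DataInput and DataOutput. The `links` map, the asset names and the function name are all hypothetical; in the real model the links come from DUDA-eS+eA and DADU-eS+eA.

```python
from collections import deque


def propagate_enables_access(serving_process, pool, links):
    """Return (process, 'enablesAccess', asset) triples for every reachable asset.

    `links` is a hypothetical adjacency map standing in for the
    relationships that the real construction patterns would follow.
    """
    enabled = {pool}
    queue = deque([pool])
    while queue:
        asset = queue.popleft()
        for nxt in links.get(asset, ()):
            if nxt not in enabled:  # visit each asset once, so cycles terminate
                enabled.add(nxt)
                queue.append(nxt)
    return {(serving_process, "enablesAccess", a) for a in enabled}


# Hypothetical model: P1 serves the pool; P2 and P3 reach it via DataAccess assets.
links = {
    "StoredDataPool": ["DataAccess1"],
    "DataAccess1": ["DataInput_P2", "DataAccess2"],
    "DataAccess2": ["DataOutput_P3"],
}
triples = propagate_enables_access("P1", "StoredDataPool", links)
```

If a link in the chain is missing, the downstream DataInput or DataOutput never receives the relationship, which is what the error threat models rely on.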

The presence of an enablesAccess relationship thereby specifies that there is a process enabling access to the serialized data. This is used in two ways:

to determine which process must implement access controls where these are required for regulatory compliance, used in our GDPR compliance threat models, and
to determine whether there is a process enabling access to data stored remotely, used in modelling error threats to detect when there is no continuous chain of process-process communications whereby input/output at one process can be fulfilled by a data service.

Things get a bit tricky in cases where either (a) there is no explicit Process-serves-Data relationship, or (b) there is no stored copy of the Data. In the former case, the construction sequence looks for a process accessing the stored copy and makes it responsible for enabling access. This covers situations where a Process accesses stored data as input or creates stored data as output, but may also send the data to another (remote) data consumer. In the latter case, the process creating the data is considered responsible for enabling access by any consumer process, since the creator must be sending the data in messages to the consumer, rather than via a stored copy.

There is a bug in the current sequence whereby no process is inferred to be responsible for enabling access if the only stored copy is used by a Process that uses the data as input, and the data is created by a second process. In this case, the process using data as input is the enabler, since it manages access to the stored copy, and determines whether output from the second process should be stored. However, this combination is not picked up correctly in the current sequence.
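The intended outcome can be summarised as a simple priority rule. The sketch below is only a paraphrase of the description above, not the construction sequence itself: the function name and arguments are hypothetical, and picking the first process when several access the stored copy is an assumption made for illustration.

```python
def infer_enabler(serves=None, stored_copy_users=(), creator=None):
    """Pick the process responsible for enabling access to a Data asset.

    serves            -- process with an explicit Process-serves-Data relationship
    stored_copy_users -- processes that access a stored copy (input or output)
    creator           -- process that creates the data
    """
    if serves is not None:
        return serves  # explicit server of the stored copy
    if stored_copy_users:
        # Case (a): no explicit serves relationship, so a process accessing
        # the stored copy is responsible (first one chosen: an assumption).
        return stored_copy_users[0]
    # Case (b): no stored copy, so the creator sends the data in messages.
    return creator


# The combination hit by this bug: the stored copy is used only by a process
# that takes the data as input, and a second process creates the data.
enabler = infer_enabler(stored_copy_users=["Reader"], creator="Writer")
```

Here `enabler` is `"Reader"`, the process using the data as input, whereas the current sequence infers no enabler at all for this combination.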

mike1813 commented 4 months ago

The problem is in construction pattern DSDPS+DC, which is supposed to create an initial data channel from an output (data source) that is then iteratively extended until a data destination is reached. The matching pattern contains a spurious node, which may not always be present, causing the pattern not to be matched and the associated data channel (and other data channels obtained by iterative extension) to be missed.
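To show why one unmatched seed pattern loses a whole family of channels, here is a toy sketch. It is not the DSDPS+DC pattern: the role names, `next_hop` map and `build_channels` function are hypothetical, and the "extra" role simply stands in for the spurious node.

```python
def build_channels(outputs, next_hop, required_roles, roles_present):
    """Seed a channel at each output whose context matches, then extend it."""
    channels = []
    for out in outputs:
        if not required_roles <= roles_present[out]:
            continue  # pattern not matched: no seed, hence no extensions either
        path = [out]
        # Iteratively extend until no further hop exists (or a cycle is hit).
        while path[-1] in next_hop and next_hop[path[-1]] not in path:
            path.append(next_hop[path[-1]])
        channels.append(path)
    return channels


outputs = ["Out1", "Out2"]
next_hop = {"Out1": "Relay", "Relay": "In1", "Out2": "In2"}
roles_present = {
    "Out1": {"source", "process", "extra"},  # the extra node happens to exist here
    "Out2": {"source", "process"},           # ...but not here
}

with_spurious = build_channels(outputs, next_hop, {"source", "process", "extra"}, roles_present)
fixed = build_channels(outputs, next_hop, {"source", "process"}, roles_present)
```

With the extra role required, only the channel starting at `Out1` is built; dropping it from the required set also yields the channel from `Out2` to `In2`.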

Interestingly, the spurious node does not appear in the slide set documenting the construction sequence. It looks like a change was planned but somehow never got implemented.

It is a simple fix to remove the spurious node. The main challenge is to check that the extra channels don't cause problems in cases where the existing channels are sufficient. We have a lot of test cases developed for issues #40, #109, etc., but (a) there are lots of tests covering different (sometimes corner) cases, and (b) there may be corner cases not covered by those tests. It will take time to run the tests, check for (and if necessary fix) any regression issues, and confirm that the tests cover all the required cases.
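One way to frame the regression check is to compare the channels found for a test case before and after the fix: nothing should be lost, and anything added needs reviewing. This is a minimal sketch with hypothetical data and function names; it is not how the test cases are actually run.

```python
def diff_channels(before, after):
    """Report channels lost or added between two runs of the same test case."""
    before, after = set(before), set(after)
    return {"lost": before - after, "added": after - before}


# Hypothetical channels for one test case, as tuples of asset names.
before = [("Out1", "Relay", "In1")]
after = [("Out1", "Relay", "In1"), ("Out2", "In2")]
report = diff_channels(before, after)
```

An empty `lost` set means the existing channels are intact; each entry in `added` is an extra channel whose effect on the results still has to be checked by hand.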

mike1813 commented 4 months ago

A reasonable number of tests have now been run:

Client service communications: a subset of tests originally created for testing new client authentication models (issues #85 and #86). The attached subset involves two data flows via a common set of intermediaries, with different authentication/authorization models and/or intermediaries. These gave incorrect results due to the bug in DSDPS+DC, which caused one of the data flows to be missed. With the fix described above, they no longer produce this error.
Backward flows: test cases created to check some caching scenarios (issue #109) where data flows in both directions via intermediaries. These also gave incorrect results due to the bug in DSDPS+DC. Since they gave correct results in Jan'24, the bug must have arisen in a fairly recent change. These tests now produce the expected results with the same fix for DSDPS+DC.
Cyclic data flows: test cases originally created to test data flow construction sequences (issue #40) and mechanisms by which users can control the data paths used (issue #89). These tests confirmed that the above fix for DSDPS+DC does not lead to problems with corner cases involving cyclic process-process communication paths.
Remote access data flows: test cases created to check corner cases produced when accessing data via both services and remote logins (issue #102) and some user interactivity issues (issue #107). These also show no new anomalies caused by the fix in DSDPS+DC.

Conclusion. These tests show that the fix to DSDPS+DC corrects some errors and, in cases where those errors do not arise, does not alter the outcome. The change should now be merged into branch 6a.

mike1813 commented 4 months ago

Updated on branch 149 and merged into 6a.

Spyderisk / domain-network

Bug in data lifecycle control inference #149