Spyderisk / domain-network

Bug in Cache Generation #109

Closed: mike1813 closed this issue 2 weeks ago

mike1813 commented 5 months ago

When a Data Flow to a Process cannot be used immediately, e.g., because it is one of several inputs to a service, or because it arrives in a context where the process cannot run, the data flow must be cached. This is addressed by a construction pattern sequence.

The current sequence contains a couple of bugs that should be fixed:

Best to address these at the same time, because the solution to one is likely to affect how to solve the other.

mike1813 commented 3 months ago

There is also an issue with the surfacing threat D.A.DallDS.0, which causes Loss of Availability at a Data asset (representing a type of data) if all copies of the data are unavailable. This threat is wrongly suppressed if there is an uncompromised cached copy of the data; cached copies are not persistent, so they should not count. See #120.
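As a rough sketch of the intended behaviour (hypothetical Python names; the actual threat is defined by patterns over the system model graph, not code):

```python
from dataclasses import dataclass

@dataclass
class DataCopy:
    is_cache: bool    # non-persistent cached copy of a data flow
    available: bool   # copy is uncompromised and reachable

def loss_of_availability_applies(copies: list[DataCopy]) -> bool:
    """Sketch of D.A.DallDS.0: Loss of Availability at a Data asset should
    apply when every persistent copy is unavailable. Non-persistent cached
    copies are ignored, so an uncompromised cache no longer suppresses the
    threat (the bug tracked in #120)."""
    persistent = [c for c in copies if not c.is_cache]
    return bool(persistent) and all(not c.available for c in persistent)
```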

mike1813 commented 3 months ago

In these scenarios, the underlying assumption is that when a Process runs, it will try to use all relevant inputs to create all potentially relevant outputs.

In what contexts (i.e., in which physical locations) could the Process run? The working assumption is that the Process may run in a context if:

An input is relevant if it is (a) sent by the initiating client or user, (b) obtained from a service, (c) stored on the Process host or (d) necessary for the Process to run. An output is relevant if it is (a) returned to the initiating client or user or (b) sent to a service.

Some variation in the Process execution is therefore assumed. The calculation can run with a subset of non-essential inputs, but uses everything it needs or can access when the calculation is triggered. It can produce a subset of outputs according to the demands of the situation, but can only skip outputs that could not possibly be relevant in that situation. If the Process can take different paths each generating a subset of the outputs and/or using a subset of its inputs, where these subsets cannot be deduced from the situation, then it should be modelled as several Processes.
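As a rough illustration (hypothetical Python names; the actual rules are construction patterns in the domain model), the relevance conditions above amount to simple predicates:

```python
from dataclasses import dataclass

@dataclass
class ProcessInput:
    sent_by_initiator: bool = False  # (a) sent by the initiating client or user
    from_service: bool = False       # (b) obtained from a service
    stored_on_host: bool = False     # (c) stored on the Process host
    necessary: bool = False          # (d) needed for the Process to run at all

def input_is_relevant(i: ProcessInput) -> bool:
    # A run may skip only non-essential inputs that happen to be
    # inaccessible when the calculation is triggered.
    return (i.sent_by_initiator or i.from_service
            or i.stored_on_host or i.necessary)

@dataclass
class ProcessOutput:
    returned_to_initiator: bool = False  # (a) returned to the initiating client or user
    sent_to_service: bool = False        # (b) sent to a service

def output_is_relevant(o: ProcessOutput) -> bool:
    return o.returned_to_initiator or o.sent_to_service
```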

The calculation of an output by a Process is not possible in a context if necessary inputs are not available in that context. In that situation, the Process can either delay the calculation or drop it.

One must decide in which contexts the Process may be unable to send output to a service. This was handled incorrectly in the previous implementation: it is not necessary for the sending Process (acting as a client) to be unable to connect to a subnet over which it could send the request; it is sufficient that the service may be unable to receive the request in some location where its host may be. This needs to be fixed, as discussed in issue #121.
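The corrected condition can be sketched as follows (hypothetical `reachable` predicate and context sets, not the actual rule implementation). The key point is that the check ranges over the service's possible locations, not over the subnets available to the sender:

```python
def output_may_be_undeliverable(process_ctxs, service_ctxs, reachable) -> bool:
    """True if there is some context in which the Process may run from which
    some possible location of the service cannot be reached. It is this
    condition, not 'the sender cannot connect to any subnet', that should
    determine where output may need to be cached."""
    return any(
        any(not reachable(p, s) for s in service_ctxs)
        for p in process_ctxs
    )
```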

An input is necessary if either:

In the second case, the data is, strictly speaking, not a Process output, and it need not be a Process input, but it is treated as both. This case may arise when the Process only serves or relays the data, which implies that distinct inference rules will be needed. For that reason, this possibility should be treated separately in caching inference patterns.

It is assumed that if the Process is initiated by an interactive user, calculations will be performed only in contexts where all the necessary inputs can be accessed and all outputs can be delivered. If this is not the case, the user interface will display a 'try again later' error. In other words, we assume that rather than delaying a calculation until inputs are available, the Process will pass the delay back to the interactive user.

This means no caching is inferred to arise from user interactions. It also means the system model is inconsistent if there is no context accessible to the interactive user in which the Process can access all necessary inputs and deliver all relevant outputs to services.
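Under that assumption, the consistency condition amounts to an existential check over the contexts accessible to the user (a sketch using hypothetical predicates):

```python
def interactive_use_is_consistent(user_ctxs, inputs_accessible, outputs_deliverable) -> bool:
    """The model is consistent only if at least one context accessible to the
    interactive user lets the Process access every necessary input and deliver
    every relevant output to services; otherwise a modelling error applies."""
    return any(inputs_accessible(c) and outputs_deliverable(c) for c in user_ctxs)
```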

If a client sends a request to the Process causing it to run in some context where some necessary inputs are not accessible, it is assumed that the calculation is delayed until these inputs can be obtained. With these assumptions, it can be inferred that:

These three cases apply if the Process serves, relays or processes the data. The remaining cases do not arise if it only serves or relays the data:

Note that case (4) ensures that necessary input from a client will be saved if other clients use the Process as a service, so the necessary input in case (5) must be obtained from a service. This in turn means the unsaved inputs in case (6) must also come from services.

Finally, what about outputs (other than forwarded data)? If inputs are cached as above, then output can be calculated when it may be possible to send it. Output for a client can be created and sent when the client reconnects, but output sent to a service may still need to be cached:

This last case arises because, if the host of the service to which the output should go is mobile, there may be locations in which it is not accessible from the Process. The Process, as a client, can choose when to try to connect, and can delay generating output until it is ready to connect. However, it cannot know in advance whether the output can be sent if there is a possibility that the service could move out of range.

With these assumptions, if there is a Process with a necessary input that cannot be accessed in any context, then the system model is inconsistent. Thus we also need two modelling errors:

mike1813 commented 3 months ago

Case 6 is not quite addressed correctly. The inference rules as implemented on branch 40 now find two distinct necessary inputs, neither stored on the Process host and each coming from a Service, and assume the first input is cached if there is no context where the Process could get input from one and be sure of also getting input from the other.

The idea is that in this situation, when the Process starts the calculation it tries to get both inputs. Whichever service becomes accessible first is the first service in the construction pattern, from which the first input is obtained. But if there is a possibility of the second service being out of range, it may be that access to the second input is blocked. In that case, the calculation must be delayed until the second service returns to a location where it can be accessed, during which time the first input must be stored.
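A sketch of the rule as implemented (hypothetical context sets and `reachable` predicate). Note the over-approximation discussed in the next paragraph: each input is tied to a single source service.

```python
def first_input_cached(process_ctxs, svc1_ctxs, svc2_ctxs, reachable) -> bool:
    """Case 6 as implemented on branch 40: the first input is assumed to be
    cached if there is no context in which the Process can obtain input 1 from
    its service AND is guaranteed to also obtain input 2 (i.e., every possible
    location of the second service is reachable from that context)."""
    return not any(
        any(reachable(p, s1) for s1 in svc1_ctxs)      # input 1 obtainable here
        and all(reachable(p, s2) for s2 in svc2_ctxs)  # input 2 guaranteed here
        for p in process_ctxs
    )
```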

The logic is not quite right, as the second input may be obtainable from some other service, so finding one source to be out of range does not mean the second input will be unavailable. To express this, one would need to insert access interruption links to Data Flows and Data, not just to client-service channels. At this point that hasn't been done because:

The pattern used means caching may be inferred when it is not needed. Genuine threats via cached data will not be missed, but some spurious extra threats may be added. This is compatible with the principle that any lack of fidelity should lead to risks being overestimated rather than underestimated. On that basis, using a simpler pattern so other updates can be integrated sooner is acceptable.

However, the same simplification does make it difficult to create modelling error threats (Error 1 and Error 2). For now we will need to make do without them, but this dangling issue will need to be fixed at some point.

mike1813 commented 3 months ago

These changes have fixed problems in my current test cases, so I have pushed them to branch 40.

I can't easily create test cases for every scenario because of the issues with the representation and inference patterns for interactive user access to data, as described in #107. The plan is to create a temporary fix for some of those issues on branch 40, so I can create a few more tests for the cache construction sequence.

After that it would make sense to merge changes from branch 40 into branch 6a.

mike1813 commented 3 months ago

Created a new set of tests covering cases 1 to 6 above. All take the form of three processes with two uses relationships, in which the middle process is the one that may need to cache data.

In each case there is one or sometimes two scenarios in which the cache should be inferred to exist, with names Case 1a, 1b, etc. There is also one scenario where the data cache should not be inferred, with names like Case 1x, Case 3x, etc. There is no Case 2x because that would be identical to Case 1x.

The asserted system models for these tests are in this zipfile: Issue 109 Tests.zip.

The results are as we would expect - a cache of data D1 or D2 is inferred to exist (or not) on the mobile host N1.

In Cases 2a, 5a and 5x the middle process is a service on a mobile device getting data from a client that cannot access the service unless it is in a specific location. These three cases also exercise case 7 above, and in each case a cache of D1 or D2 is also inferred to exist on the client host H1.

This is a little confusing, because in some of those cases, the same data is cached on N1, but formally it is correct.

The zipfile contains one further test case, based on a reasonably common IoT scenario, in which a user wears an activity sensor connected via Bluetooth to their phone. The user also has an application running on their PC in which they log their meals. That data is saved to disk, and served by a simple data service. An app on the phone receives data from the sensor and uses it along with the meal data to create a diet plan, which it displays to the user and stores via the same data service on the user's PC.

The PC remains at home, but the phone and sensor may be carried by the user when they go out. This means the phone app:

Once validated, the model contains one data cache, for the sensor data, which is stored by the app on the phone. This is needed because the app cannot run until it can access the meal data, so it must cache the otherwise unsaved flow of input from the sensor. The app generates its output (the diet plan) when the user gets home and the phone can access the meal data.

There is no need to cache the meal data on the phone, because the phone uses it immediately (having already cached the sensor data it needs as the second input). There is no need to cache the diet plan either: the phone cannot save it while the user is out, so it should not generate that output away from home (quite apart from the fact that it cannot access the meal data there). It should generate the output when it is in range of the data service (i.e., when the user is home), and because the data service runs on a fixed host in that location, there is no possibility of the data service then being out of range.

Note that the data service could be down, leading to loss of availability in the flows of data to and from the app, but this loss of availability is a deviation from the expected behaviour, so it is assumed not to cause the app to cache the data.

mike1813 commented 2 months ago

Problems arising from #107 that prevented good tests against issue #109 have now been addressed. This was achieved by making a partial fix for #107 and reformulating the test cases to avoid problems not covered by this partial fix.

In addition, the threat to persistent data availability now ignores (non-persistent) cached copies of data flows, addressing #120. Doing this also involved fixes for #123 and #124.

All these fixes are now on branch 40, so a pull request can now be raised addressing this issue.

mike1813 commented 3 weeks ago

Work on issue #107 revealed one more possible caching scenario, in which uncached input is sent to a service, causing it to execute in some context, and the service produces output destined for a second service that cannot be accessed in that context. This appeared in a test case for issue #107, the IoT scenario discussed above: Issue 107 Test 1a - Asserted.nq.gz.

In this case, a sensor acts as a client sending input to the DietPlanner, which runs on a phone with which the sensor is paired, and generates updated diet plans considering the user's activity levels. The DietPlanner also uses input data Meals, which must be fetched from a service running on a PC accessible from only one location. Consequently, the sensor input should be cached and used when the DietPlanner is next within range of the PC. The output data DietPlan is stored via the same service running on the PC. It should not be cached, because it is only generated when the DietPlanner is in range of the PC.

In practice, problems in the model for user interactions with data (as described in issue #107) cause these inferences to fail. Input data 'Meals' is inferred to come from an interactive user - which is true, but only via a separate process running on the PC. Due to the shortcomings in the model of user-data interactions, it is inferred that Meals is input via the DietPlanner process, so it doesn't need to wait until it can fetch this data from the service running on the user's PC.

This has two consequences. First, the need to cache the sensor input is missed, something that can only be fixed by addressing issue #107. Second, if the DietPlan could be created in any location, it should be cached until the user comes within range of the storage service on their PC; changes made to address #109 mean this is no longer inferred. While the need to cache output only arises here because of the failure to detect that an input should be cached, the output should nevertheless be cached if it cannot be sent in a context where input from a client is not cached.

There are two ways this omission could be resolved:

  1. Ensure input received from a client is cached if it arrives in a context where output for a service cannot be sent. The assumption in this case is that a process would not generate output it couldn't send, so the calculation would be delayed, which means caching the inputs from which it is computed.
  2. Ensure output sent to a service is cached if the process may be forced to perform calculations by receipt of uncached input from a client in a context where the output cannot be sent.

The first solution seems more realistic in this test case, though that may be because we know processing should be delayed due to the presence of a second, necessary input (Meals). If this were removed from the model, it is less clear that the input would always be cached and processing delayed. Moreover, if we implement the first solution, we exclude any possibility that an output may be cached.

The second solution also works in this test case, given that the sensor input is taken to be uncached, even though in this test case the input should have been cached. It seems less realistic, but if the second input (Meals) is removed from the model, we may find that in some cases it is the output that should have been cached. The second solution also allows users to choose which way to go: if the presence of an input cache is not inferred, the user can assert that the data is stored (or cached) on the process host. To make this work, we need two changes:

This situation may be thought of as a new case 8, although the new construction patterns should be inserted before the one that infers the presence of an output cache due to case 7.
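A minimal sketch of the second rule, i.e., the new case 8 (hypothetical predicates; the actual change is a construction pattern ordered before the case 7 pattern):

```python
def case8_output_cache_needed(process_ctxs, uncached_client_input, output_sendable) -> bool:
    """Output destined for a service is cached if the Process may be forced to
    run by uncached client input arriving in a context where that output
    cannot be sent. A user who knows the input is in fact cached can instead
    assert that the data is stored (or cached) on the process host."""
    return any(
        uncached_client_input(p) and not output_sendable(p)
        for p in process_ctxs
    )
```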

mike1813 commented 3 weeks ago

Now fixed on branch 107, allowing previously failing tests for user interactivity to be used to address issue #107.

mike1813 commented 2 weeks ago

We don't yet have the two modelling error threats as discussed above, Error 1 and Error 2.

These are difficult to implement, as they require threat patterns that match a condition not being met. They have now been moved into a separate issue (#138), so this one can be closed.