loop normal form properties

remi-delmas-3000 commented 1 year ago

@thomasspriggs @martin-cs Here is a summary of the properties of loop normal form for goto programs:

Viewing a GOTO program as both the sequence of its instructions as the control flow graph induced by the sequence structure and its GOTO statements: each instruction is a node, there's an edge n1->n2 iff n2 is the successor of n1 in the sequence, or if n1 is a GOTO instruction with n2 as jump target. The entry point of the graph is the node of the first instruction.

A loop in the CFG is a set of strongly connected nodes. The loop is natural iff there is a node in the loop, the header, that dominates the other nodes of the loop. An edge going from an instruction of the loop to the header node is called a back-edge. A node that has a back edge is called a latch node.

A node is an exiting node if it has at least one successor that is not in the loop. That successor outside of the loop is called an exit node of the loop.

The properties of a normalised natural loop are:

The header node is the only node with predecessors that are not in the loop
The header node has a unique predecessor called the pre-header node
The predecessors of exit nodes are all in the loop (i.e. control can only reach an exit node without actually going through the loop instructions first)
The loop has a single latch node

We say that a natural loop is densely packed in the goto program iff the sub-sequence of instructions starting at the loop header instruction and ending at the loop latch instruction only contains instructions of the loop and iff the preheader node is right before the header node in the sequence.

A goto-program is loop normal form iff all the loops it contains are natural, densely packed and if the only edges that jump to an instruction with a lower index in the sequence are back-edges of natural loops.

These notions are captured in two functions:

a checker function that checks if the loop normal form properties holds on a goto program, and returns either success or an error value describing why it does not hold, e.g. "exit node 'n' of loop 'l' has a predecessor that is not in 'l'", "there is a backwards jump in the sequence that is not a natural loop back-edge")
a normalisation function that rewrites a goto program's instruction sequence to loop normal form, and returns either success if the operation succeeds, or an error value describing the reason for failure.
the normalisation function should be idempotent, i.e. should not modify a program that is already in loop normal form (it can use the checker function to decide wether there is work to do or not).

thomasspriggs commented 1 year ago

The way the pre-header node was explained in the recent meeting appeared to require the use of a labelled skip instruction. It is worth noting that there is an existing normalisation pass in the cbmc code base which is called from tens of places called remove_skip. As the name implies it will remove skip instructions from goto programs. It can move labels around in order to remove skip instructions. Therefore it could potentially combine the pre-header node from this initial suggested form with the header node. Therefore think that we should not make the notion of a mandatory pre-header node part of our definition.

peterschrammel commented 1 year ago

Are header nodes of nested loops allowed to coincide?

feliperodri commented 1 year ago

https://github.com/diffblue/cbmc/issues/7506 https://github.com/diffblue/cbmc/issues/7517

remi-delmas-3000 commented 1 year ago

Therefore think that we should not make the notion of a mandatory pre-header node part of our definition.

Agreed, we can always add these nodes as needed afterwards.

I think the same observation could apply to exit nodes. We would like exit nodes to have only incoming edges coming from loop instructions so that the header of the loop dominates the exit nodes and that any instrumentation we may add on exit nodes is only executed when the loop has been entered once.

This may require the insertion of a skip node to separate incoming edges from the loop from incoming edges coming from somewhere else.

I am not sure however that reintroduce these skip nodes as needed after the fact is as easy as in the pre-header node case.

remi-delmas-3000 commented 1 year ago

Are header nodes of nested loops allowed to coincide?

I'd like to rephrase the definition a bit to make it clearer that a natural loop is specific to a back edge:

A back-edge is an edge from a node l to another node h of the loop such that h dominates l from the entry node of the CFG. The dominance relation h dom l implies the existence of a path from h to l, the back edge that of a connected component in the CFG that contains l and h. The natural loop of the back-edge l->h is the smallest subset of nodes the CFG that contains both l and h and that is strongly connected.

So by that definition it seems like two distinct natural loops (i.e that have different back edges and are each a minimum size SCC) can have the same header node.

It seems like the pre-header node requirement would reject sharing though.

It would be nice to have a checker function that can optionally check for the presence of pre-header and dominated exit nodes (since having these properties will really make loop contracts instrumentation easier).

thomasspriggs commented 1 year ago

Is it worth considering some sort of requirement for basic blocks to be maximal as part of this definition? For example if we have 2 single entry single exit blocks of instructions A and B where block A ends with a GOTO which jumps to block B then the two blocks should be combined and the GOTO instruction removed, in order to reach normal form. I am thinking about resolving issues such as the one seen here - https://github.com/diffblue/cbmc/issues/7506 Cases like these, where code superficially looks like a loop due to the backwards GOTO, but no loop of any kind is actually present.

I'd also like to ask how closely does the existing code for detecting natural loops in src/analyses/natural_loops.h fit with the natural loop definition, which we are aiming for in this issue?

martin-cs commented 1 year ago

Thanks @remi-delmas-3000, this is a good foundation.

I don't think any of this is wrong but here are a load of pedantic questions with the aim of tightening things up:

there's an edge

Assuming that THROW instructions have been removed?

Good point. I was implicitly assuming loop normal form properties as an addition to the core GOTO fragment documented here : https://github.com/diffblue/cbmc/pull/7505

The entry point of the graph is the node of the first instruction.

First in terms of sequence of instructions? In pathological cases this is not guaranteed to be the first instruction executed.

Could you provide an example ? The only case I can think of is that of __CPROVER_preconditions that are pulled out of functions at the call site.

The loop is natural

For clarity, are you expecting this definition to be equivalent to the one in src/analyses/natural_loops.h ? I am not convinced that one is correct or really captures what the code does.

Not necessarily, as I am not really certain what src/analyses/natural_loops.h computes. A first step could be to give a spec to natural_loops.h.

the header, that dominates the other nodes of the loop.

This is not unique is it? You can, and often will, have multiple nodes with this property?

We want this node to be unique.

An edge going from an instruction of the loop to the header node is called a back-edge.

So which things are back edges depends on the choice of header? And these may not co-incide with instruction-sequence or CFG back-edges?

Yes this depends on choosing the header. For a CFG that was successfully put in loop-normal-form, we would like to have natural loop back-edges be the only instruction-sequence back-edges.

A node is an exiting node ...

Why not have entrance nodes defined in the same way?

Doesn't having (by definition) the header node dominate all nodes of the loop rule out multiple entrance nodes ? Having a single entry (the header) and a pre-header node is important for loop contract transformation.

... properties of a normalised natural loop are ... This seems pretty reasonable. It is slightly more general than what we were using in ASVAT as it allows multiple exits.

The predecessors of an exit node are all in the loop... I can see why you want this but it might be a little bit fiddly. because they could also be merge targets, for example:

if (thing) { while (other_thing) { stuff(); } }

I that case we would need to distinguish the exit node of the loop from the jump target of if(thing) { .. }.

... single latch node ... I can see why this is useful and the definition works but be careful with how nested loops work.

I note that there are a couple of differences with this definition and LLVM's one. Your header is not required to be unique, or an entry node. Also I note that there is no maximality condition so:

A -> B A -> C B -> D C -> D D -> A

A, B, D is a loop? I guess the problem with maximality is that it makes nested loops hard to handle. I'd say no according to the "for all loop nodes, all predecessors are in the loop" since D has C as predecessor and C is not in A, B, D.

@martin-cs have you read this addendum (there's a minimality condition which I added to account for nested loops (if two natural loops are nested the minimality of each SCC should make nested loops emerge naturally).

A back-edge is an edge from a node l to another node h of the loop such that h dominates l from the entry node of the CFG. The dominance relation h dom l implies the existence of a path from h to l, the back edge that of a connected component in the CFG that contains l and h. The natural loop of the back-edge l->h is the smallest subset of nodes the CFG that contains both l and h and that is strongly connected.

thomasspriggs commented 1 year ago

I have questions about the applications of the normal form we are discussing in this ticket.

the main application is as a precondition loop contract transformation, where the densely packed sequence of instructions of a loop is replaced by a nondeterministic loop free check base + havoc + check step instruction sequence. This is an opt-in transformation, and we plan to invoke loop normalisation + loop normal form check on a goto program only if the user wants to apply loop contracts to that function. Most of the problems we have with loop contracts today will be solved by having having the sequence densely packed, having a unique pre-header node, and having dominated dedicated exit nodes.
the second application is to avoid asking the user for unwinding directives when there is no natural loop involved. We get this through the "the only instruction sequence back edges are natural loop back-edges". In addition, if this one day becomes an actual invariant of a core GOTO programs, symex could also rely on it to handle SSA counters.

We need to consider how both normalised and awkward loop cases are handled. Awkward examples can always be valid goto-programs because they can be written by hand in C. CBMC should ideally be able to analyse any of these awkward cases.

Yes. For now what we are asking for loop contracts is a function that will tell us if a program is in loop normal form or not, and a normalisation function that would succeed when goto programs from structured C programs programs that use if-then-else, while-loops, for-loops and do-while-loops with break and continue statements.

This would imply that an input which fails the loop checking function is still valid for analysis. I acknowledge there is an ideal to normalise awkward inputs. However until we have a working normalisation implementation for all possible inputs, I think that goto-programs containing both normalised loops and awkward loops must both be considered valid.

Yes. That's why this normalisation pass would be opt-in and invoked when needed for loop contracts, and optionally on programs coming from kani.

So should we be aiming to add labels to the goto program or some other data structure such that the normalised loops can be marked and subsequently processed by loop-specific passes and the other awkward loops bypassed?

Yes that would be really good. So maybe the loop normal form checking function, instead of failing on awkward loops, could return a collection of loop descriptors and the "normal form/awkward" flag is just part of that loop descriptor. I dont' think tagging the actual loop instructions is a good idea, since normal form properties can be invalidated by program instrumentation/transformation passes.

Example input C program

```C unsigned int factorial(unsigned int number) { unsigned int factorial = 1; for(unsigned int i=0; ++i<=number;) { factorial = factorial*i; } return factorial; } ```

Example labelled loop GOTO program

```C factorial /* factorial */ // 0 file ./factorial.c line 4 function factorial DECL factorial::1::factorial : unsignedbv[32] // 1 file ./factorial.c line 4 function factorial ASSIGN factorial::1::factorial := cast(1, unsignedbv[32]) // 2 file ./factorial.c line 5 function factorial DECL factorial::1::1::i : unsignedbv[32] // 3 file ./factorial.c line 5 function factorial ASSIGN factorial::1::1::i := cast(0, unsignedbv[32]) // 4 file ./factorial.c line 5 function factorial // Labels: __CPROVER_loop1_header 1: ASSIGN factorial::1::1::i := factorial::1::1::i + 1 // 5 file ./factorial.c line 5 function factorial // Labels: __CPROVER_loop1_latch IF ¬(factorial::1::1::i ≤ factorial::number) THEN GOTO 2 // 6 file ./factorial.c line 7 function factorial ASSIGN factorial::1::factorial := factorial::1::factorial * factorial::1::1::i // 7 file ./factorial.c line 5 function factorial GOTO 1 // 8 file ./factorial.c line 8 function factorial // Labels: __CPROVER_loop1_exit 2: DEAD factorial::1::1::i // 9 file ./factorial.c line 9 function factorial SET RETURN VALUE factorial::1::factorial // 10 file ./factorial.c line 9 function factorial DEAD factorial::1::factorial // 11 file ./factorial.c line 10 function factorial END_FUNCTION ```

The specification proposed allows for multiple exit nodes with a single latch node. But doesn't multiple exit nodes imply multiple latch nodes as well?

The latch node in that example seems to be instruction 7 GOTO 1. Instruction 5 IF ¬(factorial::1::1::i ≤ factorial::number) THEN GOTO 2 is an exiting node with corresponding exit node instruction 8. Instruction 7 is not an exiting node since its only successor is the header node. Multiple exit nodes start appearing when shortcutting conjunction or disjunction is used in loop guards or when break or continue statements are used in the loop body.

remi-delmas-3000 commented 1 year ago

@thomasspriggs

Is it worth considering some sort of requirement for basic blocks to be maximal as part of this definition? yes block maximality would be good to have. For https://github.com/diffblue/cbmc/issues/7506 the core problem is having an edge that goes back in the instruction sequence but that is not the back edge of an actual natural loop.

martin-cs commented 1 year ago

I have questions about the applications of the normal form we are discussing in this ticket. We need to consider how both normalised and awkward loop cases are handled. Awkward examples can always be valid goto-programs because they can be written by hand in C. CBMC should ideally be able to analyse any of these awkward cases. This would imply that an input which fails the loop checking function is still valid for analysis. I acknowledge there is an ideal to normalise awkward inputs. However until we have a working normalisation implementation for all possible inputs, I think that goto-programs containing both normalised loops and awkward loops must both be considered valid.

Agreed. We need algorithms that are robust, or, at least, aware of awkward loops.

So should we be aiming to add labels to the goto program or some other data structure such that the normalised loops can be marked and subsequently processed by loop-specific passes and the other awkward loops bypassed?

I think it will need to be an auxiliary data structure because labels are not sufficiently "brittle". Some of the proposed properties can be easily broken by things like remove_skip and other passes. If you label then you will need to explicitly invalidate the labels. If it is an auxiliary data structure then you will have to directly maintain the link or recompute when the goto program changes.

The specification proposed allows for multiple exit nodes with a single latch node. But doesn't multiple exit nodes imply multiple latch nodes as well?

Is your thinking that these will all be of the form "exit or go back to the top"? It is a fair point but the "one latch edge" is possible by creating a single SKIP node and redirecting all latch nodes to that. This is an example of the previous "normal form properties that are fragile to program transformations".

martin-cs commented 1 year ago

Is it worth considering some sort of requirement for basic blocks to be maximal as part of this definition? yes block maximality would be good to have. For #7506< https://github.com/diffblue/cbmc/issues/7506> the core problem is having an edge that goes back in the instruction sequence but that is not the back edge of an actual natural loop.

Note that remove_skip will merge basic blocks if they are sequential.

Also I was sure that somewhere we had the "jump forwarding" pass that does something like:

if (it->is_goto() && it->get_target()->is_goto()) {
  if (it->get_target().condition().is_true()) {
    it->set_target(it->get_target()->get_target());

  } else if (it->get_target().condition().is_false()) {
    it->set_target(std::next(it->get_target()));

  }
}

which might also help reduce basic block size. It seems

src/goto-programs/ensure_one_backedge_per_target.cpp

has half of this.

remi-delmas-3000 commented 1 year ago

@thomasspriggs do you think the requirements sufficiently well defined now ?

thomasspriggs commented 1 year ago

My team has re-prioritised some of the work. I now have other documentation to write first. So I am not actively working on this issue for the moment.

feliperodri commented 1 year ago

This is not a blocker for our GOTO standardization anymore, so we can remove the high severity flag. We have also worked around corner cases in the GOTO instrumentation for loop contracts.

diffblue / cbmc

loop normal form properties #7518