POETSII / graph_schema

Schemas for describing graphs plus simple tools for working with them

graph_schema - Version 4

This repository is intended to describe the format for graphs.

There will be many different kinds of graphs used, but for the moment this repository focuses on virtual graphs, i.e. graphs which do not worry about physical implementation and layout, and only describe structure, functionality, and connectivity. The schemas for other kinds of graph could also be added here, but that is up for discussion.

The policy is that things should not be merged to master unless make test passes.

Current TODO list

Current BUG list

Version 4.4 features

Version 4.3 features

Version 4.2 features

Version 4 features

Version 3 features

An exemplar XML for version 3, featuring one of everything, can be found in master/virtual-graph-schema-v3-exemplar-xml.xml

A conversion script from version 2 to version 3 is included with this new version of the schema. Information on its usage is included here

Requirements

Vagrant

Note: you do not have to use vagrant. If you have an existing Linux install or virtual machine you may be able to use that - see Manual Installation below.

The recommended way of setting things up (especially under Windows) is to use vagrant to get a virtual machine that is configured with all the libraries. The file Vagrantfile contains setup instructions that should result in a fully configured Ubuntu 16 instance with all libraries installed.

So assuming you have vagrant installed, along with some sort of virtualiser (e.g. VirtualBox), you should be able to get going with:

git clone git@github.com:POETSII/graph_schema.git
cd graph_schema
vagrant up    # Wait as it downloads/starts-up/installs the VM
vagrant ssh
cd /vagrant   # Get into the repository
make test     # Run the built-in self tests

Manual installation

The packages needed for Ubuntu 18.04 are given in the provisioning script provision_ubuntu18.sh.

This is currently believed to work on both Ubuntu 18 and 20.

pypy

Note: this is completely optional. It only matters if you are working with very large graphs (e.g. millions of devices).

The python tools should all work under pypy, which is a JIT-compiling implementation of Python. Many of the graph_schema tools get a decent speed boost of around 2-3x, and nothing seems to run any slower. pypy is particularly useful when generating larger graphs using python, as you usually get about twice the generation rate. There is a provisioning script, provision_ubuntu18_pypy.sh, which should set up a pypy3 virtualenv, though it is a bit fragile and may require some hand-holding.

To set up the pypy virtualenv, from the root of the graph_schema repo:

./provision_ubuntu18_pypy.sh

Watch the messages going past, as it is possible that something will go slightly wrong.

Assuming it is set up, then do:

source pypy3-virtualenv/bin/activate

to enter the virtual environment. If this succeeds, then you should find that your command prompt is prefixed by (pypy3-virtualenv). From this point on the python3 command is redirected to pypy3, which you can check with python3 -V. To leave the virtualenv, use deactivate.

For example, on my current machine I get this:

dt10@LAPTOP-0DEHDEQ0:/mnt/c/UserData/dt10/external/POETS/graph_schema
$ source pypy3-virtualenv/bin/activate

(pypy3-virtualenv) dt10@LAPTOP-0DEHDEQ0:/mnt/c/UserData/dt10/external/POETS/graph_schema
$ python3 -V
Python 3.6.1 (7.1.1+dfsg-1~ppa1~ubuntu18.04, Aug 09 2019, 16:05:55)
[PyPy 7.1.1 with GCC 7.4.0]

(pypy3-virtualenv) dt10@LAPTOP-0DEHDEQ0:/mnt/c/UserData/dt10/external/POETS/graph_schema
$ deactivate

dt10@LAPTOP-0DEHDEQ0:/mnt/c/UserData/dt10/external/POETS/graph_schema
$

It's not a big deal if it doesn't work, as everything still works in standard python (i.e. cpython).

Schema

The structure of the application format is captured in a RELAX NG XML schema, which checks a fair amount of the structure of the file, such as allowed/missing elements and attributes. The schema is available at master/virtual-graph-schema.rnc.

For checking purposes, files can be validated directly against the RELAX NG schema (e.g. using jing), and the schema is also available as an XML Schema (.xsd), which many XML parsers can handle.

To validate a file X.xml you can ask for the file X.checked. This works for graph types:

make apps/ising_spin/ising_spin_graph_type.checked

and graph instances:

make apps/ising_spin/ising_spin_16_16.checked

If there is a high-level syntax error, it should point you towards it.

A slightly deeper check on graph instances is also available:

python3 tools/print_graph_properties.py apps/ising_spin/ising_spin_16x16.xml

This checks some context-free aspects, such as id matching. Beyond this level you need to try running the graph.
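The flavour of these context-free checks can be sketched in a few lines of Python (a hypothetical minimal version for illustration, not the actual tool's code): every edge instance must reference device ids that were actually declared.

```python
# Hypothetical sketch of a context-free id check, in the spirit of
# print_graph_properties.py (not the actual tool's code).

def check_edge_ids(devices, edges):
    """devices: set of declared device instance ids.
    edges: (dst_device, dst_pin, src_device, src_pin) tuples."""
    errors = []
    for (dst, dst_pin, src, src_pin) in edges:
        if dst not in devices:
            errors.append("unknown destination device '%s'" % dst)
        if src not in devices:
            errors.append("unknown source device '%s'" % src)
    return errors

devices = {"c_0_0", "c_0_1"}
edges = [("c_0_0", "in", "c_0_1", "out"),
         ("c_0_0", "in", "c_9_9", "out")]   # second edge names a missing device
print(check_edge_ids(devices, edges))       # one error, mentioning c_9_9
```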

For reference, an example XML, which features one of everything in the schema, has been included in virtual-graph-schema-v3-exemplar-xml.xml.

Usage

For the top-level check, do:

make test

This should check the schema, validate all examples/test-cases against the schema, and also validate them by parsing them and simulating them.

To look at some output, do:

make demo

This will take longer and use more disk space - it may also fail as things evolve over time, so you may want to do make demo -k. It will generate animated GIFs in the apps/* directories.

Applications

There is a small set of basic applications included. These are not amazingly complicated, but provide a way of testing orchestrators and inter-operability. Models that are Deterministic should provide the same application-level results independently of the orchestrator. Models that are Deterministic(FP) should provide the same application-level results on any implementation, except for errors introduced in floating-point (see the later notes on fixed-point implementations). Models that are Chaotic have a dependency on the exact execution order and/or message order - they should always provide some kind of "correct" answer, but the answer you get may vary wildly.

Models that are Asynchronous do not rely on any kind of internal clock, so there is no central synchronisation point being built. These may be sub-classed into Async-GALS which use the time-stepping approach, and Async-Event which use the local next-event approach. Models that are Synchronous internally build some kind of central clock (hopefully using logarithmic time methods), so there is an application barrier. Models that are Barrier use the internal hardware idle detection feature through OnHardwareIdle.

Currently all applications are designed to meet the abstract model, which means that messages can be arbitrarily delayed or re-ordered, and sending may be arbitrarily delayed. However, they are currently all intolerant to any kind of loss or error - AFAIK they will all eventually lock up if any message is lost or any device fails.

Fixed-point implementations

For some applications there will be a *_fix version, which is usually just a fixed-point version of the application. This was originally useful back when there was no floating-point support in the Tinsel CPU. They are still sometimes useful now for testing purposes, as it is generally much easier to make fixed-point versions bit-exact deterministic under all possible executions. Some common problems that can occur with repeatability and floating-point are:
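To illustrate why fixed-point is easier to make bit-exact, here is a minimal Q16.16 sketch in Python (an illustrative assumption; the actual *_fix applications define their own formats and ranges). Fixed-point add and multiply are exact integer operations, so every platform and execution order produces identical bits, whereas floating-point summation order changes the rounding.

```python
# Minimal Q16.16 fixed-point sketch (illustrative only; the *_fix
# applications define their own formats).
FRAC_BITS = 16

def to_fix(x):
    # float -> Q16.16 integer representation
    return int(round(x * (1 << FRAC_BITS)))

def fix_mul(a, b):
    # Q16.16 multiply: exact integer arithmetic, then rescale
    return (a * b) >> FRAC_BITS

def to_float(a):
    return a / (1 << FRAC_BITS)

a, b = to_fix(1.5), to_fix(2.25)
print(to_float(fix_mul(a, b)))   # -> 3.375, identically on every platform
```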

Providers

There are a number of simulators included in graph_schema, all of which need to be able to execute the C/C++ handlers specified in graph types. That means that at some point the graph types must be compiled by a C++ compiler to produce binary code, which is relatively fast (e.g. 10s of seconds), but may be much slower than the time taken to simulate a graph (e.g. when simulating 100s of small graphs during testing). The approach taken in graph_schema is to compile the graph type into a "provider", which is a shared library that can be loaded by simulators. These usually have the extension .graph.so.

Each provider exports a function "registerGraphTypes", so a simulator can dlopen a provider file, call registerGraphTypes, and the provider will register all of the graph types it knows how to support. The simulator can then create instances of GraphType (from include/graph_core.hpp) from a provider, which provides:

If a tool or simulator cannot find a compiled provider it will create a non-executable provider using just the topological information in the GraphType of the source file. This allows tools/simulators to do everything except call handlers, so they can read and write graph instances, but won't be able to run them.
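The registration pattern can be illustrated with a small Python analogy (the names below loosely mirror the C++ API and are not the real signatures; real providers are shared libraries loaded with dlopen): each provider, when loaded, pushes the graph types it supports into a registry supplied by the simulator.

```python
# Pure-Python analogy of the provider registration pattern
# (illustrative names, not the real C++ signatures).

class Registry:
    def __init__(self):
        self.graph_types = {}

    def register(self, graph_type):
        self.graph_types[graph_type["id"]] = graph_type

def register_graph_types(registry):
    # A real provider would register compiled handler functions here.
    registry.register({"id": "ising_spin", "device_types": ["cell"]})

registry = Registry()
register_graph_types(registry)      # what the simulator does after dlopen
print(sorted(registry.graph_types)) # -> ['ising_spin']
```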

If you see a message complaining that something was "not loaded from compiled provider", it means that a compiled provider for that graph was not found. Most tools will actually print the set of providers they are finding at start-up, and you'll probably find your graph is not on that list. The default search path is:

To compile a provider, use the tool tools/compile_graph_as_provider.sh (see below).

Tools

Graph simulation

tools/compile_graph_as_provider.sh

This takes an input xml file containing a graph type, and compiles it into a "provider" (see earlier discussion on providers). By default it will produce a provider called <graph_type_name>.graph.so in the current directory.

Use --help to see all available options.

bin/epoch_sim

This is an epoch-based orchestrator. In each epoch, each device gets a chance to send from one port, with probability probSend. The order in which devices send in each round is somewhat randomised. All devices are always ready to receive, so by default there is no blocking or transmission delay. Note that this simulator tends to be very "nice", and makes applications that are sensitive to ordering appear to work. epoch_sim is good for initial debugging and development, but working there does not guarantee it will work in hardware.

Epoch sim can now also simulate some elements of out-of-order transmission and message over-taking. If you use the --prob-delay option, you can specify the probability that a given message delivery will be delayed. So for example, if you specify a delay probability of 0.5, then every message that can be received in an epoch has a 50/50 chance of actual delivery. Any message that is not delivered will be saved in a buffer and retried in the next epoch, where any buffered messages will again be given a 50/50 chance of delivery. So in general, if you select probDelay>0, the number of epochs taken to receive any given message is geometrically distributed, with success probability 1-probDelay.
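The delay model above can be checked with a short simulation (a sketch, not epoch_sim's code): with delay probability p, the mean number of epochs until delivery should come out close to 1/(1-p).

```python
import random

def epochs_until_delivery(prob_delay, rng):
    # Each epoch the message is retried; delivery succeeds with prob 1-p.
    epochs = 1
    while rng.random() < prob_delay:   # delayed: buffered for next epoch
        epochs += 1
    return epochs

rng = random.Random(42)
p = 0.5
samples = [epochs_until_delivery(p, rng) for _ in range(100000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))   # close to 1/(1-p) = 2.0
```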

Example usage:

make demos/ising_spin_fix/ising_spin_fix_16_2.xml
bin/epoch_sim apps/ising_spin/ising_spin_16x16.xml

Environment variables:

Parameters:

bin/graph_sim

This is a much more general simulator than epoch_sim, and is useful for finding more complicated protocol errors in applications. The usage is mostly the same as epoch_sim, though the options differ slightly. The biggest change is that it supports the idea of messages being in flight, so a message which is sent can spend some arbitrary amount of time in the network before it is delivered. This also applies to the individual destinations of a source message, so the messages in flight are actually (dst,msg) pairs which are delivered separately.
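The in-flight model can be sketched as follows (an illustrative approximation, not graph_sim's code): a single application-level send fans out into one (dst, msg) pair per destination, and each pair is then delivered independently, in any order.

```python
import random

def send(msg, destinations, in_flight):
    # Fan one application-level send out into independent (dst, msg) pairs.
    for dst in destinations:
        in_flight.append((dst, msg))

def deliver_one(in_flight, rng):
    # Deliver an arbitrary in-flight pair; pairs may overtake each other.
    i = rng.randrange(len(in_flight))
    return in_flight.pop(i)

rng = random.Random(1)
in_flight = []
send("update", ["n0", "n1", "n2"], in_flight)
order = [deliver_one(in_flight, rng) for _ in range(3)]
print(order)   # all three pairs delivered, but in an arbitrary order
```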

Usage:

make demos/ising_spin_fix/ising_spin_fix_16_2.xml
bin/graph_sim apps/ising_spin/ising_spin_16x16.xml

The value of graph_sim is that it allows you to simulate more hostile environments, particularly with respect to out of order delivery of messages. To support this there are two main features:

Environment variables:

Parameters:

Limitations:

Graph rendering and analysis

tools/render_graph_as_dot.py

This takes a graph instance and renders it as a dot file for visualisation. Given a snapshot file it can generate multiple graphs, one for each snapshot.

Parameters:

bin/calculate_graph_static_properties

Takes a graph instance and calculates a set of basic static properties about the graph, including:

Usage

bin/calculate_graph_static_properties path-to-xml [metadata.json]

The path-to-xml must always be supplied.

If a metadata.json is supplied, then the data calculated by this program will be inserted into the given document (i.e. it will inherit those properties).

Example usage:

$ bin/calculate_graph_static_properties apps/ising_spin/ising_spin_8x8.xml > wibble.json
$ less wibble.json
{
    "graph_instance_id": "heat_8_8",
    "graph_type_id": "ising_spin",
    "total_devices": 64,
    "total_edges": 256,
    "device_instance_counts_by_device_type": {
        "cell": 64
    },
    "edge_instance_counts_by_message_type": {
        "update": 256,
        "__print__": 0
    },
    "edge_instance_counts_by_endpoint_pair": {
        "<cell>:in-<cell>:out": 256
    },
    "incoming_degree": {
        "min": 0.0,
        "max": 4.0,
        "mean": 2.0,
        "median": 1.3333333333333333,
        "stddev": 2.0
    },
    "outgoing_degree": {
        "min": 4.0,
        "max": 4.0,
        "mean": 4.0,
        "median": 4.0,
        "stddev": 0.0
    }
}

tools/render_event_log_as_dot.py

Takes an event log (e.g. generated by bin/epoch_sim) and renders it as a graph. Beware: don't try to render lots of events (e.g. 1000+) or large numbers of devices (30+), as the tools will take forever, and the results will be incomprehensible.

Example usage:

bin/epoch_sim apps/ising_spin/ising_spin_8x8.xml --max-steps 20 --log-events events.xml
tools/render_event_log_as_dot.py events.xml
dot graph.dot -Tsvg -O

Should produce an svg called graph.svg.

tools/render_graph_as_field.py

Takes a snapshot log (generated by bin/epoch_sim), treats it as a 2D field, and renders it back out as a sequence of bitmaps.

Note that this tool assumes two pieces of meta-data exist:

Example usage:

# Create a fairly large gals heat instance
apps/gals_heat/create_gals_heat_instance.py 128 128 > tmp.xml
# Run a graph and record snapshots at each epoch into a file graph.snap
bin/epoch_sim tmp.xml --max-steps 100 --snapshots 1 graph.snap --prob-send 0.5
# Render the snapshots as pngs, rendering the time as colour
tools/render_graph_as_field.py tmp.xml  graph.snap   --bind-dev "cell" "state" "t" "blend_colors( (255,255,0), (255,0,255), 0, 10, (value%10))"
# Convert it to a video
ffmpeg -r 10 -i graph_%06d.png -vf "scale=trunc(iw/2)*2:trunc(ih/2)*2" -c:v libx264 -crf 18 gals_time.mp4

Should produce an mp4 called gals_time.mp4 which shows the per-device evolution of time. Note that you'll need a decent video player like Media Player Classic or VLC to handle it, as the built-in Windows players often don't work.

Parameters:

You can apply different filters for different device types, and so ignore devices which don't represent pixels.

bin/structurally_compare_graph_types

Takes two graph-types and checks that they are the same in terms of the device and message structures. This does not check whether the code is identical.

bin/topologically_compare_graph_instances

Takes two graph instances and checks that the topology is the same, i.e. it checks that there is the same set of device and edge instances, and that they have the same properties.

bin/topologically_diff_graph_instances

Takes two graph instances and checks that the topology is the same, i.e. it checks that there is the same set of device and edge instances, and that they have the same properties. In addition, it tries to print out a diff which highlights where the actual changes are.

This is similar to bin/topologically_compare_graph_instances, but the ability to also print diffs means it might be slower for large graphs.

Graph manipulation and conversion

tools/preprocess_graph_type.py

Some apps and graphs within graph_schema use pre-processing to embed or genericise the contents of graph types. Any expansions needed to create a standard graph type are done by this tool, which takes a graph type and produces an expanded/pre-processed graph type. Note that the pre-processing done here is in addition to (actually before) any standard C pre-processing.

This tool can be run on any graph type, and will pass through graph types that don't have any graph_schema pre-processing features.

Example usage:

tools/preprocess_graph_type.py apps/ising_spin_fix/ising_spin_fix_graph_type.xml > tmp.xml

If you run the above you'll find that the local header file referenced from the graph type has been embedded into the output xml.

Direct embedding of include files

It is often useful to separate functionality out into an external header, particularly when you want shared functionality or logic between a graph type and a sequential software implementation. However, it is desirable to create self-contained graph types which don't rely on any non-standard system header files, so that the graph files can be moved to remote machines and run there. The direct embedding pragma allows certain #include statements to be expanded in place when pre-processing the graph type, making the source code self-contained.

Given an include statement:

#include "some_local_header.hpp"

we can prefix it with a graph_schema-specific pragma:

#pragma POETS EMBED_HEADER
#include "some_local_header.hpp"

When the graph is pre-processed the code will be re-written into:

/////////////////////////////////////
//
#pragma POETS BEGIN_EMBEDDED_HEADER "some_local_header.hpp" "/the/actual/absolute_path/to/some_local_header.hpp"

void actual_contents_of_some_local_header()
{
}

#pragma POETS END_EMBEDDED_HEADER "some_local_header.hpp" "/the/actual/absolute_path/to/some_local_header.hpp"
//
////////////////////////////////////

The pre-processed code will then turn up in the headers.

The pragma is allowed in any place where source code appears in the graph type.

Note that this feature is not currently recursive, so if there is a #include in the file being embedded it won't get expanded.
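The expansion above can be sketched as a simple text transform (an illustrative approximation of what preprocess_graph_type.py does, not its actual code):

```python
import os
import re

PRAGMA = "#pragma POETS EMBED_HEADER"

def embed_headers(source, search_dir):
    """Expand each '#pragma POETS EMBED_HEADER' + '#include "..."' pair,
    splicing the header contents between BEGIN/END markers."""
    out = []
    lines = source.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].strip() == PRAGMA and i + 1 < len(lines):
            m = re.match(r'\s*#include\s+"([^"]+)"', lines[i + 1])
            if m:
                name = m.group(1)
                path = os.path.abspath(os.path.join(search_dir, name))
                with open(path) as f:
                    body = f.read()
                out.append('#pragma POETS BEGIN_EMBEDDED_HEADER "%s" "%s"' % (name, path))
                out.append(body)
                out.append('#pragma POETS END_EMBEDDED_HEADER "%s" "%s"' % (name, path))
                i += 2   # consume both the pragma and the include line
                continue
        out.append(lines[i])
        i += 1
    return "\n".join(out)
```

Like the real feature, this sketch is not recursive: any #include inside the embedded file is left alone.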

tools/convert_v2_graph_to_v3.py

Takes an XML graph type or graph instance which is written using version 2.* of the schema, and converts this to version 3 of the schema. This converts any __init__ input pins to <OnInit> handlers, and removes any application pin attributes.

Usage:

python tools/convert_v2_graph_to_v3.py <path-to-v2-XML> > <path-to-new-XML-file>

If no path to a new XML file is provided, the XML will be printed to the screen.

tools/convert_v3_graph_to_v4.py

Takes an XML graph type or graph instance which is written using version 3 of the schema, and converts this to version 4 of the schema. It attempts to preserve semantics, but currently documentation and meta-data are lost.

Usage:

python tools/convert_v3_graph_to_v4.py <path-to-v3-XML> > <path-to-new-XML-file>

tools/convert_v4_graph_to_v3.py

Takes an XML graph type or graph instance which is written using version 4 of the schema, and converts this to version 3 of the schema. It should convert all features of v4 correctly (correct as of the very first v4 spec).

Usage:

python tools/convert_v4_graph_to_v3.py <path-to-v4-XML> > <path-to-new-XML-file>

tools/poems

POEMS is a tool which is supposed to execute a graph instance as fast as possible across multiple cores. POEMS is a more advanced simulator than the others, and is really more of a full-on execution environment than a simulator. In some ways it is the successor to bin/queue_sim, but it shares no code with it.

Unlike epoch_sim and friends, POEMS compiles a single standalone executable for each graph type. The executable can load any instance of that graph type, but cannot adapt to other graph types. There are two stages to executing a graph in POEMS:

1 - Compile the graph into an executable using tools/poems/compile_poems_sim.sh, usually producing an executable called ./poems_sim.

2 - Execute the resulting executable on a graph instance by running the simulator and passing the name of the graph.

For example, to build and execute apps/ising_spin/ising_spin_8x8.xml:

tools/poems/compile_poems_sim.sh apps/ising_spin/ising_spin_8x8.xml
./poems_sim apps/ising_spin/ising_spin_8x8.xml

Simulation approach

POEMS is designed to scale across multiple threads of execution, and should have reasonably good weak scaling up to 64 or so cores - certainly on 8 cores you'll see a big benefit as long as there are 10K+ devices. The general approach is to decompose the graph into clusters of devices, where the number of intra-cluster edges is hopefully quite large. Messages between devices in the same cluster are local messages, while messages between devices in different clusters are non-local. Each cluster maintains a lock-free bag (technically I think it is wait-free) of incoming non-local messages, and any thread can post a non-local message to any cluster. The incoming bag of messages is completely un-ordered, and in order to make it lock-free and O(1) it is pretty much guaranteed that messages will always arrive out of order; they are actually quite likely to be eventually processed in LIFO order.

Each cluster is processed in a round-robin fashion by a pool of threads. The overall cluster step process is:

1 - If we have just completed an idle barrier, then execute the idle handler on all local devices.

2 - Grab the current bag of incoming non-local messages in O(1) time by atomically swapping with an empty bag. Other threads can still post to this cluster while this happens.

3 - For each incoming message in the non-local bag, deliver it to the local device, destroying the bag as you go. Other threads will be posting into a different bag by this point.

4 - For each device in the cluster, try to execute one send handler, or (lower priority) execute the device idle handler. If a device sends a message, then deliver any local edges by directly calling the receive handler. Any non-local edges are handled by posting a copy of the message into the input bag of the target cluster.

The stepping process is optimised to try to skip over inactive devices and clusters in a reasonably efficient way (though it would probably be better to have more than one level of clusters).
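The cluster stepping scheme can be sketched in simplified, single-threaded Python (an illustrative approximation: the real implementation uses lock-free atomic exchanges across a thread pool, and devices only send when they have something to say):

```python
class Cluster:
    """Simplified single-threaded sketch of a POEMS cluster (illustrative)."""
    def __init__(self, device_ids):
        self.inbox = {d: [] for d in device_ids}  # delivered messages, per device
        self.incoming = []                        # bag of non-local (device, msg)

    def step(self, routes, clusters):
        # Swap the incoming bag for an empty one (a lock-free exchange in reality);
        # other clusters keep posting into the fresh bag meanwhile.
        bag, self.incoming = self.incoming, []
        for dev, msg in bag:                      # unordered delivery
            self.inbox[dev].append(msg)
        # Give each device a chance to send.
        for dev in self.inbox:
            for dst_dev in routes.get(dev, []):
                target = clusters[dst_dev]
                if target is self:                # local edge: deliver directly
                    self.inbox[dst_dev].append(dev)
                else:                             # non-local: post to target's bag
                    target.incoming.append((dst_dev, dev))

# Two clusters with one cross-cluster edge a -> b.
ca, cb = Cluster(["a"]), Cluster(["b"])
clusters = {"a": ca, "b": cb}
routes = {"a": ["b"]}
ca.step(routes, clusters)   # "a" sends; message lands in cb's incoming bag
cb.step(routes, clusters)   # cb drains its bag and delivers to "b"
print(cb.inbox["b"])        # -> ['a']
```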

Idle detection is somewhat complicated due to the need to atomically commit on things amongst multiple threads. The approach used here has two levels:

1 - an initial heuristic level, which tries to guess when all clusters have gone idle.

2 - a precise verification level, where all the threads start blocking until there is only one left, which can verify that the system is completely idle.

There are probably still some threading bugs left in that process, though it is usually quite reliable.

There are quite a lot of optimisations in the system to try to reduce overheads, so this is less of a simulator and more a genuine execution platform. Optimisations include:

Compiler options

When compiling the poems simulator you can pass a number of options to control output and optimisation level:

Run-time options

When you invoke the simulator you get a number of options to control how it executes:

Suggestions on how to write and debug an application/graph

Many of the tools and features in graph_schema are intended to make the development and debugging of graphs easier, as getting functional correctness is often so difficult. The general idea here is to try to get a particular graph working on all possible platforms, rather than on one particular piece of hardware or one particular simulator. Proving that a graph is correct under all possible platforms is the same as proving that the graph is correct under all possible execution paths, which is often impossible due to the combinatorial explosion of possible execution states. However, we can get some confidence that a graph is correct by running it under many different execution patterns to look for failures. If we can then replay those failures it is possible to understand and debug them.

General approach for writing a graph

While each application is different, a general flow for getting an application functionally correct is:

1 - Write a reference software model in a sequential language. Trying to write a concurrent graph for something you don't understand sequentially is usually a waste of time.

2 - Design the set of state machines on paper. Typically you want two or three sets of sketches:

a - A set of state-machine diagrams, capturing the individual
    device types in the diagram. Each state is a bubble, and
    the edges between states represents messages arriving or
    leaving the device. You probably need to annotate the edges
    with:

    - Whether it is a send or receive (often using a different style for
      each is a good idea).

    - Which pin it is sending/receiving on.

    - Any guards which control whether an arc is taken or not.

b - An example topology diagram, showing how a simple graph instance
    would be constructed from those devices. You usually don't want
    more than 20 or so nodes, but you do want something big enough
    that it isn't trivial.

c - (Optional) A state sequence diagram showing one possible evolution
    of your graph instance over time. These are quite time consuming,
    but a partial state sequence diagram can really help with complex
    graphs.

It is tempting to skip these sketches, but for any reasonably complex
application you are just wasting time as you'll need to draw them
later when debugging.

3 - Convert your state machine diagrams into a graph-type. You should be extremely liberal with the use of assert when doing this, and ideally assert device state invariants at the top of every handler. If you don't know of any device state invariants, that suggests you haven't fully thought through the state machine diagram.

4 - Manually construct a device instance that exactly matches your example topology diagram. If you have up to 20 devices it doesn't take that long, and even the act of constructing it sometimes shows up problems. It may be tempting to construct a generator program at this point, but that may be premature - when manually constructing the instance you might find problems that invalidate your original design.

5 - Simulate your manual instance using epoch_sim. This is the friendliest simulator, which tends to expose the simplest bugs. If you get errors here then debug them (see later).

6 - Simulate your manual instance using graph_sim using a number of strategies. It is a good idea to run it for an hour or so with the random execution strategy and event logging turned on. If you get any errors, then debug the event log (see later). You really want to do this with simple graphs if you want a hope of debugging it, so the earlier you find problems the better.

7 - (Optional) Run it in hardware. There is little point in going to hardware at this point apart from warm fuzzies, but it is nice to do.

8 - Create an instance generator. This should almost always be configurable in size/scale, and in most cases the preference should be for a random generator rather than ingesting real problems (you can always come back to that). If at all possible, you should also design the instance generator to embed self-check information in the generated graph. For example, if there is a defined "end" state for device instances, you can embed the expected final value as a property. At run-time each device can assert it is correct at the end state, so you find out where the application has failed.

9 - Check that the application works at a bunch of scales. Think 2^n+-1, primes, and so on.

10 - Move to hardware! There is still a good chance it won't work in hardware, but at least any errors will be obscure and hard to find.

Debugging applications

You have three main tools in your debugging tool-box:

1 - Assertion failures: Try to make the application assert as soon as possible. As noted above, you want to assert any possible program invariants you can at the start of each handler. If necessary/possible, augment your device states and/or messages to carry extra debug information that allows you to assert stronger invariants. The fewer send and receive steps that happen before an assertion, the easier it is to work out why it is happening.

2 - Event logs: Some of the graph_schema tools allow you to capture the set of events within a failing program. You can then use them to try to find out what went wrong by looking at the messages sent and the state changes. As with assertions, you want to try to find the shortest event log that will find an error - more than about 100 events becomes very difficult to follow.

3 - Handler logs: print out information about the internal state of devices to try to recover what went wrong. This is a powerful technique for simple bugs, but a weak technique for complex bugs.

If you're more dedicated/desperate you can also bring to bear some more advanced methods:

4 - Checkpointing against predicted device state at key points. There used to be support for this in graph_schema, but it was removed as part of the standardisation for v3 (probably a mistake).

5 - Model checking. There is a tool called 'bin/hash_sim2' which will perform model checking of a graph using a depth-first search. While this tool can be very useful, you have to deal with the standard model checking problems of state explosion. Probably you'll need to adapt your graph to reduce the state-space and encourage merging of states, and you might need to hack the simulator a bit.

6 - Formal methods. Not discussed here.