Allow external device keys to come from multiple FPGA sources

In the current code, an external device that is connected to an FPGA can only be specified as being fed from a single FPGA id and FPGA link id. We are currently looking at an interface that distributes the traffic between the different FPGA links in order to achieve as high a transfer bandwidth as possible. It would therefore be helpful to be able to specify a single external device using ApplicationFPGAVertex but then specify multiple FPGA id/FPGA link id pairs, or even just say "all connections on this board" or "all links from this FPGA".

Some potential thoughts / issues:

The ApplicationFPGAVertex could have an ExternalDeviceSplitter associated that then resolves it into multiple MachineFPGAVertex instances, one for each of the source vertices.
The current use case specifies that in this case, the same keys would potentially originate from all FPGA links specified.
- This means that routing key allocation should assign the same key to multiple MachineFPGAVertex instances if they have the same ApplicationFPGAVertex. Although this needs to be added, it shouldn't break any existing functionality because there is currently a one-to-one mapping between ApplicationFPGAVertex and MachineFPGAVertex.
- This also means that if fixed keys are specified at the application level, those keys should be given to all machine vertices of an ApplicationFPGAVertex. Routing key allocation algorithms should also check this.
- An additional implication is that routing algorithms should ensure that they send traffic from these separate locations down the same links on any router where they might cross. This will avoid issues where a single packet could end up multiplying due to routing considerations.
- It isn't currently possible to route an external device to itself, but this is a "feature" that has arisen because this would require recognition of such edges to ensure that a non-virtual router is given a routing entry (fairly easy to do, but not currently done). An ApplicationFPGAVertex with multiple FPGA links as sources must not route to itself, or it will create a routing loop!
- It is hoped that only stopping an external device routing to itself is enough to avoid a loop; further analysis may be needed to ensure that this is the case.
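As a rough illustration of the key-sharing points above, here is a minimal Python sketch. `AppVertex`, `MachineVertex` and `allocate_keys` are hypothetical stand-ins invented for this example, not PACMAN classes or its actual allocator, and the key step of `0x00100000` is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class AppVertex:
    """Hypothetical stand-in for an ApplicationFPGAVertex."""
    label: str
    fixed_key: Optional[int] = None


@dataclass(frozen=True)
class MachineVertex:
    """Hypothetical stand-in for a MachineFPGAVertex."""
    app_vertex: AppVertex
    fpga_id: int
    fpga_link_id: int


def allocate_keys(machine_vertices: Iterable[MachineVertex],
                  first_free_key: int = 0x00100000,
                  key_step: int = 0x00100000) -> Dict[MachineVertex, int]:
    """Give every machine vertex of one application vertex the same key."""
    keys_by_app: Dict[AppVertex, int] = {}
    allocation: Dict[MachineVertex, int] = {}
    next_key = first_free_key
    for vertex in machine_vertices:
        app = vertex.app_vertex
        if app not in keys_by_app:
            if app.fixed_key is not None:
                # A fixed key set at application level is used as-is.
                keys_by_app[app] = app.fixed_key
            else:
                keys_by_app[app] = next_key
                next_key += key_step
        allocation[vertex] = keys_by_app[app]
    return allocation
```

With a one-to-one mapping between application and machine vertices this behaves like a plain per-vertex allocator, which is why it should not disturb existing behaviour. The sketch does not check that fixed keys avoid the automatically allocated range.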

One idea would be to temporarily use different keys during routing. For example, if you have 4 machine vertices, increase the fixed key mask by an extra 2 bits. In the 4 FPGAs, set neither bit, the first, the second, or both. In the receiver, set the two bits back to zero.

Things to watch out for

There must be no other fixed key that the larger key will clash with.
The receiver must not set to zero the bits in keys from other sources.
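The temporary-key idea could look something like the sketch below. The 12-bit base mask and the position of the two extra bits (18 and 19) are assumptions made for illustration only, and `tag_key` / `untag_key` are hypothetical names.

```python
BASE_MASK = 0xFFF00000          # assumed fixed key mask of the device
TAG_BITS = 2                    # enough to distinguish 4 machine vertices
TAG_SHIFT = 18                  # assumed position, just below the base key
TAG_MASK = ((1 << TAG_BITS) - 1) << TAG_SHIFT
ROUTING_MASK = BASE_MASK | TAG_MASK   # mask used while routing


def tag_key(key: int, vertex_index: int) -> int:
    """Mark a key with the index of the machine vertex it came from."""
    if not 0 <= vertex_index < (1 << TAG_BITS):
        raise ValueError("vertex index does not fit in the tag bits")
    if key & TAG_MASK:
        raise ValueError("tag bits are already used by this key")
    return key | (vertex_index << TAG_SHIFT)


def untag_key(key: int, base_key: int) -> int:
    """Clear the tag bits at the receiver, only for keys of this source."""
    if (key & BASE_MASK) != (base_key & BASE_MASK):
        return key  # key from another source: leave it untouched
    return key & ~TAG_MASK & 0xFFFFFFFF
```

The check in `untag_key` covers the second point to watch out for. The first one (clashes with other fixed keys) would still need to be checked by whatever allocates the keys.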

After some investigation, it appears that it is not possible for the device to send the same keys to multiple FPGAs without a lot of disruption. In particular, this would restrict the placement of receiving cores to ensure that they are not on any of the FPGA-connected cores, as otherwise loops will happen.

The current design is that the keys and sending device are set up so that pixels of a source retina that are close to each other are sent to different FPGAs, i.e. the LSBs of the dimension fields in the keys are used to determine which FPGA the pixels are sent to. For example, take a key with fields:

| key = 12 bits | polarity = 1 bit | y = 9 bits | x = 10 bits |

If we have 8 FPGA links to send over, the 1 LSB of y and 2 LSBs of x can be used to determine which FPGA link to send over, giving a mask of 0xFFF00403.
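To make the arithmetic concrete, here is a small Python sketch of the field layout and the resulting mask. The function names are made up for this example, and numbering the links as `(y LSB << 2) | x LSBs` is an assumption about how the device orders them.

```python
X_BITS = 10
Y_BITS = 9
Y_SHIFT = X_BITS                 # y starts at bit 10
P_SHIFT = Y_SHIFT + Y_BITS       # polarity is bit 19
BASE_SHIFT = P_SHIFT + 1         # 12-bit base key starts at bit 20


def make_key(base: int, polarity: int, y: int, x: int) -> int:
    """Pack the four fields into a 32-bit key."""
    return (base << BASE_SHIFT) | (polarity << P_SHIFT) | (y << Y_SHIFT) | x


def source_mask(n_y_lsbs: int = 1, n_x_lsbs: int = 2) -> int:
    """Mask covering the base key plus the LSBs that select the FPGA link."""
    base_mask = 0xFFF << BASE_SHIFT
    y_mask = ((1 << n_y_lsbs) - 1) << Y_SHIFT
    x_mask = (1 << n_x_lsbs) - 1
    return base_mask | y_mask | x_mask


def fpga_link_index(key: int, n_y_lsbs: int = 1, n_x_lsbs: int = 2) -> int:
    """Link a pixel is sent over (assumed numbering: y LSBs above x LSBs)."""
    x_part = key & ((1 << n_x_lsbs) - 1)
    y_part = (key >> Y_SHIFT) & ((1 << n_y_lsbs) - 1)
    return (y_part << n_x_lsbs) | x_part
```

For example, `source_mask()` gives `0xFFF00403`, and horizontally or vertically adjacent pixels always map to different link indices.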

The next challenge with this layout is to send appropriate squares to the appropriate receivers i.e. in the case being considered, the receivers are convolution populations. This means that there are multiple sources of the keys to be received by each target core. Specifically, it is not desirable to receive all the keys at all of the target cores, as this means that the targets have to deal with more keys than they can handle.

Possible ideas:

Simply filter the edges. Challenge: the above key and mask represent multiple pixels in the image, some of which should come to the core, and some of which shouldn't. This would instead require splitting the source into multiple individual pixels rather than subsets, leading to a lot more routing entries. It might be possible for these to be compressed.
Mark the edge with an extra constraint to indicate that it needs further filtering. Normal routes will be generated for all the keys, but at the point of routing table generation, these can be filtered by using a different key and mask at the receiver than at the sender. In the extreme (but simplest) case, this would mean changing the routing entries at the target to split out the keys wanted from the keys not wanted, with the addition that any keys not wanted would need to have a routing entry that targeted nothing. This is similar to the previous bit-field router compression scenario, but might be helped by the fact that the source and target key structure is known. Example:
- Sources are as above with 8 FPGA sources, from a retina with 640 (10-bits) x 480 (9-bits) pixels, both polarities sent to the same targets. The source mask is 0xFFF00403. Assume a retina base key, in the first 12 bits, of 0x0 meaning the routing keys for polarity 0, for FPGA links 0-7, are: 0x00000000, 0x00000001, 0x00000002, 0x00000003, 0x00000400, 0x00000401, 0x00000402, 0x00000403.
- Multiple targets are used to cover the 640x480 source key space.
- Each target can handle a 16x16 space, so the mask at the receiver is 0xFFF7C3F0, which means there are 12 bits for the source-specific key (0xFFF), a 1 bit hole for polarity followed by 1s in the most-significant 5 bits of y, 0s in the least-significant 4 bits of y, 1s in the most-significant 6 bits of x and 0s in the remaining bits of x.
- The "first" core wants to receive the top-left 16x16 space (i.e. min value of x, max value of y), which is represented by keys of 0x0007C000 (for polarity 0) and 0x000FC000 (for polarity 1).
- From the source overall, the core therefore doesn't want keys that don't match this. This can be achieved with a routing entry placed in the table after the above which says all keys from the source go nowhere i.e. key = 0x0, mask = 0xFFF00000 (to match the 12-bits that make up the base key), route = 0.
- If the route for the source continues beyond this chip as well as targeting cores on this chip, the final route and all added extra routes should reflect this. So the links out are kept for all routes added.
- Where multiple cores on the same chip handle different parts of the same source, the keys that are to match for a particular core must be added before the final "filter all" mask. It should be possible to determine if this final entry is needed at all by grouping entries by source application vertex.
- Where there are multiple edges with similar constraints, they should be grouped by the application-level source. There can then be a final filtering route added on any chip where any of the application-level sources reach, but not all cores on that chip are targeted.
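The example above could be sketched in Python roughly as follows. This models a router as an ordered list where the first matching entry wins; `Entry`, `lookup`, `receiver_mask` and `filtered_entries` are hypothetical names for this illustration and not part of PACMAN.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Optional, Tuple

SOURCE_BASE_MASK = 0xFFF00000   # matches only the 12-bit base key


class Entry(NamedTuple):
    key: int
    mask: int
    cores: FrozenSet[int]   # target cores on this chip
    links: FrozenSet[int]   # outgoing links to other chips


def receiver_mask(block_bits: int = 4) -> int:
    """Mask for a 2**block_bits square; the polarity bit is left as a hole."""
    y_high = (0x1FF & ~((1 << block_bits) - 1)) << 10
    x_high = 0x3FF & ~((1 << block_bits) - 1)
    return SOURCE_BASE_MASK | y_high | x_high


def filtered_entries(base_key: int,
                     wanted: Iterable[Tuple[int, Iterable[int]]],
                     links: Iterable[int]) -> List[Entry]:
    """Wanted blocks first, then a final entry that targets no cores."""
    mask = receiver_mask()
    out_links = frozenset(links)
    entries = [Entry(key & mask, mask, frozenset(cores), out_links)
               for key, cores in wanted]
    # Everything else from this source is dropped locally but still
    # forwarded along any links the original route used.
    entries.append(Entry(base_key & SOURCE_BASE_MASK, SOURCE_BASE_MASK,
                         frozenset(), out_links))
    return entries


def lookup(table: List[Entry], packet_key: int) -> Optional[Entry]:
    """Return the first entry whose key/mask matches, like a router would."""
    for entry in table:
        if packet_key & entry.mask == entry.key:
            return entry
    return None
```

Because the polarity bit is a hole in the receiver mask, a single entry with key `0x0007C000` matches both `0x0007C000` and `0x000FC000`. Ordering matters here: the "filter all" entry must come after every wanted block for that source.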

Thinking about this again, the aim is that a) the FPGA packets arrive such as to maximize the bandwidth of the reception and b) the cores that receive the packets minimize the number of unwanted packets received. With the above encoding, adjacent pixels are sent to different FPGA links, which would appear to maximize bandwidth. However as it stands, each of these FPGAs will be represented by a single virtual machine vertex. This means that any core that wants to receive any of the pixels being sent by that FPGA link will have to receive all of them. The above then suggests a mechanism that would reduce the load on the cores, but would end up still putting the full load on any router between the FPGA and any target chip, since there will be a single machine edge involved in that link.

This is shown in the picture below. In this picture, all pixels that are received on FPGA link 0 are shown in red, and similarly those from FPGA link 3 are shown in blue, link 4 in green and link 7 in yellow (others are not shown to reduce the cluttering). So to receive the pixels that make up the black square, all the pixels from all the FPGAs must be received at least at the chip.

[Image: Retina FPGA Mapping]

Another way to split things is therefore to have multiple machine vertex sources, each of which groups more local pixels together again. Again, taking the 640x480 image, using the mask of 0xFFF00403 we have now split this into 2x4 rectangles, where each pixel of a rectangle is sent to a different FPGA. We can now choose to group these rectangles into chunks of e.g. 16x16 pixels (i.e. a group of 8x4 of the 2x4 rectangles). Each of these can now be assigned to appear to be sent by separate machine vertices of the virtual device, which means that each such machine vertex will send a subset of the pixels received by an FPGA. Now to receive the black square in the diagram again, the receiver would only need to receive the pixels in that square, which can be done by filtering edges. Receiving the purple square would still require extra pixels to be received, but still fewer than the whole image.
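A quick sketch of the chunking, assuming one hypothetical source machine vertex per 16x16 chunk, numbered row by row (the numbering and the function names are assumptions for illustration):

```python
WIDTH, HEIGHT = 640, 480
CHUNK = 16
CHUNKS_PER_ROW = WIDTH // CHUNK    # 40
CHUNK_ROWS = HEIGHT // CHUNK       # 30


def chunk_index(x: int, y: int) -> int:
    """Index of the (hypothetical) source machine vertex owning a pixel."""
    return (y // CHUNK) * CHUNKS_PER_ROW + (x // CHUNK)


def chunks_for_square(x0: int, y0: int, size: int) -> set:
    """Source chunks a receiver must listen to for a size x size square."""
    cols = range(x0 // CHUNK, (x0 + size - 1) // CHUNK + 1)
    rows = range(y0 // CHUNK, (y0 + size - 1) // CHUNK + 1)
    return {row * CHUNKS_PER_ROW + col for row in rows for col in cols}
```

A 16x16 square aligned to the chunk grid needs exactly one source chunk, whereas one offset by 8 pixels in each direction overlaps four chunks, i.e. 1024 pixels rather than all 307200. Presumably that is the difference between the black and purple squares in the picture.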

Fixed; this is now implemented in current code

SpiNNakerManchester / PACMAN

Allow external device keys to come from multiple FPGA sources #397