HazyResearch / fly


Pixelfly is a binary hypercube! #7


justheuristic commented 2 years ago

Okay, so imagine the full pixelfly matrix with 2^n blocks


Let's give each input block a number 0 ... 2^n - 1. Then the pixelfly matrix can be defined as: block i connects to block j iff their binary indices differ in exactly one bit (i.e. i XOR j is a power of two).

This is exactly the condition that defines adjacency in an n-dimensional binary hypercube: two vertices are connected iff their binary labels are at Hamming distance 1.

Ergo, pixelfly neurons actually form a cube and "connect" over the edges of said cube.
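A minimal sketch of that equivalence (illustrative names only, not the repo's API; whether the diagonal blocks are included depends on the exact pixelfly pattern):

```python
import numpy as np

def hypercube_block_mask(n: int, with_diagonal: bool = True) -> np.ndarray:
    """Block-level connectivity for 2^n blocks: block i is linked to block j
    iff their binary indices differ in exactly one bit (one hypercube edge),
    optionally plus the diagonal (i == j)."""
    num_blocks = 2 ** n
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for i in range(num_blocks):
        if with_diagonal:
            mask[i, i] = True
        for k in range(n):
            mask[i, i ^ (1 << k)] = True  # flipping bit k walks one hypercube edge
    return mask

mask = hypercube_block_mask(4)
# Every block has exactly n = 4 hypercube neighbours (plus itself on the diagonal).
assert (mask.sum(axis=1) == 4 + 1).all()
```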

p.s. not my original thought, discovered in discussion with @ostroumova-la

p.p.s. if that is the case, are there other geometric shapes we could try? For instance, the fully connected matrix can be viewed as an n-dimensional simplex (triangle -> tetrahedron -> simplex -> ...) because every block connects to every other block. Then comes the hypercube of pixelfly. Then what?
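For a rough sense of the gap between those two shapes, here is a back-of-the-envelope count (a sketch, not tied to the repo) of undirected block pairs, ignoring the diagonal:

```python
# "simplex" (dense): every block pairs with every other block
# "hypercube" (pixelfly): every block pairs with the n blocks one bit-flip away
for n in range(2, 9):
    num_blocks = 2 ** n
    simplex_edges = num_blocks * (num_blocks - 1) // 2
    hypercube_edges = n * num_blocks // 2
    print(f"n={n}: blocks={num_blocks}, "
          f"simplex edges={simplex_edges}, hypercube edges={hypercube_edges}")
```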

tridao commented 2 years ago

This is super cool! I love this interpretation :D

It might relate to how this butterfly pattern of connections was used in telephone switching networks and computer networks.

Would love to chat more!

justheuristic commented 2 years ago

Thanks for the reference. I did not know about telephone switching, but I've seen NVIDIA use the same pattern for adding vectors in parallel in scientific computing.

Would love to chat as well. We can find each other on Discord (I'm yozh there) -- or any alternative you prefer.

Random thoughts:

1. It is possible to halve the max distance in a hypercube graph by adding one more "edge" per node (in discussion with @denkorzh). This is known as folding the cube: connect each node to the single farthest node, so node 1101 would connect to 0010. This corresponds to adding one more "1" to each row of the layout matrix. The max distance (graph diameter) drops from log2(num_blocks) to ceil(log2(num_blocks) / 2). In other words, information can travel from any unit to any other unit with half as many layers (see the sketch after this list).

This may be useful if we scale pixelfly to very large models. E.g., GPT-3 has 12288 hidden units, which translates to a 256x256 grid of 48x48 blocks; it would take 8 consecutive layers to propagate a signal from block 00000000 to block 11111111. After folding, we get marginally more parameters, but it takes only 4 layers (instead of 8) to connect any pair of blocks.

2. Relation to efficient convolutions and GNNs / GCNs (proposed by @TimDettmers).

3. There are directed graphs that have similar properties with fewer weights and/or less compute due to asymmetry (see the update below).

[projected usefulness = low] 4. product spaces

[projected usefulness = none] 5. There are two "brethren" of the binary hypercube that are more dense (in discussion with @denkorzh).
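To sanity-check the folding claim from item 1, here is a small self-contained sketch (plain Python, not code from this repo) that measures the graph diameter before and after folding, using the 256-block layout from the GPT-3 example above:

```python
from collections import deque

def diameter(neighbors):
    """Longest shortest path over all vertex pairs (BFS from every vertex)."""
    best = 0
    for src in range(len(neighbors)):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

n = 8                     # 2^8 = 256 blocks, as in the GPT-3 example
num_blocks = 2 ** n
# hypercube: each node connects to the n nodes one bit-flip away
cube = [[i ^ (1 << k) for k in range(n)] for i in range(num_blocks)]
# folded cube: additionally connect each node to its bitwise complement
folded = [nbrs + [i ^ (num_blocks - 1)] for i, nbrs in enumerate(cube)]

print(diameter(cube))     # 8 = log2(num_blocks)
print(diameter(folded))   # 4 = ceil(log2(num_blocks) / 2)
```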

justheuristic commented 2 years ago

Update (from @ostroumova-la)

  1. There are directed graphs that have similar properties with fewer weights and/or less compute due to asymmetry:
source code:

```python
import igraph

# Kautz graphs: directed graphs with a small diameter relative to their degree.
for m in range(1, 8):
    for n in range(1, 6):
        g = igraph.GraphBase.Kautz(m, n)
        edges = g.get_edgelist()
        src_ix, dst_ix = zip(*edges)
        num_vertices = max(max(src_ix), max(dst_ix)) + 1
        num_edges = len(edges)
        diameter = g.diameter()
        if num_vertices < 100:
            continue  # skip tiny graphs
        print(f"Kautz({m=}, {n=}) => V={num_vertices}, "
              f"E/V={num_edges / num_vertices}, {diameter=}")
```

Again, this may not reflect deep learning performance, but it just might work for transformers -- and it has fewer edges -> less compute.