HazyResearch / fly


Pixelfly is a binary hypercube! #7


justheuristic commented 2 years ago

Okay, so imagine the full pixelfly matrix with 2^n blocks


Let's give each input block a number 0 ... 2^n - 1. Then the pixelfly matrix can be defined as: block i connects to block j iff their binary indices differ in exactly one bit (i.e. i XOR j is a power of two).

This is exactly the condition that defines adjacency in an n-dimensional binary hypercube: two vertices are connected iff their binary labels are at Hamming distance 1.

Ergo, pixelfly neurons actually form a cube and "connect" over the edges of said cube.
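A minimal sketch of that equivalence (illustrative names only, not the repo's API; whether the diagonal blocks are included depends on the exact pixelfly pattern):

```python
import numpy as np

def hypercube_block_mask(n: int, with_diagonal: bool = True) -> np.ndarray:
    """Block-level connectivity for 2^n blocks: block i is linked to block j
    iff their binary indices differ in exactly one bit (one hypercube edge),
    optionally plus the diagonal (i == j)."""
    num_blocks = 2 ** n
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for i in range(num_blocks):
        if with_diagonal:
            mask[i, i] = True
        for k in range(n):
            mask[i, i ^ (1 << k)] = True  # flipping bit k walks one hypercube edge
    return mask

mask = hypercube_block_mask(4)
# Every block has exactly n = 4 hypercube neighbours (plus itself on the diagonal).
assert (mask.sum(axis=1) == 4 + 1).all()
```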

p.s. not my original thought, discovered in discussion with @ostroumova-la

p.p.s. if that is the case, are there other geometric shapes we could try? For instance, the fully connected matrix can be viewed as an n-dimensional simplex (triangle -> tetrahedron -> simplex -> ...) because every block connects to every other block. Then comes the hypercube of pixelfly. Then what?
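For a rough sense of the gap between those two shapes, here is a back-of-the-envelope count (a sketch, not tied to the repo) of undirected block pairs, ignoring the diagonal:

```python
# "simplex" (dense): every block pairs with every other block
# "hypercube" (pixelfly): every block pairs with the n blocks one bit-flip away
for n in range(2, 9):
    num_blocks = 2 ** n
    simplex_edges = num_blocks * (num_blocks - 1) // 2
    hypercube_edges = n * num_blocks // 2
    print(f"n={n}: blocks={num_blocks}, "
          f"simplex edges={simplex_edges}, hypercube edges={hypercube_edges}")
```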

tridao commented 2 years ago

This is super cool! I love this interpretation :D

It might relate to how this butterfly pattern of connections was used in telephone switching networks and computer networks.

Would love to chat more!

justheuristic commented 2 years ago

Thanks for the reference. I did not know about telephone switching, but I've seen NVIDIA use the same pattern for adding vectors in parallel in scientific computing.

Would love to chat as well. We can find each other on Discord (I'm yozh there) -- or any alternative you prefer.

Random thoughts:

1. It is possible to halve the max distance in a hypercube graph by adding one more "edge" per node (in discussion with @denkorzh). This is known as folding the cube: connect each node to the single farthest node, so node 1101 would connect to 0010. This corresponds to adding one more "1" to each row of the layout matrix. The max distance (graph diameter) drops from log2(num_blocks) to ceil(log2(num_blocks) / 2). In other words, information can travel from any unit to any other unit with half as many layers (see the sketch after this list).

This may be useful if we scale pixelfly to very large models. E.g., GPT-3 has 12288 hidden units, which translates to a 256x256 grid of 48x48 blocks; it would take 8 consecutive layers to propagate a signal from block 00000000 to block 11111111. After folding, we get marginally more parameters, but it takes only 4 layers (instead of 8) to connect any pair of blocks.

2. Relation to efficient convolutions and GNNs / GCNs (proposed by @TimDettmers).

3. There are directed graphs that have similar properties with fewer weights and/or less compute due to asymmetry (see the update below).

[projected usefulness = low] 4. product spaces

[projected usefulness = none] 5. There are two "brethren" of the binary hypercube that are more dense (in discussion with @denkorzh).
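To sanity-check the folding claim from item 1, here is a small self-contained sketch (plain Python, not code from this repo) that measures the graph diameter before and after folding, using the 256-block layout from the GPT-3 example above:

```python
from collections import deque

def diameter(neighbors):
    """Longest shortest path over all vertex pairs (BFS from every vertex)."""
    best = 0
    for src in range(len(neighbors)):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

n = 8                     # 2^8 = 256 blocks, as in the GPT-3 example
num_blocks = 2 ** n
# hypercube: each node connects to the n nodes one bit-flip away
cube = [[i ^ (1 << k) for k in range(n)] for i in range(num_blocks)]
# folded cube: additionally connect each node to its bitwise complement
folded = [nbrs + [i ^ (num_blocks - 1)] for i, nbrs in enumerate(cube)]

print(diameter(cube))     # 8 = log2(num_blocks)
print(diameter(folded))   # 4 = ceil(log2(num_blocks) / 2)
```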

justheuristic commented 2 years ago

Update (from @ostroumova-la)

  1. There are directed graphs that have similar properties with fewer weights and/or less compute due to asymmetry:
source code:

```python
import igraph

# Kautz graphs: directed graphs with a small diameter relative to their degree.
for m in range(1, 8):
    for n in range(1, 6):
        g = igraph.GraphBase.Kautz(m, n)
        edges = g.get_edgelist()
        src_ix, dst_ix = zip(*edges)
        num_vertices = max(max(src_ix), max(dst_ix)) + 1
        num_edges = len(edges)
        diameter = g.diameter()
        if num_vertices < 100:
            continue  # skip tiny graphs
        print(f"Kautz({m=}, {n=}) => V={num_vertices}, "
              f"E/V={num_edges / num_vertices}, {diameter=}")
```

Again, this may not reflect deep learning performance, but it just might work for transformers -- and it has fewer edges -> less compute.