This PR makes it possible to choose from several pre-defined connectivity patterns between ranks when defining the graph neural network. The legacy implementation had the following hard-coded message passing scheme: $$(0 \rightarrow 0), (0 \rightarrow 1), (1 \rightarrow 1), (1 \rightarrow 2)$$ In this notation, $a \rightarrow b$ means that cells of rank $a$ send messages to cells of rank $b$. The legacy implementation therefore made it impossible to consider other connections: for example, there was no way to make cells of rank 1 send messages to cells of rank 0, or to make rank 0 send messages to rank 2 directly.
This PR introduces seven pre-defined connectivity patterns to choose from:
- `self_and_next` generates adjacencies where each rank is connected to itself and the next (higher) rank.
- `self_and_higher` generates adjacencies where each rank is connected to itself and all higher ranks.
- `self_and_previous` generates adjacencies where each rank is connected to itself and the previous (lower) rank.
- `self_and_lower` generates adjacencies where each rank is connected to itself and all lower ranks.
- `self_and_neighbors` generates adjacencies where each rank is connected to itself, the next (higher) rank, and the previous (lower) rank.
- `all_to_all` generates adjacencies where each rank is connected to every other rank, including itself.
- `legacy` ignores the `max_dim` parameter and returns `['0_0', '0_1', '1_1', '1_2']`.
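The patterns above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the function name `generate_adjacencies` is hypothetical, and the `'{a}_{b}'` string format and the assumption that ranks run from 0 to `max_dim` inclusive are inferred from the legacy output `['0_0', '0_1', '1_1', '1_2']`.

```python
def generate_adjacencies(pattern: str, max_dim: int) -> list[str]:
    """Hypothetical sketch: list rank-to-rank adjacencies for a pattern.

    Each adjacency '{a}_{b}' means cells of rank a send messages to
    cells of rank b. Ranks are assumed to run from 0 to max_dim.
    """
    if pattern == "legacy":
        # The legacy scheme is hard-coded and ignores max_dim.
        return ["0_0", "0_1", "1_1", "1_2"]

    ranks = range(max_dim + 1)
    keep = {
        "self_and_next": lambda a, b: b == a or b == a + 1,
        "self_and_higher": lambda a, b: b >= a,
        "self_and_previous": lambda a, b: b == a or b == a - 1,
        "self_and_lower": lambda a, b: b <= a,
        "self_and_neighbors": lambda a, b: abs(b - a) <= 1,
        "all_to_all": lambda a, b: True,
    }[pattern]
    return [f"{a}_{b}" for a in ranks for b in ranks if keep(a, b)]

print(generate_adjacencies("self_and_next", 2))
# ['0_0', '0_1', '1_1', '1_2', '2_2']
```

Note that with `max_dim=2`, `self_and_next` differs from `legacy` only by the extra `2_2` self-connection at the top rank.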
When calling the main training script, the desired pattern can be selected via a flag, e.g. `--connectivity self_and_next`.
The theory suggests that there may be little difference between these options. It will be interesting to see if we can empirically validate that claim.