codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations.

VC starvation bug in dragonfly-dally #237

Open kevinabrown opened 6 months ago

kevinabrown commented 6 months ago

The situation: Some globally routed packets get stuck in the network and stop reaching their destinations when the network is heavily loaded.

Background and reproduction with nearest neighbor traffic

Traffic: Nearest neighbor uses fixed-pair communication. Terminal X sends to terminal (X+1)%num_terminals. Three types of traffic:

VC usage: Each router port has 4 VCs. Packets start in VC0 when they enter their source router. At each subsequent router, the deadlock avoidance algorithm in dragonfly-dally chooses the VC for an incoming packet based on where the packet is along its journey. For example, when a packet leaves its source group, it is placed in the next higher-numbered VC: if it was in VC0 in the source group, it will be in VC1 in the next group. There are some more cases, but the packet never moves to a lower-numbered VC anywhere along its path.
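To make the one rule spelled out above concrete, here is a minimal sketch in C. It is not the dragonfly-dally code: `next_vc` and `crossed_group` are hypothetical names, keeping the same VC on a hop that stays inside a group is an assumption, and the "more cases" mentioned above are not modeled.

```c
#include <assert.h>

#define NUM_VCS 4 /* per the description above: 4 VCs per router port */

/* Hypothetical helper (not a CODES function): VC a packet is placed in
 * at the next router. crossed_group is nonzero when the hop takes the
 * packet into a different group.
 * Assumption: a hop that stays inside a group keeps the current VC. */
static int next_vc(int current_vc, int crossed_group)
{
    assert(current_vc >= 0 && current_vc < NUM_VCS);
    int vc = crossed_group ? current_vc + 1 : current_vc;
    assert(vc < NUM_VCS); /* the sketch assumes the path never needs more VCs */
    return vc;            /* never lower than current_vc */
}
```

Under these assumptions, `next_vc(0, 1)` gives 1 for a packet leaving its source group, and the returned VC is never lower than the one passed in.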

The cause: When a port is sending packets, the VC arbitration algorithm chooses which VC to send from. In CODES, it loops over all VCs, checks whether each has any packets waiting, and breaks out of the loop to send the first packet it finds. The loop always starts from VC0, so under load it always finds a packet there, since the network is full of packets going to neighbors in the source group. Higher-numbered VCs are starved, and their globally routed packets are never delivered. This scheme can be described as priority-first, where lower-numbered VCs have higher priority.
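The starvation can be reproduced with a toy model of that loop. This is a simplified sketch, not the actual CODES source: the function names are hypothetical, and each VC is modeled as a plain packet count rather than a real queue.

```c
#include <assert.h>

#define NUM_VCS 4

/* Priority-first arbitration as described above: always scan from VC0
 * and return the first VC that has a packet waiting, or -1 if none. */
static int pick_vc_priority_first(const int queued[NUM_VCS])
{
    for (int vc = 0; vc < NUM_VCS; vc++) {
        if (queued[vc] > 0)
            return vc;
    }
    return -1;
}

/* Toy load model (an assumption for illustration): VC0 and VC1 each
 * start with one packet, and local traffic adds one packet to VC0
 * after every send. Returns how many packets were sent from VC1. */
static int vc1_sends_under_load(int rounds)
{
    assert(rounds >= 0);
    int queued[NUM_VCS] = {1, 1, 0, 0};
    int vc1_sent = 0;
    for (int i = 0; i < rounds; i++) {
        int vc = pick_vc_priority_first(queued);
        if (vc < 0)
            break;
        queued[vc]--;
        if (vc == 1)
            vc1_sent++;
        queued[0]++; /* VC0 is refilled before the next send slot */
    }
    return vc1_sent;
}
```

In this model VC0 is never empty when the port arbitrates, so the packet waiting in VC1 is never selected, however many send slots are simulated.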

There are some other bugs and design issues that compounded this problem, but they are not relevant to the fix or our study so I won't discuss them here.

A fix: We can change the VC arbitration algorithm from priority-first to round-robin. That is, if the port last sent a packet from VC0, it starts its next check from VC1 and keeps cycling over the VCs in a round-robin manner.
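A minimal sketch of what round-robin arbitration could look like, in C. The `port_state` struct and `pick_vc_round_robin` are hypothetical names for illustration, not the code in CODES, and each VC is again modeled as a packet count.

```c
#include <assert.h>

#define NUM_VCS 4

/* Hypothetical per-port state (not the CODES router struct). */
struct port_state {
    int queued[NUM_VCS]; /* packets waiting in each VC */
    int last_vc;         /* VC served most recently; NUM_VCS - 1 initially */
};

/* Round-robin arbitration: start scanning at the VC after the one
 * served last, wrap around, and serve the first non-empty VC.
 * Returns the VC served, or -1 if every VC is empty. */
static int pick_vc_round_robin(struct port_state *p)
{
    assert(p->last_vc >= 0 && p->last_vc < NUM_VCS);
    for (int i = 1; i <= NUM_VCS; i++) {
        int vc = (p->last_vc + i) % NUM_VCS;
        if (p->queued[vc] > 0) {
            p->queued[vc]--;
            p->last_vc = vc; /* next scan begins at vc + 1 */
            return vc;
        }
    }
    return -1;
}
```

Starting with `last_vc = NUM_VCS - 1`, the first scan begins at VC0. With five packets in VC0 and one in VC1, the first two calls serve VC0 and then VC1, so the higher-numbered VC gets a turn even while VC0 stays busy. The last-served VC is checked last in each scan, so a port with only one busy VC still drains it.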

helq commented 5 months ago

A partial fix (by @kevinabrown, following the strategy proposed above) can be found in commits 8e0f4501acbefb3682e1a94238ef00cd957623a5 and 98aba5e10ad767c94f4ca6e1ca3aeddab52a9148.