codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations.

reduce memory usage of torus (Imported #67) #67

nmcglo closed this issue 5 years ago

nmcglo commented 8 years ago

Original Issue Author: Jonathan Jenkins Original Issue ID: 67 Original Issue URL: https://xgitlab.cels.anl.gov/codes/codes/issues/67


Looking into the event count of the dragonfly (most of which also applies to the torus queuing), the majority of events are generated by splitting packets into chunks (chunk size is 64 bytes).

So, for example, a 4MB message will generate an event population of 65K packet generate events per originating node. In the case of a very small-scale dragonfly with 72 nodes and 36 routers, sending a 4MB message will generate roughly 2M arrival events at the routers and the same number of departure events. Note that the number of routers traversed in minimal routing is 4, so we can have around 16M events generated for a 4MB message.
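(For reference, a rough back-of-the-envelope version of that arithmetic; the constants below simply restate the figures in this comment, and the per-hop cost assumes one arrival plus one departure event per chunk per traversed router.)

#include <stdio.h>

int main(void)
{
    const unsigned long msg_bytes   = 4UL * 1024 * 1024; /* 4MB message */
    const unsigned long chunk_bytes = 64;                /* chunk size */
    const unsigned long hops        = 4;                 /* routers on a minimal route */

    unsigned long chunks = msg_bytes / chunk_bytes;      /* 65536, i.e. ~65K */
    printf("packet generate events per sending node: %lu\n", chunks);
    /* each traversed router sees one arrival and one departure per chunk */
    printf("router events per 4MB message:           %lu\n", chunks * hops * 2);
    return 0;
}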

Fixing the chunk issuance in a loop brings down the event population to some extent, but not by much, since the primary bottleneck in the dragonfly is the router. In the case of the torus, long queues form at the torus links, especially if the torus is of lower dimension. One downside of sequential chunk issuance (as we saw in the case of the dragonfly) is that it affects the number of packets queued at the terminals: with it, the maximum number of packets queued at the terminal/node VCs in the dragonfly is 2. That is acceptable in the dragonfly, since the real queues form at the routers, not the terminals/nodes.

The thing that has substantially helped in bringing down the event population for the dragonfly is minimizing router events as much as possible. With the recent changes in the repo, the dragonfly routers have only two events: a packet forward event (instead of separate router arrival and departure events) and a router buffer update event. That alone has brought down the event count of the routers by 40% and the overall simulation event count by ~28%, and it also improves the simulation runtime in optimistic mode. (That's for a small-scale 72-node simulation; we might see more of a difference with a large-scale run having larger messages.)
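(A sketch of what that consolidation looks like at the dispatch level; the event-type, state, and handler names here are illustrative stand-ins, not the dragonfly model's actual identifiers.)

typedef enum {
    R_PACKET_FORWARD, /* single event replacing separate arrival + departure */
    R_BUFFER_UPDATE   /* credit/buffer bookkeeping */
} router_event_t;

static void router_event(router_state *s, tw_bf *bf, terminal_message *m, tw_lp *lp)
{
    switch (m->type) {
        case R_PACKET_FORWARD:
            router_packet_forward(s, bf, m, lp);
            break;
        case R_BUFFER_UPDATE:
            router_buffer_update(s, bf, m, lp);
            break;
    }
}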

Similar squeezing of events needs to be done for the torus. Having only one event per hop would substantially bring down the event population. That's the next thing on the list.

nmcglo commented 8 years ago

Jonathan Jenkins:

The newer versions of the torus and dragonfly models have an updated congestion control system which keeps excessive packets from being issued. This has greatly reduced the memory consumption of both models. Additionally, the MPI simulation layer and the workloads API have been changed (now model-net-mpi-replay) to eliminate unnecessary mallocs that were previously used to hold the workload operation (codes_workload_get_next_rc2).

nmcglo commented 8 years ago

Jonathan Jenkins:

Taking another look at Misbah's solution, there may still be issues, though arguably less severe than before. The GENERATE/SEND loop still happens within 2 codes-local-latency calls. For the event population to remain stable, we'd need to retire the chain of events from a single chunk (i.e., the event in which the chunk arrives at its destination) at roughly the same rate we generate GENERATE events. Since the network transfer time is likely many times larger than codes-local-latency (currently 0.5-1ns), we'll generate more chunk SEND/ARRIVE events than are consumed.

One way to mitigate this in the static routing case is to issue the next GENERATE event from a SEND event when the chunk sender is the originating LP, using next_link_available_time as the time delta. This ensures that the event population is proportional to the network diameter rather than to the number of chunks.
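(A fragment of what that might look like inside the SEND handler under ROSS; next_link_available_time and the ROSS/CODES calls are real, but the message fields, port index, and tmsg type are hypothetical names for this sketch.)

/* In the SEND handler: if this LP originated the chunk and more chunks
 * remain, schedule the next GENERATE for when the outgoing link frees up. */
if (lp->gid == m->origin_gid && m->chunk_id + 1 < m->num_chunks) {
    tw_stime delta = s->next_link_available_time[port] - tw_now(lp);
    if (delta <= 0)
        delta = codes_local_latency(lp); /* link already free */
    tw_event *e = tw_event_new(lp->gid, delta, lp);
    tmsg *m_next = tw_event_data(e);
    *m_next = *m;                        /* carry the message metadata forward */
    m_next->type = MSG_GENERATE;
    m_next->chunk_id = m->chunk_id + 1;
    tw_event_send(e);
}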

nmcglo commented 8 years ago

Jonathan Jenkins:

Misbah found a far simpler way - the pseudocode in the previous comment is too clever by half. We can modify the "N chunks at once" approach into an event loop that issues a chunk SEND event followed by another GENERATE event one codes-local-latency later. Since both are self-messages, the chunk SEND will be processed before the next GENERATE. The messages will be packed very tightly together in sim time, which might result in avalanche RCs in optimistic mode, but we can cross that bridge when we get to it.
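(A minimal sketch of that loop as a ROSS handler, assuming a hypothetical self-message struct tmsg and node_state type; codes_local_latency() is the usual CODES helper that returns a small randomized self-event delay.)

#include <ross.h>
#include "codes/codes.h"

typedef enum { MSG_GENERATE, MSG_SEND } tmsg_type; /* hypothetical */
typedef struct {
    tmsg_type type;
    uint64_t  chunk_id;
    uint64_t  num_chunks;
} tmsg;
typedef struct node_state node_state; /* model state; details omitted */

static void chunk_generate(node_state *ns, tw_bf *bf, tmsg *m, tw_lp *lp)
{
    /* 1. issue the SEND for the current chunk as a self-message */
    tw_stime ts = codes_local_latency(lp);
    tw_event *e = tw_event_new(lp->gid, ts, lp);
    tmsg *m_new = tw_event_data(e);
    m_new->type = MSG_SEND;
    m_new->chunk_id = m->chunk_id;
    tw_event_send(e);

    /* 2. issue the next GENERATE one codes-local-latency after the SEND;
     * both are self-messages, so the SEND is always processed first and
     * the pending-event population stays bounded */
    if (m->chunk_id + 1 < m->num_chunks) {
        e = tw_event_new(lp->gid, ts + codes_local_latency(lp), lp);
        m_new = tw_event_data(e);
        m_new->type = MSG_GENERATE;
        m_new->chunk_id = m->chunk_id + 1;
        m_new->num_chunks = m->num_chunks;
        tw_event_send(e);
    }
}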

nmcglo commented 8 years ago

Jonathan Jenkins:

Off-topic, but adaptive routing in the torus would probably break our model... it's a possibility that the "last chunk" (the one that ends up executing the remote event) isn't the one that arrives last. Should adaptive routing make it into the torus model, we'll need to be mindful of it.

nmcglo commented 8 years ago

Jonathan Jenkins:

An event loop for scheduling across multiple links, irrespective of routing decisions, would look something like this. It's similar to the modelnet sched loop, except that the loop start/stop conditions are different.

state:
  msg_queue
  link_status array (USED, IDLE)
  sched_status (RUNNING, NOT_RUNNING)

new msg event (really, new "packet", but packets are meaningless in the torus at the moment):
  add msg to msg_queue
  if sched_status == NOT_RUNNING:
     send sched next event

sched next event:
  if msg_queue empty:
    sched_status = NOT_RUNNING
    return
  for msg in msg_queue:
    link i = routing path for msg
    if link_status[i] == IDLE:
      send next chunk, update msg in msg_queue
      link_status[i] = USED
      send link idle event for next idle time
      *send sched next event
      *return
  // couldn't schedule anything, shut down until link available
  *sched_status = NOT_RUNNING

link idle event:
  if msg_queue not empty && sched_status == NOT_RUNNING
    sched_status = RUNNING
    send sched next event

The idea here is to keep scheduling until we either clear the queue or can't find an IDLE link that some message is eligible for. In the latter case, a link completion starts the scheduler back up.
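(Rendered as C, the loop might look like the following; the queue handling, the issue_event()/route_msg()/send_next_chunk()/link_busy_time() helpers, the event-type constants, and NUM_LINKS are all stand-ins for whatever the torus model actually provides.)

typedef enum { LINK_IDLE, LINK_USED } link_status_t;
typedef enum { SCHED_NOT_RUNNING, SCHED_RUNNING } sched_status_t;

typedef struct pending_msg {
    struct pending_msg *next;
    /* destination, remaining chunks, etc. omitted */
} pending_msg;

typedef struct {
    pending_msg   *msg_queue;
    link_status_t  link_status[NUM_LINKS];
    sched_status_t sched_status;
} sched_state;

/* new msg event */
static void handle_new_msg(sched_state *s, pending_msg *m, tw_lp *lp)
{
    m->next = s->msg_queue; /* enqueue */
    s->msg_queue = m;
    if (s->sched_status == SCHED_NOT_RUNNING) {
        s->sched_status = SCHED_RUNNING;
        issue_event(lp, EV_SCHED_NEXT, codes_local_latency(lp));
    }
}

/* sched next event */
static void handle_sched_next(sched_state *s, tw_lp *lp)
{
    if (!s->msg_queue) {
        s->sched_status = SCHED_NOT_RUNNING;
        return;
    }
    for (pending_msg *m = s->msg_queue; m; m = m->next) {
        int i = route_msg(m); /* link on the routing path for msg */
        if (s->link_status[i] == LINK_IDLE) {
            send_next_chunk(s, m, i, lp); /* updates/dequeues msg as needed */
            s->link_status[i] = LINK_USED;
            issue_event(lp, EV_LINK_IDLE, link_busy_time(s, i));
            issue_event(lp, EV_SCHED_NEXT, codes_local_latency(lp));
            return;
        }
    }
    /* couldn't schedule anything; shut down until a link becomes available */
    s->sched_status = SCHED_NOT_RUNNING;
}

/* link idle event */
static void handle_link_idle(sched_state *s, int i, tw_lp *lp)
{
    s->link_status[i] = LINK_IDLE;
    if (s->msg_queue && s->sched_status == SCHED_NOT_RUNNING) {
        s->sched_status = SCHED_RUNNING;
        issue_event(lp, EV_SCHED_NEXT, codes_local_latency(lp));
    }
}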

nmcglo commented 8 years ago

Jonathan Jenkins:

At a basic level, what would be required is, instead of issuing all the chunks at once and having packet_generate manage the links from event to event, to issue only the first chunk, and have packet_generate bounce a next-available message (with a time corresponding to that link's next_link_available_time) back to a chunk-issuing loop, continuing to pass message metadata through events. This way, the number of pending events in the pool at any given time is the number of modelnet packets. Since we're eschewing the modelnet packetization, the number of pending events from the sender's point of view is simply the number of concurrent model_net_event calls in flight.

nmcglo commented 8 years ago

Jonathan Jenkins:

More investigation indicates to me that the approach in my previous comment is OK, but only if static routing is used. Essentially, each packet would cause an event loop of "issue first chunk -> send chunk -> callback -> send chunk -> ..." on the outgoing link. Multiple packets going through the same link would naturally interleave.

The problem is if dynamic routing is used (as in Nikhil's uiuc_tracer branch), in which case we can't tell ahead of time where each chunk will be forwarded. I don't have a good solution for this case, but it is clear that the packet issuance loop would become much more complicated. We would need a management event loop that looks through pending packets, determines where they would be routed and whether the link is available, then schedules chunks to a particular link, managing callbacks from the transfer events. Ew.

Not sure how much of this applies to the dragonfly - the terminals are unidirectional, so I'd imagine we can pretty easily convert it to use a chunk issuance loop without these kinds of headaches, regardless of the routing method used.

nmcglo commented 8 years ago

Jonathan Jenkins:

Status changed to closed