excamera / mu

Framework to Run General-Purpose Parallel Computations on AWS Lambda

Add mode in lambda_function_template to keep track of time #5

Open shelbyt opened 8 years ago

shelbyt commented 8 years ago

To keep track of where time is spent in each lambda, it would be nice to have an additional mode that calls the context object's context.get_remaining_time_in_millis() around specific functions/lines.

This way we can keep track of overheads vs. S3 access time vs. actual user-function time, and get finer-grained trace information.
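
Something along these lines is what I have in mind. This is only a rough sketch: the phase names, the record helper, and the handler shape are illustrative, not part of the existing lambda function template; the only real API used is the Lambda context's get_remaining_time_in_millis().

```python
# Hypothetical sketch of the proposed timing mode.
def lambda_handler(event, context):
    timings = {}

    def record(phase, start_remaining):
        # get_remaining_time_in_millis() counts down, so the time spent in a
        # phase is the difference between two successive readings.
        timings[phase] = start_remaining - context.get_remaining_time_in_millis()

    t = context.get_remaining_time_in_millis()
    # ... download state from S3 ...
    record('s3_download', t)

    t = context.get_remaining_time_in_millis()
    # ... run the user function ...
    record('user_code', t)

    t = context.get_remaining_time_in_millis()
    # ... upload results to S3 ...
    record('s3_upload', t)

    return {'timings': timings}
```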

gmporter commented 8 years ago

We could adapt an approach like X-Trace [NSDI '07] or Pivot Tracing [SOSP '15].

kwantam commented 8 years ago

@shelbyt the server already keeps track of this at coarser grain---each new state transition is timestamped. Are you trying to do something for which server-side timestamping doesn't work?

gmporter commented 8 years ago

@kwantam: I think the idea is that at scale, we're going to need some path-based tracing in order to do performance debugging and introspection into this system. For example, correlating multiple waves of (logically coupled) lambdas, looking for performance outliers, etc.

kwantam commented 8 years ago

What you say makes sense, but it's not at all clear to me that what you describe isn't already provided by the timestamping that the coordinator is doing (modulo two edge cases, one of which we can easily fix and one I doubt we'll see much in practice; we can go into these in more detail if you'd like).

Anyhow, to the extent that we need more functionality, we should adopt a solution that doesn't involve adding more special-case functionality to the workers. In fact, my near-term goal (where "near term" is "after the Oakland deadline") is to make the workers even dumber than they are now. (At present the worker code does a couple things behind the scenes that I want to make explicit and more general. Best example: sending state files to neighbors is a special case.)

The above implies that whatever solution we use should involve the coordinator, not the workers.

mutablealligator commented 8 years ago

Hi Riad @kwantam,

I am not entirely sure my understanding is right, but here are my findings. Please correct me if I'm wrong.

  1. I know that we take time measurements for each state of the state machine. This is done in libmu/server.py:278: timestamps = [ts - server_info.start_time for ts in state.timestamps], and this is updated as part of the SERVER status on the screen (see the sketch after this list).
  2. For example, we have an UPLOAD state in png2y4m_server.py which uploads an S3 object. The time spent in the UPLOAD state is measured by server.py, and we see it on the screen when the status gets updated. So this UPLOAD time includes the time spent in server.py, the state machine, the lambda, and the S3 write.
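
Here is roughly how I imagine turning those per-state timestamps into per-state durations for plotting. This is just a sketch; state_durations is a hypothetical helper I made up, not existing code, though the timestamps computation mirrors the server.py line quoted above.

```python
def state_durations(state, server_info):
    # Relative timestamps, as in libmu/server.py: one entry per state transition.
    timestamps = [ts - server_info.start_time for ts in state.timestamps]
    # Consecutive differences give the time spent between transitions.
    return [b - a for a, b in zip(timestamps, timestamps[1:])]
```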

Questions:

  1. Can we modify the code to get fine-grained timestamps (just S3 reads and S3 writes) to identify our S3 bottlenecks? Do you think this makes sense?
  2. I know the timestamps get printed on the console as part of the SERVER status. But for us to analyze and plot graphs, what do you think would be the right approach for writing the timestamps to a log file and then using that file to plot the graphs?

Sorry if I'm sounding too lame.

kwantam commented 8 years ago

@kvasukib

  1. Hmm, not quite. The code you're looking at there runs after the coordinator finishes, and it's not writing to the screen; it's writing to a file that stores all of the timestamps for all of the state transitions. The code that writes elapsed time to the screen is on line 175.
  2. Again, no, we don't see it on screen. The transition times for each state machine are written to a logfile. And while it's technically correct that the timestamp is only recording the time of the state transition on the coordinator's end (and thus doesn't capture precisely the time spent, say, uploading to S3), it almost certainly doesn't matter: I looked at this and found that the latter dominates by at least an order of magnitude, so the loss in accuracy is negligible.

To answer your questions:

  1. Yes, you can do it. The best way is to add a field to the response message from worker to coordinator that tells the coordinator how much time elapsed executing a given command. If we do this, we should make it general: do it in handle_message and make every function add this timestamp, then have the coordinator parse it out of each response (see the sketch after this list).

    But to answer the second part of question 1, no, I don't think it makes sense. As I pointed out above, the error you add by capturing the timestamps on the coordinator instead is almost certainly negligible, and it's not clear what the additional complexity of taking timestamps on the workers buys you.

    Moreover, measuring time at the worker makes other things much harder: there is no consistent view of the world across workers (and in particular workers aren't guaranteed to agree on the time), which means that timestamps returned by different workers are incomparable [1]. This means you might still need to take the timestamp relative to the coordinator's view of time (say, if you want to talk about computations where workers act in concert), at which point you've brought back the error term due to network roundtrips. So things get more complex, but we'd have to be very careful or the accuracy wouldn't improve.

    Of course, you can throw even more complexity at it by estimating roundtrip times between the coordinator and each worker. But that's crazy---what computations are we running where this level of accuracy matters? I'd guess the answer is "none."

    [1] As an aside: the extent to which they're incomparable depends on how paranoid we are. Certainly workers are likely to disagree about the absolute time---we should not assume that it's safe to compare worker A's "time since 00:00:00 Jan 1 1970" to the same measurement on worker B. Do we also think they'll disagree about the rate at which time passes? Well, we're not running a standards lab, so let's say we think we can compare time deltas between workers, i.e., differences of the form "I measured t1 and then measured t2 and the difference is Tdelta".

  2. The timestamps are already written to the logfile, as I pointed out above, as long as out_file is not None in the server_info argument to server_main_loop. All of the timestamps in that file are relative to the timestamp generated at the beginning of the server_main_loop function.
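
For what it's worth, here's a rough sketch of what I mean in point 1. The run_command helper, the ETIME(...) field, and the message framing are all invented for illustration; mu's real handle_message and response format differ.

```python
import re
import time

def run_command(msg, state):
    # Stand-in for whatever work the worker actually does for this command.
    pass

# Worker side: time each command and append the elapsed time to the response.
def handle_message(msg, state):
    start = time.time()
    response = run_command(msg, state)
    elapsed_ms = int((time.time() - start) * 1000)
    return '%s ETIME(%d)' % (response, elapsed_ms)

# Coordinator side: pull the elapsed time back out of each response.
def parse_etime(response):
    m = re.search(r'ETIME\((\d+)\)', response)
    return int(m.group(1)) if m else None
```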

Your questions are not lame at all, they're good questions! I'm sorry I didn't write any documentation :)

kwantam commented 8 years ago

I mentioned two corner cases where the current timestamping code breaks or is inconvenient. They are:

  1. When using a CommandListState, there's only one timestamp for the whole state, but generally one uses these states to run a bunch of commands in a row. One can kludge around this by using OnePassStates instead of CommandListStates, but that's a lot of boilerplate. An alternative is to make the CommandListState record a timestamp each time it sends a new command.

    But that doesn't quite work, either. The reason is that CommandListStates can be pipelined (though they're not by default). When pipelining is enabled, the coordinator sends all commands at once, and the worker returns responses as each command finishes. This makes timestamping slightly hairier---and in fact it's why the current implementation doesn't support timestamp-per-command.

    It's not so hard to work around the above issue and recreate a "non-pipelined" view of the timestamps from the pipelined view; I just never implemented that. (A rough sketch of what I mean appears at the end of this comment.)

  2. A generalization of the pipelining issue arises with SuperpositionStates. When the coordinator's state machine is in a superposition of worker states, it's emulating an NFA---in other words, there are many different sequences of transitions that could take the state machine from start to finish. The issue isn't with recording timestamps---there are no race conditions (the transitions are serialized because the server is single threaded), so the logs are consistent---but parsing the logs afterward isn't necessarily so easy.

    Again, I think this is surmountable: define a canonical representation of the transitions through the SuperpositionState's constituent DFAs, and convert the recorded timestamps into that canonical representation. But this is, for practical purposes, close to a "don't care": I'm guessing SuperpositionStates won't get that much use. No one seems to like them except me. :)
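
For concreteness, here's a rough sketch of the "reconstruct a non-pipelined view" idea from point 1. The record format is invented: it just assumes the coordinator logged one timestamp when the pipelined command list was sent and one timestamp per response, in order.

```python
def per_command_durations(send_ts, response_ts):
    # Pair each response timestamp with the previous one (or the send time for
    # the first command). Since a pipelined worker executes the commands back
    # to back, consecutive gaps approximate per-command runtimes.
    starts = [send_ts] + list(response_ts[:-1])
    return [end - start for start, end in zip(starts, response_ts)]
```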