shelbyt opened this issue 8 years ago
We could adapt an approach like X-Trace [NSDI '07] or Pivot Tracing [SOSP '15].
@shelbyt the server already keeps track of this at coarser grain---each new state transition is timestamped. Are you trying to do something for which server-side timestamping doesn't work?
@kwantam: I think the idea is that at scale, we're going to need some path-based tracing in order to do performance debugging and introspection into this system. For example, correlating multiple waves of (logically coupled) lambdas, looking for performance outliers, etc.
What you say makes sense, but it's not at all clear to me that what you describe isn't already provided by the timestamping that the coordinator is doing (modulo two edge cases, one of which we can easily fix and one I doubt we'll see much in practice; we can go into these in more detail if you'd like).
Anyhow, to the extent that we need more functionality, we should adopt a solution that doesn't involve adding more special-case functionality to the workers. In fact, my near-term goal (where "near term" is "after the Oakland deadline") is to make the workers even dumber than they are now. (At present the worker code does a couple things behind the scenes that I want to make explicit and more general. Best example: sending state files to neighbors is a special case.)
The above implies that whatever solution we use should involve the coordinator, not the workers.
Hi Riad @kwantam,
I'm not very sure about my understanding, but here are my findings. Please correct me if I'm wrong.
Questions:
Sorry if I'm sounding too lame.
@kvasukib
To answer your questions:
Yes, you can do it. The best way is to add a field to the response message from worker to coordinator that tells the coordinator how much time elapsed while executing a given command. If we do this, we should make it general: do it in `handle_message`, make every function add this timestamp, and then have the coordinator parse it out of each response.
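For concreteness, here is a minimal sketch of that scheme; the message format, field name, and helper names below are made up for illustration, not the actual worker/coordinator code:

```python
import json
import time

# Hypothetical sketch of a worker-side wrapper: time the handler for each
# command and attach the elapsed time to the response message.
def handle_message(msg, handlers):
    start = time.monotonic()
    response = handlers[msg["command"]](msg)   # run the actual command handler
    response["elapsed_s"] = time.monotonic() - start
    return json.dumps(response)

# Hypothetical coordinator-side counterpart: parse the elapsed time back out
# of each response and record it alongside the coordinator's own timestamp.
def record_response(raw, timings):
    response = json.loads(raw)
    timings.append((time.time(), response.get("elapsed_s")))
    return response
```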
But to answer the second part of question 1, no, I don't think it makes sense. As I pointed out above, the error you add by capturing the timestamps on the coordinator instead is almost certainly negligible, and it's not clear what the additional complexity of taking timestamps on the workers buys you.
Moreover, measuring time at the worker makes other things much harder: there is no consistent view of the world across workers (and in particular workers aren't guaranteed to agree on the time), which means that timestamps returned by different workers are incomparable [1]. This means you might still need to take the timestamp relative to the coordinator's view of time (say, if you want to talk about computations where workers act in concert), at which point you've brought back the error term due to network roundtrips. So things get more complex, but we'd have to be very careful or the accuracy wouldn't improve.
Of course, you can throw even more complexity at it by estimating roundtrip times between the coordinator and each worker. But that's crazy---what computations are we running where this level of accuracy matters? I'd guess the answer is "none."
[1] As an aside: the extent to which they're incomparable depends on how paranoid we are. Certainly workers are likely to disagree about the absolute time---we should not assume that it's safe to compare worker A's "time since 00:00:00 Jan 1 1970" to the same measurement on worker B. Do we also think they'll disagree about the rate at which time passes? Well, we're not running a standards lab, so let's say we think we can compare time deltas between workers, i.e., differences of the form "I measured t1 and then measured t2 and the difference is Tdelta".
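To illustrate the distinction in [1]: a duration measured against a single worker's monotonic clock is a delta we're willing to compare across workers, whereas an absolute wall-clock reading is not. This is only a sketch; `run_and_time` is a made-up helper:

```python
import time

def run_and_time(fn):
    # A delta of the form "t2 - t1" on this machine's monotonic clock:
    # comparable across workers (up to the clock-rate paranoia above).
    t1 = time.monotonic()
    result = fn()
    t_delta = time.monotonic() - t1

    # An absolute wall-clock reading: NOT safe to compare across workers.
    wall = time.time()
    return result, t_delta, wall
```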
Timestamps get written out when `out_file` is not `None` in the `server_info` argument to `server_main_loop`. All of the timestamps in that file are relative to the timestamp generated at the beginning of the `server_main_loop` function.
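As a rough illustration of that relative-timestamp scheme (a sketch with an invented log format, not the actual server code):

```python
import time

def server_main_loop_sketch(server_info, out_file=None):
    # All entries are logged relative to this base timestamp.
    base = time.time()
    log = []

    def stamp(label):
        if out_file is not None:
            log.append((time.time() - base, label))

    stamp("loop start")
    # ... drive the state machines, calling stamp() on each transition ...
    stamp("loop end")

    if out_file is not None:
        with open(out_file, "w") as handle:
            for t, label in log:
                handle.write("%f %s\n" % (t, label))
```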
Your questions are not lame at all, they're good questions! I'm sorry I didn't write any documentation :)
I mentioned two corner cases where the current timestamping code breaks or is inconvenient. They are:
When using a CommandListState, there's only one timestamp for the whole state, but generally one uses these states to run a bunch of commands in a row. One can kludge around this by using OnePassStates instead of CommandListStates, but that's a lot of boilerplate. An alternative is to make the CommandListState record a timestamp each time it sends a new command.
But that doesn't quite work, either. The reason is that CommandListStates can be pipelined (though they're not by default). When pipelining is enabled, the coordinator sends all commands at once, and the worker returns responses as each command finishes. This makes timestamping slightly hairier---and in fact it's why the current implementation doesn't support timestamp-per-command.
It's not so hard to work around the above issue and recreate a "non-pipelined" view of the timestamps from the pipelined view; I just never implemented that.
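One way that reconstruction could work, assuming the log records the time the pipelined batch was sent plus the coordinator-side arrival time of each response (names and log shape here are hypothetical):

```python
def per_command_durations(send_time, response_times):
    # In a pipelined CommandListState the worker still executes the commands
    # one after another, so command 0 runs from the batch send until its
    # response arrives, and command i runs from response i-1 until response i.
    durations = []
    prev = send_time
    for t in response_times:
        durations.append(t - prev)
        prev = t
    return durations

# Example: batch sent at t=0.0s, responses arrive at 1.0s, 1.5s, and 3.5s
# -> approximate per-command durations [1.0, 0.5, 2.0]
print(per_command_durations(0.0, [1.0, 1.5, 3.5]))
```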
A generalization of the pipelining issue arises with SuperpositionStates. When the coordinator's state machine is in a superposition of worker states, it's emulating an NFA---in other words, there are many different sequences of transitions that could take the state machine from start to finish. The issue isn't with recording timestamps---there are no race conditions (the transitions are serialized because the server is single threaded), so the logs are consistent---but parsing the logs afterward isn't necessarily so easy.
Again, I think this is surmountable: define a canonical representation of the transitions through the SuperpositionState's constituent DFAs, and convert the recorded timestamps into that canonical representation. But this is, for practical purposes, close to a "don't care": I'm guessing SuperpositionStates won't get that much use. No one seems to like them except me. :)
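For what it's worth, one possible canonical representation is just a per-constituent-DFA timeline: split the flat transition log by which constituent machine made each transition. This is purely a sketch with a made-up log format:

```python
from collections import defaultdict

def canonicalize(transition_log):
    # transition_log: list of (relative_time, constituent_index, new_state)
    # entries in the order the single-threaded server recorded them.
    # Result: each constituent DFA's own ordered list of transitions,
    # independent of how the NFA-style interleaving happened to play out.
    per_dfa = defaultdict(list)
    for t, idx, state in transition_log:
        per_dfa[idx].append((t, state))
    return dict(per_dfa)
```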
To keep track of where time is spent in each lambda, it would be nice to have an additional mode that calls the context object, e.g. context.get_remaining_time_in_millis(), around specific functions/lines. That way we could separate overheads vs. S3 access time vs. actual user-function time and get finer-grained trace information.
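A minimal sketch of what such a mode could look like inside a worker's Lambda handler; get_remaining_time_in_millis() is the standard method on the AWS Lambda context object, but the helper and handler below are made up for illustration:

```python
import contextlib

@contextlib.contextmanager
def timed(label, context, trace):
    # get_remaining_time_in_millis() counts down toward the function timeout,
    # so the drop across a block is (approximately) the block's elapsed time.
    before = context.get_remaining_time_in_millis()
    try:
        yield
    finally:
        after = context.get_remaining_time_in_millis()
        trace.append((label, before - after))  # elapsed milliseconds

def handler(event, context):
    trace = []
    with timed("s3_access", context, trace):
        pass  # e.g. fetch inputs from S3
    with timed("user_code", context, trace):
        pass  # e.g. run the actual user function
    return {"trace": trace}
```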