What is the Problem Being Solved?

Distributed systems make it notoriously difficult to get a picture of what's going on. While debugging one program over here, you'd like to know what's happening to that program over there, and what messages are being exchanged. You'd like to be able to run a debugger against this one component, without pausing the entire world. We add console.log messages to trace code we've written, delete it when we're done because the output is too noisy, and then wish we'd left it in place when a similar problem pops up a month later. We want structured logging to record more than mere strings (and use proper analysis tools rather than just grep). We sometimes want statistical logging, where the time spent in each operation is interesting but we're willing to decimate or average out data if it becomes too large. But much of the time we want complete logging, where every message is important.

We've all worked on systems with these sorts of problems, and we've all built or adopted logging systems to address them. I'd like to collect our favorite ideas and tools here, with an eye towards building a Grand Agoric Logging System. Our platform is fairly unique in that we have deterministic execution of inbound messages. So if we log the right things, we should be able to e.g. refer to a message that was delivered on the chain last week, fetch the receiving vat's previous state from a number of chain explorer databases, check that state for accuracy against validator announcements, rebuild a local vat to the exact same JS heap, put it under the control of a debugger, and then re-deliver the message to see what happens. Like time-travelling debugging except better.

This ticket is to colect ideas and systems. Please attach comments with presentation-like writeups of:

logging systems you've used in the past: what was good about them, what was bad
features you've been grateful to have: how did it help you debug that one really tough problem?
features you've wanted but never found a tool for
debugging techniques you've used that would have been improved with better logging tooling

Foolscap

Foolscap (home page and github project) is a remote messaging system I built in 2006, inspired by CapTP/VatTP and ocap principles. It is written in Python, which is a bit too permissive to really support ocap confinement, but of course I found that following the rules anyways led to better partitioning of functionality and code that was easier to read. The networking stack was based on Twisted. The primary customer was the Tahoe-LAFS distributed+encrypted storage system (which I co-authored in 2008), although it was overkill and Tahoe is now moving to a simpler system.

In Foolscap, every remotely-accessible object had a "FURL", the Foolscap URL (SturdyRef-as-data), which was shaped like a web-key: pb://TUBID@CONNECTION-HINTS/SWISSNUM. Foolscap provides arbitrary-object-graph serialization of arguments and return values, TLS-based transport security (the FURL's TUBID is a hash of the expected TLS certificate), and a Promise-like eventual-send mechanism based on Twisted's Deferred (but lacking promise-pipelining). The Foolscap "Tub" was like an E "Vat", and hosted the main APIs. Programs could either connect to a FURL once:

def gotRef(remoteReference):
  remoteReference.callRemote("hello", arg1, arg2).addCallback(handleResponse)
tub.getReference(FURL).addCallback(gotRef)

or they could use a reconnect-on-failure tub.connectTo(FURL, connectHandler) API to maintain a connection despite interruptions.

Foolscap also includes a logging system with a detailed philosophy document. I wrote this to support several aspects of Tahoe:

A UI request (like "upload this file") might provoke a hierarchy of lower-level operations, many of which involved messages to remote systems. To upload, we must find active storage servers, query them for available space, encrypt segments of the file, erasure-code the ciphertext, upload the shares to selected servers, confirm arrival, and finally return a cryptographic "filecap" to the caller. I wanted to be able to correlate the lower-level operations and the higher-level requests.
Tahoe did a lot of work for each file, and didn't transfer as fast as we would have liked. Many operations could be optimized if we knew how long they took. Some of the logged data included runtimes, and the status/results UI could show what speeds the various phases (encrypt, encode, push) of an upload were achieving.
The code had extensive invariant checks, which were occasionally violated. When something went wrong, we really wanted to know what was the higher-level operation that provoked it. The exception raised was the last page of a tale of woe, and to fix things we needed to know the path which led up to that moment.
Tahoe-LAFS was an open-source product of a commercial backup startup, and we ran a fleet of managed storage servers (each running the same code an individual user might run). If something funny happened on one of those servers, we wanted to know about it.
We had some known bugs which occasionally triggered warnings, but were sufficiently low-priority that we didn't get around to fixing them. Part of the logging system could classify these "Incidents" (according to filters that we wrote once we identified the known-bug), and set them aside (with a counter). We could get flagged if any new incidents occurred, or if the rate of the known problem rose above a threshold.

The Logging API

Foolscap's core logging API was a simple function log.msg(format, level, parent, **details), where everything was optional except the format string (**details is Python syntax for a collection of keyword arguments, as a dictionary). On its own, this recorded (format, details) into a JSON-serializable record, so log.msg("uploading %(filename)s to server %(servername)s", filename="foo.txt", servername="a74") saved enough information to be processed by code later (if msg["servername"] == "a74": ..), or (if really necessary) render the message as a string (uploading foo.txt to server a74).

The log.err API could be used as the errback handler in a Deferred chain (like the second argument to promise.then), and did useful things with the Error object and stack trace it received.

Each message had a "facility" (topic) and "severity" (an ordered scale of importance, ranging from NOISY and OPERATIONAL, up to UNUSUAL, INFREQUENT, and CURIOUS, and topping out with WEIRD, SCARY, and BAD). Each facility+severity tuple got its own circular buffer of messages. Foolscap retained a few hundred messages in each such buffer. Each severity level has an associated recommended event rate in the docs: NOISY was expected to be used for things that emit tens or hundreds of messages per second, while BAD might only happen once per process (because it records a fatal error).

If, at any time, a message was logged at level=WEIRD or higher, an "Incident" was declared. The contents of all circular buffers were bundled together into an "incident tarball", and stored to disk for later retrieval. The idea was that the infrequent events (UI requests) happened infrequently enough that the most recent one would still be in the buffer, letting us know what started the tale of woe. The more-frequent events (low-level storage operations) would not be preserved for very long, but we probably only cared about the last few before the problem.

The Tahoe client had a web-based UI. We added a "Report Incident" button to that UI, which simply took a string from the user and logged a WEIRD-level event. Our "how to report a problem" user docs pointed folks at this button, and instructions to locate the incident tarball and email it to us. This helped a lot. We did take some care (although probably not enough) to avoid logging user secrets, with some wrapper classes that knew how to censor their contents when passed to the logger.

Each log event was given a global numerical index (total ordering across the process). This lognum was returned by the API call, and could be passed into other calls (as the parent= argument) to create a parent/child relationship between the two:

event1 = log.msg("starting upload")
log.msg("calculating filesize", parent=event1)

A common pattern was to curry the parent into a log function given to a child object. The ShareFinder object used during a download would be created with a log function that had the download log event as its parent. The network messages sent to servers to find those shares were marked as children of the ShareFinder parent. This resulted in a locally-causal hierarchy of log events. Our log viewer had expander triangles to let you drill down into the portions of the log that were useful.

(Incidentally, returning an identifier from log() proved to be the most unusual feature of the Foolscap logging API. Python's stdlib logging library, influenced by Log4J, had plenty of options for customization, but none of them could return values back to the caller. I would have preferred to enhance an existing framework, rather than building my own, but I was unable to build the parent-child-relationship feature in any of the ones that I found.)

We did not have a full Causeway-style causality graph: each log event had at most one parent. We didn't encounter a strong need for it in Tahoe. It wouldn't have been hard to add, however building a viewer for it would have taken more work.

When the logging system started, it logged a special event that included the application version, PID, starting timestamp, and other machine-specific data. Each instance of Foolscap also had an "Instance ID" to help distinguish it from other processes (e.g. to resolve ties on the "who is the leader" network-level negotiation), and to detect multiple connections to the same peer (so we could shut down all but one). This data was used to scope the lognum, so aggregates of events from multiple processes could be teased apart again.

It took a year before a busy server ran long enough for us to discover that our lognum was limited to a 32-bit signed integer (the itertools.count() docs claimed it used a BigInt, but the implementation disagreed).

We established a convention of adding a short random string as a umid= detail argument to each message. We felt it was too expensive to have each log.msg capture the Python stack and include it in the message, so instead we merely stored a random identifier. If we were ever uncertain as to what code emitted the message, we just grepped the source tree for the umid. I had an emacs macro to generate and insert a 6-character string and I never noticed a collision.

log.msg("negotiationFailed: DuplicateConnection", level=NOISY, umid="XRFlRA")

Remote Access

A single line of code would enable the "logport": a pub/sub object exposed by the logging system. When a Tahoe client or server was running in a base directory BASEDIR, the file BASEDIR/logport.furl contained the FURL of its log port. A client (named flogtool) could connect to this port (speaking Foolscap) and subscribe to hear about all log messages, or a subset filtered by facility or level. The tool accepted either a FURL or a filename containing one, so flogtool tail ~/tahoe/logport.furl was a quick way to look inside the application and see what was going on. A flogtool --catch-up option would dump the contents of the circular buffers before emitting any new messages. The tool could save an event stream to disk (to a "flogfile"), dump a previously-saved flogfile, launch a web viewer to examine the data, or apply grep-like filters and emit a new (smaller) stream.

The client could also subscribe to Incidents, or retrieve previously stored Incidents.

We built a "log gatherer" service, and configured our fleet of storage servers to tub.connectTo it (via a FURL) at startup. In the most verbose mode, the gatherer would then subscribe to all log events and record them all in a database. The normal mode was for it to retrieve only the Incidents, and pass them to the classification function.

I grew really fond of being able to "plug in" to the logport in a running process and find out what it had been doing for the last few minutes. The logport FURL was persistent (the process read an existing file before generating a new random swissnum), so once I'd cut-and-pasted the flogtool tail FURL command into my local shell, I could re-launch it later with ease. The client reconnected automatically, so I could leave the tool running while I bounced the application it was talking to.

Managing the Attention Budget, Retroactive Interestingness

Foolscap's logging system derives from my experience with embedded systems, where crashes are infrequent and hard to debug. I once worked at a company whose product (a router) crashed in a datacenter in Europe. We asked the customer to install some debug firmware which had more logging enabled, in the hopes of learning enough to diagnose the problem. They were unhappy to hear us suggest that we wanted to make it crash a second time, whatever the reason. As the machine was on a private network in this datacenter, we couldn't log into it directly. We had one engineer on-site reading hexdumps over the phone to a team in the US (and that team was sleeping in a conference room during the day so they could be awake on EU hours). I vowed, never again.

Debuggers are great, if you can use them. But sometimes you can't easily reproduce the problem, or interrupt a system long enough to step through it. In those situations, you want to record enough information to figure things out afterwards.

When we're deep in the debugging process, and we have identified a number of interesting points in the code, where knowing the state of a variable (or merely whether it made it that far or not) is vital information, we have built up a lot of mental context about the system. We've done a lot of careful thinking, and the output is a set of maximally-useful log() calls. Then, once we've solved the problem, those log messages are distracting noise, preventing us from seeing the most useful data for the next problem. We usually then delete those log() calls, or maybe comment them out if we want to save at least some of the work we did.

But that's discarding a lot of work. It would be better to suppress the output, but retain the location and arguments of those calls.

There are a variety of costs which leads us to remove or disable log() statements:

code readability: excessive logging calls obscure the code that does something useful
human distraction: too much output to sort through
performance: CPU time spent generating the message, maybe rendering it into a string
memory: circular buffers take space
disk: writing to disk takes time and space

and the amount of detail we retain influences the cost. Recording a full stack trace of an unexpected exception is helpful, especially in a utility function that might be called by dozens of unrelated callers. But many invariants happen in functions which are only ever used for one purpose, so knowing who called it doesn't add very much information. Recording the arguments of each stack frame might be far more valuable, but gets expensive quickly. Foolscap's log.msg() has a stacktrace=True option to enable stack trace capture on specific calls.

Logging systems like Log4J typically have control knobs to set generation/recording thresholds, set through an environment variable or some other API. This allows no-longer-helpful log statements to be retained in the code (saving the programmer effort that was spent in deciding where to put them), and enables them to be re-enabled with a relaunch (without a recompile, and without editing the source to uncomment the log calls), but reduce the costs of human attention, disk space, and some portion of the CPU performance (there is still some cost to deciding to not log an event). But if something goes wrong, you still have to re-run the process (this time with logging enabled), and find a way to reproduce the failure.

It seems useful to having multiple thresholds here. If the main cost is human attention, then simply hiding the messages is sufficient to avoid it. If the messages are recorded in a place where they can be easily retrieved later, then we don't need to be afraid of suppressing their display. If disk is the expensive thing, then storing messages in RAM for a few minutes, and only copying them to disk if something unusual happens afterwards (making them retroactively interesting) can reduce that cost while not entirely losing the data. During early development of a feature, we would record (and maybe display) everything. After deployment, we want to display nothing (leaving the console available for actual UI) and be conscientious in how much we store.

When we're debugging a problem, we have a hunch that some otherwise-boring variable in fact is critically important. In normal operation, we wouldn't bother recording its value. The interestingness of that variable depends upon what happens later. Foolscap's circular buffers formed a nice compromise of storage vs utility. I added intermediate severity levels (CURIOUS, WEIRD) to capture events that maybe didn't indicate an unrecoverable problem, but if something did go off the rails shortly afterwards, they might provide a significant clue as to the cause. In Tahoe we used these to report integrity-hash violations, which usually indicated a malicious/faulty storage server that was feeding us corrupted data. We reacted by looking for a better share elsewhere, but if all shares were corrupted, that indicated a problem in our decoding/decryption logic. By recording the earlier suspicious events, we had more to work on if/when the full download failed.

Google's logging

Google has an incredibly sphisticated and complex logging system that pervades everything, and whose output is used for many purposes beyond logging. I'm sure I never used more than 1% of the functionality, but I was exposed to some parts because the Transparency Report used parts of its output for reporting on internet traffic.

The important bit of background is that the system developed the way it did because of an early decision to handle billing for ads based on the data in the log files. This led to the logging system turning into a production-critical system with requirements based on marketing, legal, and every product line in addition to debugging needs.

The basic format is that log messages are a protobuf with probably on the order of a hundred fields, which can't be modified lightly because many different products and tools rely on it. There are several generations of Google's database and analysis tools designed around it because all the corporate data is there. There are processes and controls in place to limit and track who has access to which fields, to redact certain info for some uses, and to sunset sensitive data while retaining data that might be needed for the long term. Just requesting new access to existing data puts your project at the back of a queue which might be 6 months behind.

Even though the data is known by all parties to be business critical, teams do still make changes to the data they record, and they have no way to find all the tools that might be dependent on them. Transparency Report lost data a few times while I was on the team because someone announced an upcoming change on a list we didn't have access to.

Two lessons I take away are:

debugging should use a different channel than production data
formats should be simple, to reduce coupling between multiple producers and consumers

Log4J

Java's log4j has served as the template for several later logging systems in Java and other languages. There are some features I want to highlight.

At a high level that not everyone notices, log4j defines a polymorphic API for producers and consumers of data. This makes it possible to have config files that specify that different levels of logging from different sources are sent to persistent logs, circular buffers, network devices, or other configurable outputs defined later.

The normal usage of log4j is that in each class, a distinct logger would be created that adds an identifier for the class to the log record. Since java classes and packages have a hierarchical name space this made it simple when configuring log destinations or grepping throgh logs to restrict your attention to a particular class or package, but less easy to focus on a particular task that crossed multiple packages. I don't know of a design that makes it straightforward to identify a group of log message sources that one wants to enable, other than by containing file.

The general approach in log4j is that you declare a logger within a class file, and logging there is done by individual log requests that specify the level, in a way that's now familiar in javascript.

The general logging interface doesn't specify that the data is text descriptions, so in Zocalo, I was able to use it to log structured data about different kinds of events. This differed from the Google logging described earlier, because each record was designed for a particular usage, and was intended to be handled by tools that knew precisely what data to expect.

This discussion is Javascript-specific. I don't want to invent a whole new system, but rather take the best-practices and tailor them to our needs.

"Focus"

When I'm developing something, whether it is a new Zoe contract or a Golang feature, I want to see all the messages corresponding to a specific context. I like the convention I've seen in other JS projects, of setting a $DEBUG environment variable. The convention is that "tags" that can be specified to $DEBUG are colon-delimited tags separated by commas:

DEBUG="SwingSet:kernel,agoric:*" ./run-the-program

With the idea that :* can be used to select all the subtags. That environment variable determines what is written to stdout/stderr.

Documenting which tags are in use is important, and gives people a simple way with minimal tooling to dump the logs that they know they need, or that we recommend to them.

Collecting logs

I like loosely-coupled systems like FluentD for aggregating and shipping the log output of a program to different backends. The bottom line with these kinds of tools is that a structured format (like JSON) is used for each message, and can thus be dispatched according to its contents.

slog(...) vs. console.*

IMO, it is important to distinguish between logs that need analysis, and simple messages that are clues to normal users. The latter is a subset of the former. The slog(...) API (whatever we settle on) is needed to provide fine-grained, detailed structured messages, and serialise them in a format that can be parsed and easily analysed, such as JSON-lines (suitable for jq -s).

JS console, on the other hand has the following standard methods (here I describe them in increasing level, with my understanding of their usage):

console.debug(...): usually uninteresting, can be enabled with, say, a DEBUG=... setting.
console.log(...): available mainly to print out stuff that a developer using printf-style debugging needs to see.
console.info(...): information important enough to tell the user, still stdout
console.warn(...): an abnormal situation that needs to be seen, stderr
console.error(...): a situation that prevents the correct execution of the program, stderr

All of those console methods should be dependent on context (so that a new console object is defined for a new context) and also call slog(...) with the context information to ensure the same messages are available in a structured format as well.

Besides the DEBUG=... flag for printing human-readable messages to stdout/stderr, I would also suggest a command-line flag such as --slog=./myfile.log (default: the moral equivalent of /dev/null) to control where the structured logs (rather than the human-readable form) are sent to use. For daemon programs like ag-chain-cosmos, it may make sense to set (or default to) DEBUG= (i.e. empty environment variable for no human-readable output at all) and --slog=/dev/stdout or somesuch.

These sensibilities are reflected in the simple implementation I have for a contextual console in the current agoric-sdk based on Anylogger. We don't need Anylogger to do this, but I would like to see the $DEBUG UX evolve and not be discarded.

Spans

@kriskowal probably has more to say about this, but like @warner's parent= flag, it is important to me to be able to identify the initiator hierarchy of a message. The extreme and upcoming-standard for doing this in a backend-neutral way is https://opentelemetry.io (JS implementation at https://github.com/open-telemetry/opentelemetry-js#implementation--sdks)

And now for something completely different...

Take a look at Rookout if you want to see something cool, but I'm not sure how to implement it securely. They use your source code, and implement "non-breaking breakpoints" that give you information on execution without interrupting or restarting the program. This reminds me of GDB's "noninvasive mode" that was useful during Hurd development (i.e. attach to a program and browse around without interrupting execution).

My takeaways

Always define a context-dependent console with the debug/log/info/warn/error methods, and allow that output to be controlled with $DEBUG. This is POLS (principle of least surprise) for JS developers.

Enable structured logging with a command-line flag, and punt to other systems for filtering, aggregation, and analysis. The different systems to which people want to send their logs is much more diverse than the ones you can anticipate.

Support distributed tracing/spans. I don't have experience with this, but I think it's useful.

bump

I just updated my presentation above. Now it's the way I wanted it.

structured logging in a clinical data warehouse ETL process

I lead development of an ETL process where each month, we take a hospital's worth of patient data and put it into a star schema for ad-hoc query. The process ran for ~40 hours back in 2011 or so; now it's up to a couple weeks.

structured_logging.py is what I came up with.

events form a tree; each has a serial number plus the serial number of its parent
- I made a tool to render it as a collapsible HTML outline with items sized to (log of) running time
each event (SQL script, SQL statement, ...) is logged before (marked with DEBUG) and after (INFO) with duration
detailed logs are formatted as CSV so you can use a spreadsheet or DB later
a coarser line oriented log goes to stdout, i.e. the Jenkins log.

Our ETL process overall is showing its age... I gather normal best practice in this area is a big pile of stored procedures... but when I think of switching to something that lacks this sort of log, I get cold feet.

GROUSE ETL (70B facts) and eliot

The GROUSE ETL process deals with an order of magnitude more data than the above process. I had it to do over again, and I came up with eventlog.py. JSON rather than CSV this time. Man, jq is cool!

I also made the output compatible with eliot; I'm not sure how much of a win that was.

how not to use log4j

The ad-hoc query system that we feed it into is i2b2. It uses log4j, but in a decade of use, nobody in our shop has managed to find the right level of logging. In normal production mode, it doesn't log each SQL statement. We found a DEBUG log level that does log every SQL statement, but when we enable it, it produces so much output that it overwhelms the enterprise log aggregators.

I don't know enough about logging systems to have anything to contribute, but I'd love to learn more. :)

As @michaelfig mentioned, I had some experience with span logging/tracing at Uber. Our transport protocol (TChannel) provided affordances for distributed context propagation and tracing at the protocol layer. The design was based on Google’s Dapper paper. The working groups for the API was called OpenTracing. The system for collecting and querying spans is called Jaeger. I believe OpenTelemetry is the next iteration on the design. Mistakes were made. I’ll describe an idealized version of what we did.

Distributed context propagation is the lowest layer of the system. This is a set of conventions for copying a namespace of inbound request headers into the in-memory context token, allowing those contexts to branch in memory, allow functions to read them from context or append entries to child contexts, and then copying them to outbound request headers. So in a call graph with three “services” A calls B calls C, A can send a header to C without modification to B. This namespace is supposed to be carefully centrally managed because of its potential to magnify performance problems.

The context can then carry trace headers, which are large random numbers for coördination-free collision-avoidance. These are the "trace", "span", and "parent". The scope of a span covers the entire duration of an operation on a given host, from request to response. Every request inherits the "trace" without modification from its context (or contrives one initially). Every request's "parent" is the "span" from its context.

There are a lot of ways to do trace collection and they all have exotic ways of failing at scale. I don’t have a great deal of insight into how that team kept the machine running. It was a combination of push and pull.

We used a combination of ELK (ElasticSearch) for application logs and Jaeger. We didn’t use Jaeger for logging as often as we could have because the ELK infrastructure was omnipresent. We did enable application logs to cross-reference Jaeger by adding spans to logs.

Regarding sampling, trace headers also carried a “the user is looking” bit that prevented a trace from being downsampled in the aggregator. This is pretty important to ensure that the system works for both the “we want latency statistics and automated root cause analysis” crowds and the “we want to debug distributed systems” crowds.

I think it would be really darn neat if the trace ID were treated as a sample priority figure. That way you could record a weighted running average of the lowest priority trace that got sampled and use that to extrapolate around downsampled logs and you could use the same figure coördinate load-shedding. A problem with load shedding in general is that if every call to service A generates 10 calls to service B (very common aggregation pattern), service B will shed load randomly so the probability of requests to A succeeding diminishes exponentially as load approaches a bottleneck on B. Adding a priority queue and shedding load based on the trace ID would mitigate that pathology.

With that in mind, I think that trace aggregation should use a pull model. The application agent would gather traces and log entries in memory, dropping traces with lower trace numbers to make room for higher ones.

For logging in general, Akshay Shah on my team wrote Zap, which has become wildly popular. The key insight is that the logger needs to strike a balance between performance and developer experience. Log sites need to be nearly free when their level or topic isn’t being observed. They need to be cheap even when they are relevant. In Go, we were able to avoid allocation entirely on the hottest paths of the networking stack because the logger didn’t allocate intermediate objects and just wrote JSON from memory.

We saw a lot of cascading failures. One common mode was “system that is fast on success and slow on failure (more logging, errors larger than responses, and retries without sufficient or coördinated load shedding) is a recipe for falling hard.

Dapper https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf OpenTracing https://opentracing.io/ Jaeger https://www.jaegertracing.io/ Zap https://github.com/uber-go/zap

Second to Michael’s DEBUG topics ⨉ levels. That’s strangely more common in JS than Go.

Causeway

logging distributed communicating event loops, for post-mortem distributed debugging.

First, I'll gather all the relevant links here before I try to explain anything.

Causeway references

Causeway has been used from four language:

E
JavaScript
Java/Waterken
AmbientTalk

Original Causeway for E

Causeway: A message-oriented distributed debugger The original paper by Terry Stanley, Tyler Close, and Mark S. Miller
Screencast of interactive session
The E Tracing API
Causeway at erights wiki
Causeway log format

The stack format in our tc39 error-stacks proposal is derived from the stacktrace portion of the causeway log format.

Causeway for JS

Ported, and with awesome new visualizations of essentially the same log format.

Click around on the interactive demos, which still work:

Explanations of demos

Magic Wormhole timing logging

The discussion about "spans" reminded me of a scheme I added to investigate protocol timing in magic-wormhole. I wanted to know when various messages were sent and received, and how long we spent waiting for compute-bound tasks (the cryptographic PAKE protocol), network-bound tasks (round trip times), and user-bound tasks (waiting for code entry, or for permission to proceed).

Magic-wormhole involves two user clients and a central mailbox server. The two sides are not necessarily connected at the same time, and all connections will be reestablished if/when they are lost. The protocol requires about a dozen messages overall.

The API consists of a DebugTiming object available everywhere, which has an .add(name, **details) method. For one-shot events (like "user has finished typing in the code"), this is called like self._timing.add("transit connected"). But this method returns an "Event" value to which we can add more details, or declare that the event has finished:

t = timing.add("encryption begins")
ciphertext = encrypt(plaintext)
t.detail(size=len(ciphertext))
t.finish()

The Event records a start timestamp when it is first created, and a stop timestamp if/when t.finish() is called. All details are recorded in a key/value dictionary.

The Event also knows how to implement Python's "context manager" protocol, which was first added to file objects so they could automatically be closed when the operation that used them completed. The context manager protocol replaces:

try:
  f = open(filename)
  data = f.read()
finally:
  f.close()

with:

with open(filename) as f:
  data = f.read()

For Event, this lets me record the duration of an event like this:

with timing.add("encryption begins") as t:
  ciphertext = encrypt(plaintext)
  t.detail(size=len(ciphertext))

The Timing object is always created, and all protocol steps add data to it, however the data is discarded unless the user added --dump-timing FILENAME.json to the command line. If you do this on both sides, and then bring the two files together, you can run a dump-timing.py script on it, which merged the traces and started a small web server that hosted a D3-based zoomable event viewer.

Event Viewer

The UI isn't great, because my web skills are pretty poor, but it was enough to let me see where we were spending time. The fully zoomed-out view looks like:

Screenshot_2020-07-21 Timeline Visualizer(1)

Time starts on the top and progresses downwards, with '0' being the first event on the sender's side. The left column is for UI events on the sender (including how long it takes to import certain large libraries, and perform expensive cryptographic operations). The middle-left has a "sender tx" column (for when the sending client transmits a message to the mailbox server) and "sender rx" (when the sending client receives a message from the mailbox server). The middle-right has matching columns for the receiving client, and the far right has receiving client UI operations.

The middle lanes represent when a message is received by the server, and when it emits the response. The event viewer cycles through the six available lanes for each successive message, to keep the arrows from overlapping too badly. The colors are rotated too.

Each time a client sends a message to the server, the server records the arrival time, then immediately emits an ACK that includes both the message and the arrival time. This lets a client know how long it took for their message to arrive. If I zoom into the first few messages transmitted by the sender:

Screenshot_2020-07-21 Timeline Visualizer(2)

we can see that the server sent welcome to the client unprovoked (the "welcome" message is sent as soon as the connection is established), then the client sends "bind" and "allocate", which are received by the server about 20ms later (the one-way transmission delay, implying an RTT of 40ms). In response, the server sends "allocated", which provokes the client to send "claim", etc. The first phase of the crypto takes 9ms, and starts just after the "claim" is transmitted.

The "pake" message (blue arrow at the bottom) is the first that is broadcast to both clients (all previous messages were consumed by the server, but the server's real job is to facilitate the exchange of messages between the two clients). The server receives this first "pake" message about 20ms after it is sent, and immediately broadcasts it to all connected clients (which, so far, is only the sender: the receiver hasn't even launched their application yet), which takes another 20ms. The receiving client won't get "pake" for another 4.5s.

The receiving client is launched about 1.5s after the sender, and spends the next 5 seconds waiting for the user to finish typing in the code. The user hit "tab" to complete the nameplate portion of the code twice about two seconds in: both times cause a "list" message to be sent to the server, and got a "nameplates" response back. About four seconds in the user finished with the nameplate portion, which allows the client to send "claim" and "release" while the user continues to type in the secret second half of the code. The sender's "pake" message is received a moment after "claim" is processed (at this point the server has both subscribers and will forward messages immediately instead of queueing them for later). When the user finishes typing in the code, the key exchange can complete (labelled "pake2 (crypto)"). This was completed at t=6.5s, at which point the two sides started to communicate in real-time:

Screenshot_2020-07-21 Timeline Visualizer(3)

The receiving client (on the right) sends its own "pake" message (orange), followed by an encrypted "version" message. The receiving client appears to need 50ms to get a message to the server, while the server's response returns in perhaps 5ms. This strongly suggests clock drift between the different machines, as the two directions ought to be symmetric (barring one-way congestion). This version of the protocol doesn't try to measure or correct for the clock skew, and instead relies upon good NTP sync.

Once the sending client receives the "pake" message, it knows the mailbox is now open, and no longer needs the nameplate any more, so it sends a "release" to the server (which provokes a "released" response 40ms later), and begins the second phase the PAKE computation, which takes about 15ms. Once that is complete, it can send the encrypted "version" message, and then finally the first encrypted channel message (labelled "0", in magenta), followed by another ("1", in brown), which offer the encrypted file to the receiver.

The receiver then spends about 600ms waiting for permission from the user:

Screenshot_2020-07-21 Timeline Visualizer(4)

Note that message "0" was transmitted by the receiving client at t=6.75s, but was not received by the server until t=7.37s. This indicates that the client began a blocking read of stdin (waiting for the user's approval) before allowing the transmit function to complete, losing some parallelism. In this case, there was nothing to lose, because the sender couldn't start transmitting until it received permission (contained in message "1", orange), but this chart shows clearly that something wasn't working as expected.

The actual file transfer starts with the "transit-connected" marker, and ends with "exit". The offset between the sender's and receiver's notions of when transit connected is further evidence of clock skew. It appears that the receiver's clock runs about 80ms fast.

Fixed diagram

I fixed my clock skew problem, and now this diagram shows the expected symmetry between send-to-server and receive-from-server delays:

Screenshot_2020-07-21 Timeline Visualizer(5)

Note that the orange "pake" message and the green "version" message are sent from the server about 2ms apart, and arrive on the receiving client about 2ms apart, but on the sending client they arrive about 10ms apart. What's happening is that the "version" message arrives in the kernel at the expected time, but the (single-threaded) process is busy with the "pake2 (crypto)" step, and doesn't read from the socket until that is complete.

Utility for the Agoric stack

I don't think we'll be looking at millisecond-range timing charts, but the general UI of showing messages flowing between machines seems like a good jumping-off point for more sophisticated tools. If these arrows could be labelled with popups that describe the method name being delivered, and if the response message could be visually linked to the inbound message that caused it, this might be a good way to see where pipelining stalls are happening. The ideal picture has a thick bundle of parallel messages all travelling together, with a final ACK or response message coming back in the very next block. The unnecessarily-slow picture has a lot of individual back-and-forth arrows.

Being able to see when pipelining does and does not happen may be the best way to cure overuse of await or then.

A bizarre implication of replay-based time travel is one can insert or enable instrumentation in the code during replay that was not present or enabled during the original play --- as long as that instrumentation is completely effect free. debugger; statements, conditional debugger statements with effect-free conditions, and of course effect-free logging, if we have a way to exempt the log itself from the notion of the official computation being replayed.

Oh, neat. Now I'm imagining a lint checker that compares the source which was run on-chain against an instrumented version that a developer submits for local debugging replay. The linter could recognize logging statements and strip them out; if the resulting AST is the same as the one that created the transcript, the replay machine would accept it without complaint. And of course the replay machine will always just run the proposed version first and see if the transcript still matches; if so, any differences are still in the heap and haven't yet provoked externally-visible behavioral changes.

CAL (PayPal's logging system)

PayPal had (and probably still has) an extensive logging and analytics system called CAL (I forget what the name stands for, if I ever knew). It was awkward to use and ugly to the bone, but what it could do was actually pretty amazing. I ended up exploring a lot of its ins and outs early in my tenure at PayPal as it proved a useful tool for understanding what all the extended pieces of the PayPal system were and how they related to each other. PayPal's backend was highly (arguably excessively) service-ified -- in 2013 something like 1,100 service deployments, exposing around 12,000 distinct service endpoints. The overwhelming majority of these endpoints were internal: services whose clients were other services in the system and not themselves open to the Internet as a whole.

Every external HTTP request or response, internal service call or response, and database SQL query were logged in CAL, as were any errors that resulted while servicing those calls. These log messages were generated automatically by PayPal's various inter-service messaging and database libraries, but the services themselves could additionally generate log messages of their own with whatever content the services' developers felt appropriate. All these services ran on ~40,000 machines, with each machine feeding its log stream into CAL's data warehouse, which collected it all and organized it for reflection back to the people in the organization. In addition to request parameters, time stamps, source addresses, and all the other usual stuff that logging systems typically log, each request originating from outside the system -- typically HTTP requests from a customer's web browser or mobile app, but also web applications used by PayPal customer service reps and administrators -- was tagged at the ingress point with a unique identifier called a "correlation ID" that was then propagated downstream in the metadata associated with each service request that was subsequently generated in processing that original inbound request. In addition, customer interactions (such as payments) that involved multiple choreographed interactions between the customer's browser and the PayPal backend were assigned a unique transaction ID that the browser tagged onto the message traffic in the same way. All the various request data was laundered by the logging libraries to remove any customer PII or confidential financial information, so, for example, you could see that a request that should have contained a credit card number actually did, but without being able to see what the credit card number itself was.

CAL collected an enormous volume of data, which it presented via a web interface organized by date & time, service pool (i.e., the collection of server machines dedicated to some set of service deployments), or service entry point. Although you could pick any particular service call or transaction and follow its individual causal path through the system, the UI was not optimized for doing this -- you basically had to click to follow a call chain downward, but there was no graceful way to go back up. The main thing it did was a kind of what I call "Keynesian analytics": a macro scale aggregation that would show you, within a time window of your choosing, which other services a given service invoked, and how many times. However, what you got was not really a call tree -- if you saw that there were 5,000 calls to service A that resulted in 5,000 calls to service B and and 5,000 calls to service C, you could not conclude that each call to service A resulted in it calling service B and service C once each. The pattern could instead be that it called services A and B twice on some execution paths and not at all on others, or that maybe on one execution path it called service A twice and on the other it called service B twice. What you got was essentially the union of all call trees in one analytics report. It also couldn't distinguish whether a service that was called a given number of times due to traffic load or due to scheduling. For example, you might have two services each of which got invoked 300 times on date X, but service A was a batch job that ran every few minutes whereas service B was reacting to some kind of user input, so that on date Y service A would still have 300 invocations while service B might have 30,000 or none. This made resource planning tricky. Another limitation of CAL was that due to the high volume of data that was collected, data was only kept for a limited time window (a month if I recall correctly) because keeping it longer would be cost prohibitive. This meant that long term analysis could only be done on data-reduced summary records. One of the things I wanted to do was to try to tease out something in between individual service call traces and the kind of keynesian aggregates I described above, in the form of a histogram of call graph patterns. The data was there but not the engineering resources.

Despite these limitations, with such a large volume of data you could see emergent patterns that might have been invisible if the data sets were less comprehensive. Some of these insights were pretty coarse, like the discovery that the rough number of SQL queries to process a payment was probably two or three orders of magnitude larger than it should be. Others were more subtle, such as a case where a service bug was causing very high rates of systematic errors that had gone unnoticed, because, due to robust retry machinery, the errors didn't show up as failed transactions and therefor got overlooked by the normal monitoring and alerting apparatus. You could get lost wallowing in CAL data.

Brian Marick's `trace` package

Brian Marick (who among other things, literally wrote the book on testing), was a consultant to Electric Communities who developed a tracing and logging package for Java that served us well for many years, and to this day lives on in the logging and error reporting machinery inside both the ELib library (what the Java E implementation is built on) and my Elko framework. The trace package allowed you to define individual trace controllers, each of which could be parameterized with its own logging level (the logging levels themselves were the usual info/error/warn/verbose/debug hierarchy) and output sink, so that you could control logging not only by verbosity threshold but by topic area. This kind of control was nice because when you had to turn up the logging level to debug a problem, you only had to turn it up for the subsystems where it was relevant, considerably reducing the volume of log spam you had to sift through.

General thoughts on logging

A lot of logging nowadays seems to be of the form "capture everything now, filter later". This is great in theory but in my experience often results in having to write large checks to companies like Splunk and DataDog, turning logging into a cost center which then starts to get managed by bean counters (following the now paradigmatic cloud computing business pattern of having a low cost of entry which draws in lots of startups as customers, but when those companies' business expands they find themselves locked in to a legacy system that is horrifically expensive to continue to operate). I'm not sure this is a win.

In my own software work, I've found myself using logs a lot for testing and debugging distributed systems. One of the things I don't see in any of the logging frameworks I've used is a distinction between what I'd call routine logging and targetted logging. Typically you get to control the verbosity level of output but not in a way that distinguishes how the data is going to be used. For example, I always want to log message traffic -- what messages were sent, at what time, from what objects to what other objects. It's then helpful, when trying to diagnose specific problems, to inject additional log entries that describe other events or otherwise give some indication of what the various entities were doing; what ends up being useful is the interleaving between these log entries and the message traffic, so that you can see what messages triggered what behavior. However, once a problem is diagnosed, these interleaved log entries have very little value. Unfortunately, developers have a tendency to want to retain the code that generates these log entries, on the theory that they took work to produce and they might be useful in the future. However, I think this is a form of hoarding disorder. These entries (a) rarely prove useful for anything thereafter, and (b) the code to produce them clutters up the source code and impedes its legibility. Consequently, I'm inclined to just remove them from the code entirely. In contrast, I find myself constantly tweaking and tuning the routine logging code to improve its usefulness and information density.

One thing I find myself doing is pulling a log file into a text editor and massaging it to remove the noise and improve the readability. Generally I'm looking at some things and not at others, so what distinguishes signal from noise is usually context dependent, and the context varies so much from case to case that it's often not useful to automate this filtering -- you'll spend more time implementing the filtering than you'll spend doing it by hand, since the very next time the filtering you'll want will be different. On the other hand, sometimes I'll notice I'm doing the same transformations over and over again, and when I notice that I'll try to feed it back to the point where the log entries are being generated in the first place. For example, one thing I'm going to add to the Swingset kernel message logging is a feature that will translate all the various object references into the kernel namespace, even when they're being logged in particular vats, to present the illusion of a single address space even though there are several -- this was an insight from editing logs over and over again. This is a special case of a broader class of transformation I often find myself doing, and it's conceivable that more of this could be done at the source rather than as post processing: transforming long, largely illegible tokens like memory addresses, timestamps, guids, hashstrings, and the like (which are necessarily long and illegible because over time the number of entities that need to be distinguished is vast) into short, pithy identifiers, like two or three character alphanumeric tags, which only need a namespace as large was whatever case is being examined (for example, the number of unique objects in even a fairly complex test case is typically dozens or fewer).

Since I tend to do a lot of hand massaging of log files, I like log entries that lend themselves to this. Consequently, it helps if log entries are in some kind of uniform format (and, it should go without saying, are ASCII text rather than some weird binary representation), so that simple filters like sed scripts and editor macros can easily digest them. On the other hand, it's easy to go overboard with log entry structure -- this is one of the reasons that I absolutely loathe the output of the likes of log4j and all the various things patterned after it. These tools output incredibly verbose log entries that absolutely demand 100% automated processing, but the processing that you actually need to do varies too much from case to case to be 100% automated, and these verbose log formats have way too much junk that has to be removed to render them usable in any given context.

@FUDCo refers to the tracing package that Brian Marick did at Electric Communities. It is still operational in E-on-Java. It is explained at http://www.erights.org/elib/Tracing.html . This page is unattributed but I think it was written by Brian Marick.

Frankly, though it existed in E-on-Java from before the beginning --- having been written first for Original-E --- I never made use of any more than a teeny tiny fraction of its mechanisms or API surface. Agorics used the ELib portion of E-on-Java at scale for the SunSea project, which was invisible to me. @dtribble @Chris-Hibbert , did you ever make use of this tracing system in your use of ELib?

It's definitely the case that Brian Marick's tracing package was overengineered. When at State Software I wrote the first draft of what ultimately became Elko I incorporated a lot of stuff from ELib, including the trace package, but like @erights I only used a tiny portion of the tracing system's affordances. When I had to rewrite essentially all of Elko from scratch (except the open source pieces) at Yahoo! I stripped out all the machinery from the tracing package that I wasn't using, in the interest of having something simple enough to maintain, and never ended up wanting to put any of the stuff I took out back in. So it's possible that my concept of what the tracing package actually is at this point is that residual part that I actually used.

did you ever make use of this tracing system in your use of ELib?

I don't think I ever intereracted with this package.

I like the circular buffer and the logport stuff from foolscap; @michaelfig is that a good fit for the opentelemetry API?

I like the circular buffer and the logport stuff from foolscap; @michaelfig is that a good fit for the opentelemetry API?

Do you mean: can OpenTelemetry support alternative logging backends? They are trying to be provider-neutral, so if there isn't a logging provider that meets our requirements, it might (?) be possible to implement our own that does fancy stuff like that.

However, OpenTelemetry doesn't yet have an API for logging. Supporting logging is one of their goals, just tracing came first, then metrics.

Agoric / agoric-sdk

survey of logging frameworks #1318

Foolscap

The Logging API

Remote Access

Managing the Attention Budget, Retroactive Interestingness

Google's logging

Log4J

"Focus"

Collecting logs

slog(...) vs. console.*

Spans

And now for something completely different...

My takeaways

structured logging in a clinical data warehouse ETL process

GROUSE ETL (70B facts) and eliot

how not to use log4j

Causeway

Causeway references

Original Causeway for E

Causeway for JS

Magic Wormhole timing logging

Event Viewer

Fixed diagram

Utility for the Agoric stack

CAL (PayPal's logging system)

Brian Marick's `trace` package

General thoughts on logging

Agoric / agoric-sdk

survey of logging frameworks #1318

Foolscap

The Logging API

Remote Access

Managing the Attention Budget, Retroactive Interestingness

Google's logging

Log4J

"Focus"

Collecting logs

slog(...) vs. console.*

Spans

And now for something completely different...

My takeaways

structured logging in a clinical data warehouse ETL process

GROUSE ETL (70B facts) and eliot

how not to use log4j

Causeway

Causeway references

Original Causeway for E

Causeway for JS

Magic Wormhole timing logging

Event Viewer

Fixed diagram

Utility for the Agoric stack

CAL (PayPal's logging system)

Brian Marick's trace package

General thoughts on logging

Brian Marick's `trace` package