Feature request: An uncoordinated version of `stdout`

Here's a computational idiom that caused a lot of head-scratching today from a performance debugging perspective:

coforall loc in Locales do loc {
  var t: timer;
  t.start();
  readInData();
  writeln("reading in data took ", t.elapsed(), " seconds");
  var work = getMyWork();
  writeln("locale ", here.id, " is starting to work on ", work);
  doComputation();
  writeln("locale ", here.id, " finished its work at ", t.elapsed(), " seconds");
}

The strange behavior was that some locales took precisely 2x as long as the others do doComputation(); despite the fact that they were working on a problem that was perfectly load-balanced (where the granularity of the timings was "dozens of seconds"). [edit: Note that doComputation() is designed to be a local-only computation with no communication]

[If you like puzzles, you may want to pause here to think about why this would be before proceeding]

I initially ruled out the writeln()s as the cause of the problem, thinking "Well, they require coordination and lead to a potential bottleneck on locale 0, but the overhead shouldn't be enough to result in a 2x overhead, and at worst, it seems like it should only affect locale 0. But in fact it was the writeln(), simply for different reasons: Specifically, once locale 0 started its work, all of its cores were consumed executing tasks within doComputation();, leaving none available to handle the incoming writeln()s. As a result, any locale that didn't get its first two writeln()s handled before locale 0 started its work would be gated, waiting for locale 0 to complete before its writeln()(s) could complete and it could start its work. That means such locales would wait one "doComputation()" amount of time before starting, and then a second to do their own work, resulting in the 2x.

Note that the recent work to create a dedicated AM handler core seems like it might help with this because that core would be able to receive the writeln() requests even if all the other cores were doing computation. But it might not because if (a) it doesn't do the writeln() itself, but queues it up for some other core to do and all the other cores are busy and (b) the original locale will be waiting not for the task to be scheduled, but to complete, then things will still be blocked.

Another stance might be "writeln() just shouldn't be used in performance-sensitive code due to the communication and coordination that it entails to ensure a coherent output stream," and I think this is a reasonable stance, but it's hard to know how to protect users from stepping into this trap apart from documentation that would make them aware of the issue and/or a --performance-warnings mode that would flag uses of writeln() within... what? Timed sections of code? coforalls?

Of course, even with that stance, someone may want to use debugging writeln()s while developing their code and performance debugging it, and for such use cases, we recently talked about whether we should have an output stream—say a per-locale stdout—that could be used for debug printing without as much overhead and coordination, with the obvious downside that things printed simultaneously from different locales may get mixed together on the console. As developers, internally, I think many of us resort to printf() for such cases, but that feels too low-level to expect from users.

Historically, when thinking about the I/O system, I wanted to have stdout work differently than it does today, but getting this right is tricky and I moved on to other tasks. In particular, I've been imagining that each Locale could have its own fileWriter for stdout (and so independently buffer data) but these fileWriters would send the buffered data once there is enough of it. (I had been imagining that this data would be sent to Locale 0, but that's note actually where stdout occurs -- it would need to get sent to the process on the head node -- but for now I am assuming that sending it to Locale 0 would be close enough, at least as first, since slurm/gasnet/whatever forwards the output & the Locale 0 business would handle the interleaving at the Chapel level). My hope has been that we can avoid the output getting interleaved mid-line by only sending the part of the buffer up to the last newline (so a partial line will not be sent). This kind of approach seems interesting but it would lead to different interleaving of the lines than we get today, and it's not so clear to me if that is acceptable or not. (And, for that matter, I also don't know if it is a capability that already exists in slurm/gasnet/...).

chapel-lang / chapel

Feature request: An uncoordinated version of `stdout` #22132