NWChemEx / ParallelZone

You're travelling through another dimension, a dimension not only of CPUs and threads but of GPUs; a journey into a wondrous land whose boundaries are bandwidth limited. That's the signpost up ahead - your next stop, the ParallelZone!
https://nwchemex.github.io/ParallelZone/
Apache License 2.0
0 stars 1 forks source link

Adding Logging capabilities to runtime #48

Closed wavefunction91 closed 2 years ago

wavefunction91 commented 2 years ago

Adds Logger type and various implementations (STDOUT/ERR, File, Stream, Null) + integration into ResourceSet. Tighter integration must come after ResourceSet is filled out. Added #57 as a cleanup of TODOs for this PR

wavefunction91 commented 2 years ago

This is more or less r2g, just need API input.

wavefunction91 commented 2 years ago

I'm not entirely sure if that's necessary, but I could be convinced otherwise. In my view, a Logger is a type that dumps relevant output (e.g. program progress, debug, etc) to some stream (STDOUT/ERR, log file, etc). This is typically treated as a per-process concept (i.e. progress loggers default to root only, debug dumps to per-rank log files, etc). I'm all for having a "master logger" for e.g. progress output, but the typical use case would dictate that this would be tied to some particular (subset of) rank(s), so it seems that there would be a bit of redundancy.

As for having something like a persistent STDOUT/ERR logger in RuntimeView, we can definitely add it, but wouldn't that be a bit over-engineered? If the dev wants to override some standard progress logging capability and dump directly to STDOUT, wouldn't they just use std::cout?

ryanmrichard commented 2 years ago

TL;DR I would have expected:

runtime().progress_logger(...);
runtime().debug_logger(...);

to be the set of master loggers (which only actually print on rank 0 of the RuntimeView being returned from the opaque runtime() function).

Longer answer:

If I'm understanding you correctly you're thinking:

If so, I'm not comfortable insisting that "progress" information is always global (i.e., every rank would print the same thing) and that "debug" information is always "local" (i.e., conceivably different rank to rank). To me, "progress" information is just the most important information, whereas "debug" information is very verbose information unlikely to be of use except for debugging purposes. With those definitions I think both progress and debug categories will contain examples of global and local information.

I'm also not a fan of two members of the same class, with relatively similar purposes, behaving so differently as I think that is confusing. It's also somewhat inconsistent if you go through the runtime().at() function. For example I think it is reasonable to expect:

runtime().at(i).progress_logger(...);

to have rank i print the provided information (with the information first being sent to rank i if the current process isn't rank i). However, with "progress" only printing on rank 0, then runtime().at(i).progress_logger() is a no-op unless i == 0, which I think is rather unexpected.

IMHO we need both a global and a local logger and I think we need both global and local versions of the progress and debug streams. While conceivably we could get away with just local loggers, I worry that relying exclusively on the runtime().my_resource_set().progress_logger() API is going to lead to either a large number of very redundant output files (each rank-specific file would be basically a copy of the same information) or if the output files are concatenated into one file, a file which contains number-of-mpi-ranks copies of basically the same information (I am assuming that the user saves all of the outputs since they in general will contain some information which is rank-specific).

So I am inclined to suggest that the runtime().my_resource_set().progress_logger() API be used for printing progress information that is specific to the local process, whereas the runtime().progress_logger() API be used for printing information that all ranks of the RuntimeView agree on (by default we could set it up so that only rank 0 actually prints and that all other ranks redirect to /dev/null). When a RuntimeView gets partitioned things get more complicated, but I'm also fine punting on that until we can actually partition the RuntimeView.

wavefunction91 commented 2 years ago

Gotcha, fixed.