CliMA / ClimateMachine.jl

Climate Machine: an Earth System Model that automatically learns from data
https://clima.github.io/ClimateMachine.jl/latest/

Logging framework #134

Open ali-ramadhan opened 5 years ago

ali-ramadhan commented 5 years ago

Maybe more of a question for @charleskawczynski @simonbyrne but any suggestions for a logging library? It's probably a good software practice to move away from print/@printf statements to a logging framework that provides detailed logs for debugging (and for users, with a lower severity).

I quite like Memento.jl but maybe it's better to rely on Julia's built-in logging? The default format isn't very good for detailed logs but I'm sure we can set a custom format with timestamps, line numbers, etc.
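For illustration, a minimal sketch of the kind of customization being asked about here, using the stdlib frontend plus LoggingExtras' FormatLogger (the package choice and the exact format are assumptions, not a settled design):

using Dates, Logging, LoggingExtras

# Route every record through a custom format with a timestamp and source location.
timestamped = FormatLogger(stderr) do io, args
    println(io, Dates.format(now(), "yyyy-mm-dd HH:MM:SS"), " | ",
            args.level, " | ", args._module, " ", args.file, ":", args.line,
            " | ", args.message)
end
global_logger(timestamped)

@info "hello"   # e.g. 2020-01-01 12:00:00 | Info | Main REPL[2]:1 | hello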

simonbyrne commented 5 years ago

I think starting with the built-in logging makes sense. It works pretty well, though there are some rough edges when dealing with multiple processes.

ali-ramadhan commented 5 years ago

Thanks, will look into customizing it!

simonbyrne commented 5 years ago

Re: distributed logging issues

simonbyrne commented 5 years ago

Let's keep this issue open until we've figured it out.

ali-ramadhan commented 5 years ago

Sounds good. I had closed it since you answered my question.

jkozdon commented 5 years ago

Right now our hack in the dycore is to redirect stdout to /dev/null; see here

Would be better to have something like @info and @rootinfo, where @info is the standard logging with a little extra information (like which MPI rank is dumping the data) and @rootinfo has only one MPI rank dump the data; we'd obviously also want all the other functionality of the logging package available via @logmsg.

Could be useful to also have the info statement "registered" in some way so that we know what is dumping the message (atmosdycore, oceandycore, landmodel, etc.). Something along the lines of

# assuming mpirank == 0
julia> @info "message"
[ info, AtmosDycore, 0: message

julia> @rootinfo "message"
[ info, AtmosDycore, --: message

julia>

# assuming mpirank == 3
julia> @info "message"
[ info, AtmosDycore, 3: message

julia> @rootinfo "message"

julia>

I remember thinking a while ago that this should all be possible with a fairly thin layer on top of Logging; I just never quite got around to trying it.
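As a rough illustration of that thin layer (a sketch only: the @rootinfo name and rank lookup are assumptions, it presumes MPI.Init() has been called, and a real version would also forward log levels and key-value pairs):

using Logging, MPI

# Only the root rank emits the message; all other ranks stay silent.
macro rootinfo(msg)
    quote
        if MPI.Comm_rank(MPI.COMM_WORLD) == 0
            @info $(esc(msg))
        end
    end
end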

c42f commented 4 years ago

Hey guys, I just saw this issue, and as the author of the standard logging frontend I think it's flexible enough to do what you want (yes, I am biased ;-) ). If it's not, I think we should tweak it to make it so! To have truly composable logging, I think it's critical that all packages agree on a common front end for emitting logs, with pluggable backends depending on the use case.

The lack of functionality in stdlib Logging backends is mainly because of a lack of clear use cases and distributed logging is no exception. If you can provide a compelling and difficult use case I would love to collaborate on making the system better.

> Could be useful to also have the info statement "registered" in some way

Fairly rich metadata is already present with each log record, including the originating module. You just need to customize the printing (the meta_formatter argument to ConsoleLogger or equivalent in your own logger). Dynamic information such as MPI rank can be harvested by plugging in an appropriate AbstractLogger backend. You don't need to change the frontend logging macro invocations for this. This is important because it means you can automatically attach MPI rank to log records which originate from third party modules which are otherwise outside your control.

It's all about composability.
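For instance, a sketch of the meta_formatter customization described above, producing prefixes along the lines of jkozdon's mock-up (the exact format is illustrative):

using Logging

# Prefix each record with its level and originating module,
# e.g. "[ info, AtmosDycore: message".
function clima_metafmt(level, _module, group, id, file, line)
    color = Logging.default_logcolor(level)
    prefix = string(lowercase(string(level)), ", ", _module, ":")
    return color, prefix, ""
end

global_logger(ConsoleLogger(stderr, Logging.Info; meta_formatter = clima_metafmt))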

smarras79 commented 4 years ago

It seems that this open issue covers what was mentioned at our meeting yesterday. As @simonbyrne asked, the information logged for reproducibility of a test should at least contain the following:

Can Memento be set up to extract these details automatically?

kpamnany commented 4 years ago

My $0.02: the standard logging front-end is the right way forward. An info dump of the details of the run as suggested by Simone should certainly be standardized, but it is orthogonal to the logging mechanism.

For the back-end, though, please do note that as soon as you get up to 64 ranks or more, you essentially have to turn logging off, or redirect to a file and then grep through a bunch of log files. As you get to higher rank counts, you will overwhelm whatever parallel file system you have with the sheer number of log files. And remember that debugging at high levels of parallelism pretty much requires logs.

We have dealt with these issues in the past using Dlog. It does require some helper tools, but nothing else we tried worked at scale. I propose to implement this as a Logging back-end which is selected above some threshold of ranks. Below that threshold, a simple MPI-awareness tweak to the standard back-end will suffice.
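A sketch of that selection logic (DlogLogger here is hypothetical, standing in for a Dlog-backed AbstractLogger that does not exist yet):

using Logging, MPI

function select_logger(comm; threshold = 64)
    if MPI.Comm_size(comm) >= threshold
        return DlogLogger(comm)   # hypothetical Dlog-backed backend
    else
        # below the threshold, a simple MPI-aware tweak to the
        # standard console backend suffices
        return ConsoleLogger(stderr, Logging.Info)
    end
end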

c42f commented 4 years ago

> We have dealt with these issues in the past using Dlog. It does require some helper tools, but nothing else we tried worked at scale. I propose to implement this as a Logging back-end which is selected above some threshold of ranks.

A custom backend sounds ideal for big distributed jobs, and Dlog seems nice and simple (I suppose with sparse files a large SpacePerRank is not really a problem?). Log aggregation / filtering / search could be done as a separate parallel job itself if required. If you need custom tooling, you could also consider using a binary log format to preserve the full Julia type of log records.

It should be easy and scalable to install runtime filtering rules on the task that produces the logs, but I haven't thought about scalability in sinking the resulting records. It might also be handy to sink detailed logs using Dlog while additionally sending a low-volume selection of high-priority warn/error/progress messages across the network for real-time monitoring. LoggingExtras.jl has some tooling for log routing which could be a good starting point for that kind of thing.
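A sketch of that routing idea with LoggingExtras (the file name is illustrative, and the console stands in for the low-volume network sink):

using Logging, LoggingExtras

rank = 0   # e.g. MPI.Comm_rank(comm)

global_logger(TeeLogger(
    # full detail to a per-rank file...
    MinLevelLogger(FileLogger("detailed_rank$(rank).log"), Logging.Debug),
    # ...and only high-priority messages to the low-volume sink
    MinLevelLogger(ConsoleLogger(stderr), Logging.Warn),
))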

simonbyrne commented 4 years ago

Notes from discussion with @vchuravy @leios @charleskawczynski and @skandalaCLIMA:

We can build an "MPI Logger" on top of MPI I/O: specifically open a shared file pointer, and write messages to the file via MPI_FILE_IWRITE_SHARED, printing the source rank along with the message. This will ensure that multiline messages don't get "mixed up" as they do now.

We would need to expose MPI_FILE_IWRITE_SHARED via MPI.jl (though that should be easy now that some of the MPI I/O stuff is exposed), and it might make sense to implement this in a new package.
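A sketch of what such a backend might look like, assuming a binding like MPI.File.iwrite_shared were exposed (as noted above, it is not yet):

using Logging, MPI

struct MPISharedFileLogger <: Logging.AbstractLogger
    fh          # MPI file handle opened with a shared file pointer
    rank::Int
end

Logging.min_enabled_level(::MPISharedFileLogger) = Logging.Info
Logging.shouldlog(::MPISharedFileLogger, args...) = true
Logging.catch_exceptions(::MPISharedFileLogger) = true

function Logging.handle_message(l::MPISharedFileLogger, level, message,
                                _module, group, id, file, line; kwargs...)
    # one write per record, so multiline messages don't interleave across ranks
    record = "[$level, rank $(l.rank)] $message\n"
    MPI.File.iwrite_shared(l.fh, Vector{UInt8}(record))   # hypothetical binding
end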

Some commercial/3rd party tools we might want to explore:

c42f commented 4 years ago

See also the related discussion about Distributed logging at https://github.com/c42f/TerminalLoggers.jl/issues/22

A question for you: will you serialize the log records rather than turning them into text? That would allow much more powerful and natural filtering provided you're willing to build a small amount of tooling.

simonbyrne commented 4 years ago

We could: MPI I/O is just writing bytes (or any other MPI datatype), but we would need to define some sort of header for each message.

c42f commented 4 years ago

Doesn't the Serialization stdlib do most of that work for you?

julia> using Serialization
       io = IOBuffer()
       serialize(io, Any[("message", (x=1.0,i=10,s="blah"))])
       data = take!(io)
       # put data in a file or whatever

julia> deserialize(IOBuffer(data))
1-element Array{Any,1}:
 ("message", (x = 1.0, i = 10, s = "blah"))

simonbyrne commented 4 years ago

I meant that we would need a header to denote what MPI rank the message originated from.

c42f commented 4 years ago

> We could but, at the moment, we have function pointers in the balance law, which I've heard cannot be reliably serialized.

I think Ptr{T} are serialized as null, so depending on what you want to do this could lead to trouble :-)

> I meant that we would need a header to denote what MPI rank the message originated from.

Right :+1: If it's on a per log record basis you could alternatively just tag each message coming from the rank with an mpi_rank=$rank key value pair. That's the kind of thing that extensible key value pairs should work well for and filtering later on would be quite natural.
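For example, with the standard frontend that is just (assuming MPI is initialized):

using MPI

rank = MPI.Comm_rank(MPI.COMM_WORLD)
@info "starting timestep" mpi_rank = rank   # mpi_rank travels with the record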

simonbyrne commented 4 years ago

Ah, cool.

Is there a list of packages which implement different logging backends somewhere? I'd be keen to see what others do.

c42f commented 4 years ago

> Is there a list of packages which implement different logging backends somewhere?

We've got some brand new tooling for REPL-based use (see TerminalLoggers.jl) which I'm excited about, but I'd say we're still figuring out what's possible and desirable for "serious projects". I've got a rough list of tooling at https://github.com/c42f/MicroLogging.jl/#julia-logging-ecosystem but this may not be very complete.

The CLIMA case may be the first large scale HPC application using this stuff. Just in the last week I've been chatting to @tanmaykm about backends for some Julia Computing internal stuff (JuliaTeam/JuliaRun) which I think is another interesting case study and shares some of the same "serious application" needs. Though it runs on completely different infrastructure from what you'll have.

Here's a snippet which would allow you to tag all log messages from a given Task with the mpi rank:

using Logging         # for with_logger and current_logger
using LoggingExtras

# Something like this tool should go in a package somewhere...
function tag_logs(f; kws...)
    function add_keys(log)
        merge(log, (kwargs = merge(values(log.kwargs), values(kws)),))
    end
    with_logger(f, TransformerLogger(add_keys, current_logger()))
end

Then for example (with the standard REPL logger backend):

julia> function run_simulation()
           @info "Starting"
           @warn "A thing"
       end
run_simulation (generic function with 1 method)

julia> run_simulation()
[ Info: Starting
┌ Warning: A thing
└ @ Main REPL[8]:3

julia> mpi_rank = 10; # MPI.Comm_rank(...), I guess

julia> tag_logs(mpi_rank=mpi_rank) do
           run_simulation()
       end
┌ Info: Starting
└   mpi_rank = 10
┌ Warning: A thing
│   mpi_rank = 10
└ @ Main REPL[8]:3

Note that tag_logs (and indeed all loggers) act in dynamic scope, so there's no need for run_simulation() to know about mpi_rank.

trontrytel commented 3 years ago

A silent mode for running the model would be useful for me when generating the tutorial markdown pages. For example, here I just want to showcase the plots; the @info messages just clutter the output:

https://clima.github.io/ClimateMachine.jl/previews/PR1375/generated/Numerics/DGMethods/showcase_filters/
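A simple way to get that silent mode with the stdlib tools (run_simulation is a stand-in for whatever the tutorial actually calls):

using Logging

# drop everything for the duration of the call...
with_logger(NullLogger()) do
    run_simulation()
end

# ...or globally keep only warnings and errors
global_logger(ConsoleLogger(stderr, Logging.Warn))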

jkozdon commented 3 years ago

posting some of my thoughts from slack (really a bump of my previous comment: https://github.com/CliMA/ClimateMachine.jl/issues/134#issuecomment-477271723)

Would certainly be nice to have more fine-grained control than @info and @debug. Maybe a:

Also making it MPI friendly, with something like @rootinfo vs @info, where the former dumps data only on the root rank (what we currently do) while the latter dumps on all ranks (not possible right now).

Awesome bonus would be module specific debug, e.g., dump debug info from the driver but not the mesh.
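The stdlib already supports part of this via the JULIA_DEBUG environment variable (e.g. JULIA_DEBUG=Driver enables @debug output from a module named Driver). For more general routing, a sketch with LoggingExtras (the module name is illustrative):

using Logging, LoggingExtras

# Let Debug records through only when they originate from a module named
# Driver; everything Info and above passes regardless.
driver_debug(log) = log.level > Logging.Debug || nameof(log._module) === :Driver
global_logger(EarlyFilteredLogger(driver_debug, ConsoleLogger(stderr, Logging.Debug)))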

To make all this work we need to be in the habit of curating output statements added to PRs.

simonbyrne commented 3 years ago

There are a couple of different ways we might want to print messages:

There are also fancy things we can do where some output goes to stdout and some to a file (see https://github.com/oxinabox/LoggingExtras.jl).

One question: how should this be exposed at the driver level?