Open termi-official opened 9 months ago
Hi @termi-official, apologies for the delay.
I think what you want should be possible with Caliper, but it may require some custom configuration and queries.
Let's start with the custom num_elements and num_inner_steps annotations. The way you have it, Caliper will create a single record with that information once in each iteration, but will not associate it with any of the other regions in the loop. They should start right at the top of the loop, like so:
CALI_CXX_MARK_LOOP_BEGIN(loop_ann_outer, "Time Loop");
for (auto t = 0.0; t < t_final; t += Δt) {
    timestep_index++;
    // Scoped annotation: stays active until the end of this iteration
    cali::Annotation::Guard
        g( cali::Annotation("num_elements", CALI_ATTR_SKIP_EVENTS).begin(num_elements_local) );
    CALI_CXX_MARK_LOOP_ITERATION(loop_ann_outer, timestep_index);
    //...
    // Begin before the inner loop so that every region inside it carries the value
    cali::Annotation steps_ann("num_inner_steps", CALI_ATTR_SKIP_EVENTS);
    steps_ann.begin(num_inner_steps_local);
    CALI_CXX_MARK_LOOP_BEGIN(loop_ann_inner, "Update Loop");
    for (...) {
        CALI_CXX_MARK_LOOP_ITERATION(loop_ann_inner, ...);
        ...
    }
    CALI_CXX_MARK_LOOP_END(loop_ann_inner);
    steps_ann.end();
}
If you know these ahead of time, you can also put them outside of the loop entirely. The CALI_ATTR_SKIP_EVENTS flag is useful if these annotations just provide additional information and you don't actually want to measure the time for their begin/end regions.
I think the best strategy here is to collect a full profile into a .cali file and run queries on it. Once we have the queries figured out we can create a custom config to run the query online and produce text or json directly.
The config to collect a full profile should look something like this:
CALI_SERVICES_ENABLE=event,mpi,aggregate,timer,recorder
CALI_EVENT_ENABLE_SNAPSHOT_INFO=false
CALI_AGGREGATE_KEY="*,iteration#Time\ Loop"
CALI_MPI_WHITELIST=MPI_Waitall
CALI_RECORDER_FILENAME="report-%mpi.rank%.cali"
The CALI_EVENT_ENABLE_SNAPSHOT_INFO=false setting will disable explicit region begin/end attributes, which are likely just getting in the way for what you want. The CALI_AGGREGATE_KEY field is probably the most obscure one. It's essentially a "group by". The * includes everything except by-value entries. The iteration attributes are by-value entries, so we'll have to add them explicitly. The example above will "group by" everything including the outer loop iteration, so you'll get a time series for the outer loop. Everything in the update loop will get aggregated. If you also need to distinguish the update loop iterations, include that iteration attribute in the aggregate key as well. At that point you might as well record a trace though, unless there are more nested loops with MPI functions or Caliper regions. Don't forget to set either CALI_MPI_WHITELIST or CALI_MPI_BLACKLIST if you want to time MPI functions.
This should produce a .cali file, and you can run cali-query --table or cali-query --tree to see what's in it. It should contain all the information we need, i.e. the regions, MPI functions, your custom annotations, and the loop iterations. From there we can narrow things down with queries. The CalQL documentation https://software.llnl.gov/Caliper/calql.html might be useful for writing those. Also, you can see all the attribute keys in the file with cali-query --list-attributes -t. Maybe you can play around with that. If you have an example of exactly what kind of output you want to see, I'm happy to help design those queries. The queries that generate the loop report, for example, certainly have some quirky stuff.
Thanks for the detailed response, David. This clears up some of my questions. I could also track down that the file size literally exploded without setting CALI_MPI_WHITELIST/CALI_MPI_BLACKLIST. Also, the pointer to CALI_EVENT_ENABLE_SNAPSHOT_INFO is another thing I missed somehow.
For the number of elements, I do not know the number ahead of time as it is dynamically determined through an error estimation procedure.
I have a first workflow where I use cali-query to generate a table, which I then filter with some scripts. I will try it myself with the new information here first and will definitely report back with some examples.
Btw, is it intended that the loop-report does not "see" the mpi.rank variable? It gets replaced with an empty string for me on the current master.
Yes, the loop-report config unfortunately doesn't recognize the %mpi.rank% variable. In fact, it currently doesn't have a flag to split output per rank at all. It should be possible to write a query to produce similar output though.
Hi,
I am trying to benchmark adaptive finite element simulations using Caliper and I am super stuck on finding the correct configuration for Caliper. Since I have been cycling between the documentation page and permuting environment variable combinations for the last 3 days without any progress, I am asking here for help.
Basically what I want is
On a very high level my program looks like this
To be specific, I want to generate a time series with the time spent in MPI_Waitall + selected regions + total time + the 2 annotations per iteration in "Time Loop" to investigate how load imbalances evolve for different load balancing strategies and numbers of processes. So my question is: How can this be achieved with Caliper? I am also happy with some external example from which I can start, or a pointer to the docs page in case I missed something there.
Also related to this, is it possible that the docs are out of date? I could not really figure out where the code for the example here http://software.llnl.gov/Caliper/services.html#example can be found.
What I tried so far
My first try was to just write the raw data and use cali-query to bring it into the correct shape. With this I almost succeeded, but hit hard drive limitations very fast (since I could not figure out how to filter the event traces correctly) and I could not get the exact caliper query. Here is what I tried to generate the data
and for the query
My second attempt was to generate the required data in-situ. Here I first tried to do it via the aggregation service via
Here, no matter what I put into CALI_AGGREGATE_ATTRIBUTES and CALI_AGGREGATE_KEY, I could not get anything meaningful. Furthermore, I do not understand at all what I am doing wrong here and could not deduce it from the docs, because the output is faulty in any case (the number of output columns changes with each iteration and the data starts to interleave). I have just updated to master and can reproduce this.
My latest idea was to make a custom loop-reporter, because it is closest to what I want. However, I was really not sure where I should even start after copy-pasting LoopReportController. I also could not find how to extend the output of the loop controller from the command line, or even just redirect the output to some specific file.
Thanks in advance, Dennis