Closed pwm1234-sri closed 2 years ago
Well, normally, adding a new component for another marker API is pretty straight-forward, e.g. the ones that already exist for Caliper, TAU, VTune, NVTX, etc., but the way OpenCV implements it makes it more difficult than normal. cv::utils::trace::details::Region has an extremely restricted interface: Region(const LocationStaticStorage& location), and that is it. LocationStaticStorage is quite complicated. See CV__TRACE_DEFINE_LOCATION_ and CV__TRACE_DEFINE_LOCATION_FN. Currently, doing this will require a heap allocation until I add a new bundle variant for handling components with this sort of non-default-ctor restriction.
Basically the component would look like this:
TIMEMORY_DECLARE_COMPONENT(opencv_trace)

#if !defined(USE_OPENCV)
TIMEMORY_DEFINE_CONCRETE_TRAIT(is_available, component::opencv_trace, false_type)
#endif

#if defined(USE_OPENCV)
namespace tim
{
namespace component
{
struct opencv_trace : public base<opencv_trace, void>
{
    using region_t   = CV_TRACE_NS::details::Region;
    using location_t = region_t::LocationStaticStorage;

    opencv_trace()                    = default;
    ~opencv_trace()                   = default;
    opencv_trace(const opencv_trace&) = delete;
    opencv_trace(opencv_trace&&)      = default;
    opencv_trace& operator=(const opencv_trace&) = delete;
    opencv_trace& operator=(opencv_trace&&) = default;

    static std::string label() { return "opencv_trace"; }
    static std::string description() { return "Provides OpenCV tracing API"; }

    void start(const location_t& _loc)
    {
        if(!m_data)
            m_data.reset(new region_t{ _loc });
    }
    void stop() { m_data.reset(); }

private:
    std::unique_ptr<region_t> m_data = {};
};
}  // namespace component
}  // namespace tim
#endif
And now you can pass a LocationStaticStorage to the start(...) function of the bundle. But you'll need something to reproduce the construction of LocationStaticStorage, like this:
namespace  // ensure instantiation is unique to each compiled file
{
template <size_t, char>  // template on line number and first char of name
const auto&
get_cv_trace_define_location_fn(const char _name[], const char _file[], int _line,
                                int _flags)
{
    // from CV__TRACE_DEFINE_LOCATION_
    static CV_TRACE_NS::details::Region::LocationExtraData* _data = nullptr;
    static const CV_TRACE_NS::details::Region::LocationStaticStorage _inst = {
        &(_data), _name, _file, _line, _flags
    };
    return _inst;
}
}  // namespace

#define CV__TRACE_INLINE_LOCATION_FN(name, flags)                                        \
    get_cv_trace_define_location_fn<__LINE__, name[0]>(                                  \
        name, CV_TRACE_FILENAME, __LINE__,                                               \
        (flags | CV_TRACE_NS::details::REGION_FLAG_FUNCTION))
and finally, usage would resemble this (which could be hidden behind a macro):
tim::component_tuple<opencv_trace> _obj{ "opencv_trace_example" };
_obj.start(CV__TRACE_INLINE_LOCATION_FN("opencv_trace_example", 0));
If you look at #define CV__TRACE_REGION(...) vs. #define CV__TRACE_OPENCV_FUNCTION() in core/utils/trace.hpp, you will see that CV__TRACE_OPENCV_FUNCTION is essentially just CV__TRACE_REGION(__FUNCSIG__).
There is a templated tim::component::data_tracker<Tp, Tag> where Tp is the data type you want to track and Tag is just an arbitrary struct to differentiate component X that tracks ints from component Y that tracks ints. You basically put it in a bundle and call the store(...) function.
It can be quite useful to create a dedicated auto_tuple bundle for just the components which track data, because you can append to the history in a single line: the auto_* bundles call start() and stop() at construction/destruction.
This is probably best demonstrated with an example.
using namespace tim::component;

// component_tuple requires explicit start; stop is optional (it will call
// stop if started when it gets destroyed)
using bundle_t = tim::component_tuple<opencv_trace, data_tracker_integer>;

// auto_tuple automatically calls start and stop
using tracker_t = tim::auto_tuple<data_tracker_integer>;

// auto_* bundles will write to stdout when destroyed
tim::settings::destructor_report() = true;

bundle_t _obj{ "opencv_test" };
_obj.start();
for(int i = 0; i < 10; ++i)
{
    long ans = fibonacci(10) + fibonacci(10 + (i % 3));
    // store a single iteration
    tracker_t{ TIMEMORY_JOIN("", _obj.key(), "#", i % 3) }.store(ans);
    // the join macro is like Python's "#".join(...)
    // accumulate into parent bundle and demonstrate using a lambda
    // specifying how to update the variable
    _obj.store([](long cur, long upd) { return cur + upd; }, ans);
}
The result:
>>> opencv_test#0 : 110 data_integer [laps: 1]
>>> opencv_test#1 : 144 data_integer [laps: 1]
>>> opencv_test#2 : 199 data_integer [laps: 1]
>>> opencv_test#0 : 110 data_integer [laps: 1]
>>> opencv_test#1 : 144 data_integer [laps: 1]
>>> opencv_test#2 : 199 data_integer [laps: 1]
>>> opencv_test#0 : 110 data_integer [laps: 1]
>>> opencv_test#1 : 144 data_integer [laps: 1]
>>> opencv_test#2 : 199 data_integer [laps: 1]
>>> opencv_test#0 : 110 data_integer [laps: 1]
[data_integer]|0> Outputting 'timemory-ex-derived-output/data_integer.json'...
[data_integer]|0> Outputting 'timemory-ex-derived-output/data_integer.tree.json'...
[data_integer]|0> Outputting 'timemory-ex-derived-output/data_integer.txt'...
|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| STORES SIGNED INTEGER DATA W.R.T. CALL-GRAPH |
|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | STDDEV | % SELF |
|---------------------|------------|------------|--------------|------------|------------|------------|------------|------------|------------|------------|
| >>> opencv_test | 1 | 0 | data_integer | | 1469 | 1469 | 1469 | 1469 | 0 | 0 |
| >>> |_opencv_test#0 | 4 | 1 | data_integer | | 440 | 110 | 110 | 110 | 0 | 100 |
| >>> |_opencv_test#1 | 3 | 1 | data_integer | | 432 | 144 | 144 | 144 | 0 | 100 |
| >>> |_opencv_test#2 | 3 | 1 | data_integer | | 597 | 199 | 199 | 199 | 0 | 100 |
|---------------------------------------------------------------------------------------------------------------------------------------------------------|
So I am currently working on getting support for timemory into a python project called Hatchet which essentially takes the *.tree.json
output files and converts them into a Pandas dataframe for extended analysis. Also, in the longer term, we will be developing a framework for performance unit testing which will automate historical comparisons but, at present, the roll-your-own-solution would be to just import multiple jsons into Python and do a comparison there. The regular .json
files (w/o .tree
) provide the results in a flat JSON array so it is relatively easy to traverse (the json trees require recursion to process).
Also, you can specify TIMEMORY_INPUT_PATH=/path/to/some/folder/of/results
+ TIMEMORY_DIFF_OUTPUT=ON
in the environment (or the corresponding tim::settings::input_path()
, etc.) and when the results are finalized, timemory will search that input folder, try to find a corresponding input, and then generate additional *.diff.*
files in the output folder with the difference between any matches it can find.
One option is to prefix the files with some key (such as the PID):
tim::settings::output_prefix() = std::to_string(tim::process::get_id()) + "-";
Another option is to specify TIMEMORY_TIME_OUTPUT=ON in the env (tim::settings::time_output()) and that will create subdirectories which are time-stamped. This time-stamp gets fixed for the process the first time settings::get_global_output_prefix() is called while time-output is enabled. Also, there is a time_format() in settings which allows you to customize the time-stamping format using strftime format specifiers.
You might find it helpful to peruse some of the unit-tests, such as the one for data_tracker.
Thank you for your detailed answer.
Regarding (2), counting user events, I will look at data_tracker and the unit test. I will also look over some of the other unit tests and examples. Are there other examples or tutorials describing how to use the timemory api to provide finer control over the instrumentation?
Regarding (3), profiling a long-running process: simplistically, I think all I am looking for is to periodically restart the timemory system within the running process. For example, rather than waiting for the end of a 24-hour-long process, I call finalize to output the current results and then re-initialize to start a new collection? Is it possible to call finalize and init multiple times within a single process? Of course, I will have to change the output path so the previous results are not overwritten. Is there a better way to restart timemory?
Regarding (1), I think you missed the point of my question about OpenCV's tracing api. (Thank you. You did give me a lot of helpful information, and I will study it carefully as I learn more about timemory.) I am not trying to use timemory to call opencv's tracing function, I simply want to use timemory push/pop region functionality in the same, easy way that CV_TRACE_REGION_NEXT
works. That is, rather than having to do the two step (1) pop the current region and (2) push a new region, trace region next does the two steps in one.
My guess is that this would be more easily achieved using the timemory C++ API than its library API calls to timemory_push_region and timemory_pop_region, but I am not familiar with the timemory C++ API. What is a good example to study? The region-next functionality needs only a little more than a simplistic RAII class that calls push-region in the constructor and pop-region in the destructor; but it additionally needs a next method that calls pop and then push.
At this point, the C++ template API doesn't have an official tutorial. There are some docs on the API, and the examples and tests, but that's currently all that's available. Unfortunately, right as we were introducing timemory via its first presentation at the International Supercomputing Conference in Germany, the pandemic hit, and it's quite difficult to give hands-on tutorials virtually (which is when those materials would be created). Once travel resumes, I expect quite a bit of material will be created for this, but without the feedback these events provide on what people want/need to know, creating those materials is a bit of a guessing-game.
First off, do not call timemory_finalize() and then try to use timemory after that point. That routine is designed to delete a lot of things and segfaults are basically a certainty. However, you can pretty much dump the current state of the call-stack storage at any point for any thread on any process any time you want, as long as you are fine with the side-effects (these can be suppressed via settings::stack_clearing() = false). And using this approach, you can have full control of where the files get written. The side-effects are:

- storage<Tp>::instance()->get() on the master thread will merge in the child threads and clear their storage
- storage<Tp>::instance()->dmp_get() (distributed memory parallelism, i.e. MPI or UPC++) on rank zero will return the storage from all processes

The best references for how to dump outputs are in the source/timemory/operations/types/finalize folder. The forward declarations (and a few definitions) are here. You can see examples of how they are invoked in source/timemory/storage/definition.hpp (and further down: here from 753 to 926)
If you aren't used to template meta-programming, you may find it difficult to create a generic function which does some operation on all components -- if that happens, just let me know and I can provide you with something very easily. The python bindings probably do what you are looking for, e.g.:
import timemory

def write_json(fname, specific_types=[], hierarchy=False):
    data = timemory.get(hierarchy=hierarchy, components=specific_types)
    with open(fname, "w") as f:
        f.write(data)
And those bindings are here from 99 to 183 and would be invoked like this (minus the stuff specific to python in _as_json
):
auto _as_json_classic = [](const std::set<int>& _types) {
    using tuple_type = tim::convert_t<tim::available_types_t, std::tuple<>>;
    auto json_str    = manager_t::get_storage<tuple_type>::serialize(_types);
    return json_str;
};

auto _as_json_hierarchy = [](const std::set<int>& _types) {
    std::stringstream ss;
    {
        using policy_type = tim::policy::output_archive_t<tim::manager>;
        auto oa           = policy_type::get(ss);
        oa->setNextName("timemory");
        oa->startNode();
        get_json(*oa, _types, std::make_index_sequence<TIMEMORY_COMPONENTS_END>{});
        oa->finishNode();
    }
    return ss.str();
};

auto _as_json = [_as_json_classic, _as_json_hierarchy](bool hierarchy,
                                                       py::list pytypes) {
    auto types = pytim::get_enum_set(pytypes);  // convert py::list into std::set<int>
    auto json_str = (hierarchy) ? _as_json_hierarchy(types) : _as_json_classic(types);
    auto json_module = py::module::import("json");
    return json_module.attr("loads")(json_str);
};
Ah yes indeed I did miss the point of (1). My gut feels like providing this functionality would be problematic in terms of introducing unnecessary overhead. Let me think about it more and get back to you.
I understand that conferences and papers are a big driver of generating documentation. If you want a reviewer (or trial attendee), I will volunteer!
For what it is worth, I would like to see a timemory version of what is in the opencv tracing example, https://github.com/alalek/opencv/blob/trace/samples/cpp/application_trace.cpp. I would be happy to contribute. And from the perspective of this demo, I think the benefits of the simplicity of the trace-next capability far outweigh the performance overhead. Of course I understand that you do not want to have a deliberately inefficient api call; but you also have to balance that with the maxim of making simple things simple.
I think this application_trace example would also be a good demonstration of writing the output while processing. If there are a million frames, then one might want to write the output every hundred thousand frames to make sure the processing performs as expected over the range of frames. (And, so you get intermediate results in case something happens and the application code crashes before finalizing - a very real possibility in many cases.)
Have you seen the examples in ex-minimal? I probably do need to address the examples -- even I look at them sometimes and find that some of them are unnecessarily complex.
In early versions, they were also used for testing; e.g. ex-cxx-basic, ex-cxx-overhead, and ex-cuda-event still have the old RUN_TEST(...) stuff that was hacked together before I migrated to Google-Test.
In the unit-tests, I frequently dump the storage mid-run to verify things are being collected as expected, so there isn't any real concern about the functionality working. However, I have done a poor job of making it clear which examples/tests showcase which features, and I expect my former intern @mhaseeb123 (who generated almost all of the README.md files in the examples) probably had that issue too.
W.r.t. the trace-next thing, I am not sure I quite see how introducing a new macro is better than the unnecessary curly braces comment.
The OpenCV implementation is much more simplistic than timemory and doesn't have to deal with things like this:
using foo_t = auto_tuple<wall_clock, cpu_clock>;
using bar_t = auto_tuple<wall_clock, peak_rss>;

TIMEMORY_BLANK_MARKER(foo_t, "A");
// ... something ...
TIMEMORY_BLANK_MARKER_NEXT(bar_t, "B");
// ... something ...
would the theoretical TIMEMORY_BLANK_MARKER_NEXT close the scope of the wall_clock and cpu_clock components in foo_t? Or just close wall_clock but not close the scope of the cpu_clock component?
Also, their implementation only uses that single Region class, so it is relatively trivial for them to just keep a Region* "parent" record (or however they do it), whereas the modularity of timemory makes parent-child relationships more complex to track. The only place it is really tracked is in the storage class, and storage is optional: you can use wall_clock directly or lightweight_tuple<wall_clock> and there is absolutely zero implicit interaction with storage. Further complicating the parent-instance issue is that timemory fully supports asynchronous usage, and the "*_NEXT" things could cause all sorts of issues with usage like this:
TIMEMORY_BLANK_MARKER(foo_t, "A");
auto _func = [=]()
{
TIMEMORY_BLANK_MARKER_NEXT(foo_t, "B");
// ... etc. ...
};
auto fut_0 = std::async(std::launch::async, _func);
auto fut_1 = std::async(std::launch::deferred, _func);
TIMEMORY_BLANK_MARKER_NEXT(foo_t, "C");
fut_0.wait();
fut_1.wait();
In some instances, "C" will close "A", but the first async might close "A" if it is launched immediately. Or the first async might be launched in another thread and would then be closing the scope of something in another thread. Or you might have a data race to close "A", etc.
I think OpenCV doesn't have to worry about this the way timemory does because they have much more complete control over how threads are used internally, and the framework itself probably imposes restrictions about running OpenCV within an external threading system (e.g. set OpenCV to serial within OpenMP, etc.).
Thanks again for your detailed explanation. I see that the simplistic opencv approach may not be appropriate for the general-purpose timemory api. But for simple application code, I usually have a top-level function very much like the main function in the [opencv trace example](https://github.com/alalek/opencv/blob/006966e6295f44e48ade066d4eb56826439680fb/samples/cpp/application_trace.cpp#L22). Most of the time that function is:
which very closely matches the opencv example. Putting trace next inside the main processing loop and the function of pipeline processing calls is appealing. Avoiding the unnecessary curly braces just makes it easy to patch in performance instrumentation calls without having to change anything else. Using curly braces for scope between steps means you have to restructure any local variables that may be shared between steps. Or having to manually add pop/push pairs (which is not a big deal, just not as simple as trace-next).
And I am belaboring this point (for this last time) because we often use opencv and may have code where people will use the opencv tracing capability. I want to replace or re-implement their usage of opencv tracing with timemory. For simple instrumentation needs, it would be nice to simply redefine the opencv macros so they are implemented using timemory. But I will now drop this topic, as you have helped me greatly to better understand how timemory differs from opencv tracing. I hope that I have at least helped contribute to future documentation or conference tutorial material.
I have looked at ex-minimal, but I have not studied it enough to reasonably understand it. I think my next timemory steps are to:
If you have a suggestion for how I can at least contribute some documentation material, I will certainly try to help at least keep track of all the great feedback you have given to me. I do appreciate your time!
I think the solution for the opencv-tracing would be create a new component bundle type. Theoretically, it would look like this (this is mostly for me while the idea is in my head):
namespace quirk
{
struct stop_last_instance
{};
}  // namespace quirk

namespace internal
{
template <typename Tp>
auto*&
last_instance()
{
    static thread_local Tp* _instance = nullptr;
    return _instance;
}
}  // namespace internal

template <typename... Types>
struct history_bundle : component_bundle<project::timemory, Types...>
{
    static constexpr bool has_stop_last_instance =
        is_one_of<quirk::stop_last_instance, type_list<Types...>>::value;
    static constexpr bool has_explicit_start =
        is_one_of<quirk::explicit_start, type_list<Types...>>::value;

    using base_type = component_bundle<project::timemory, Types...>;
    using this_type = history_bundle<Types...>;
    using next_type = conditional_t<has_stop_last_instance, this_type,
                                    history_bundle<Types..., quirk::stop_last_instance>>;
    using auto_type = this_type;

    template <typename... Args>
    history_bundle(Args&&... args)
    : base_type(std::forward<Args>(args)...)
    {
        auto*& _last = internal::last_instance<next_type>();
        IF_CONSTEXPR(has_stop_last_instance)
        {
            if(_last)
                _last->stop();
        }
        _last = this;
        IF_CONSTEXPR(!has_explicit_start) { this->start(); }
    }

    ~history_bundle()
    {
        auto*& _last = internal::last_instance<next_type>();
        if(_last == this)
            _last = nullptr;
    }
};

// todo: add a safer way to check there is a 'next_type' alias
// and use a static_assert with a clear error message
#define TIMEMORY_NEXT_TYPE(TYPE) typename TYPE::next_type

#define TIMEMORY_BLANK_MARKER_NEXT(TYPE, ...)                                            \
    _TIM_STATIC_SRC_LOCATION(blank, __VA_ARGS__);                                        \
    TIMEMORY_NEXT_TYPE(TYPE)                                                             \
    _TIM_VARIABLE(__LINE__)(TIMEMORY_CAPTURE_ARGS(__VA_ARGS__))
Take a gander at the macros in source/timemory/variadic/macros.hpp and let me know which ones you would like to have _NEXT variants for, just in case I don't want to make this available for all of them.
> If you have a suggestion for how I can at least contribute some documentation material
I was probably going to add most/all of these questions + responses to FAQ Wiki but I think if you could just keep a list of notable examples of features in the examples/tests that you come across would be tremendously helpful in creating a "Examples of Features Cross-Reference" page in the wiki.
E.g.
storage<T>::instance()->mpi_get()
...
At this time I do not believe I understand the API well enough to give you an informed opinion on which macros I would like to have _NEXT variants for.
Given the subtleties of the richer API of timemory over the opencv tracing api, I am going to defer the next topic and concentrate on the topics of 1) writing output while running, and 2) data tracker. These are more relevant to my current needs.
I wrote an example for you that I will include in the next PR which:

- data_tracker_unsigned (i.e. data_tracker<size_t>)
- std::chrono::time_point<std::chrono::system_clock> (i.e. data_tracker<time_point>)
- ... += itself to that address.

The storage class is just implementation sugar -- components don't rely on it for any functionality whatsoever. As you will see in the example, you can create a bundle, store each bundle instance in a regular STL vector and then manipulate it later -- all the storage class really does is provide a graph instead of a vector.

cmake_minimum_required(VERSION 3.11 FATAL_ERROR)
project(timemory-Array-of-Bundles-Example LANGUAGES CXX)
# set the name of the target you want timemory to add include-directories,
# libraries, etc. to. Just link this library to your target and it will
# include all the necessary build data
set(timemory_FIND_COMPONENTS_INTERFACE timemory-array-of-bundles-example)
# find the timemory package and specify you want the C++ library (cxx)
find_package(timemory REQUIRED
COMPONENTS cxx)
# create your executable
add_executable(ex_array_of_bundles
ex_array_of_bundles.cpp)
# link your executable to the interface target created by timemory
target_link_libraries(ex_array_of_bundles
PRIVATE timemory-array-of-bundles-example)
# optionally install this executable
install(TARGETS ex_array_of_bundles DESTINATION bin OPTIONAL)
#include "timemory/timemory.hpp"
#include <chrono>
#include <iostream>
// Example demonstrating:
// 1. Differences between bundles
// 2. How to control the output
// 3. Creating custom data trackers
using tim::component::cpu_clock;
using tim::component::data_tracker;
using tim::component::data_tracker_unsigned;
using tim::component::wall_clock;
// alias to a time-point type
using time_point_t = std::chrono::time_point<std::chrono::system_clock>;
// alias the templated data-tracker component for storing time-stamps
using time_point_tracker = data_tracker<time_point_t>;
// TODO: remove the need for this
// Below is due to a bug in data-tracker
namespace std
{
time_point_t
operator/(const time_point_t& _tp, const int64_t&);
//
ostream&
operator<<(ostream& _os, const time_point_t&);
} // namespace std
int
main(int argc, char** argv)
{
// configure these default settings before timemory_init to allow for
// environment overrides
tim::settings::cout_output() = true; // print to stdout
tim::settings::text_output() = true; // write text file
tim::settings::json_output() = false; // don't write flat json files
tim::settings::tree_output() = false; // don't write hierarchical json files
tim::settings::time_output() = false; // time-stamp the output folder
tim::settings::flamegraph_output() = false; // don't write flamegraph outputs
//--------------------------------------------------------------------------------//
// THINGS TO TRY:
// - Use timemory-avail command-line tool with -S option to view available settings
// - Add -d option to above to see descriptions of the settings
//--------------------------------------------------------------------------------//
// initialize timemory
tim::timemory_init(argc, argv);
//---------------------------------------------------------------------//
// Demonstrate the difference b/t these two bundles
//---------------------------------------------------------------------//
// this bundle does not implicitly store persistent data. All data will
// be local to the components in this bundle unless "push()" and "pop()"
// are explicitly called. Thus, this bundle is useful if you want to
// handle your own data-storage or do one initial push, and then take
// several short measurements before doing one final pop which updates
// the global call-stack stack with the final value.
using lbundle_t = tim::lightweight_tuple<wall_clock, cpu_clock, data_tracker_unsigned,
time_point_tracker>;
// note: time_point_tracker can be used in lightweight_tuple as long as
// it does not interact with storage. Interacting with storage requires
// some overloads and specializations due to the fact that time_point_t
// doesn't support operator such as +=, -=, <<, and so on. It is
// safe to remove from above if you want to experiment with pushing and
// popping (nothing else below needs to be modified beyond that)
// this bundle will implicitly "push" to persistent storage when start is
// called. A "push" operation makes a bookmark in the global call-stack.
// When stop is called, this bundle will implicitly "pop" itself off the
// global call-stack. As part of the "pop" operation, this component will
// add its internal measurement data to the component instance in the global
// call-stack and then reset its values to zero.
using cbundle_t = tim::component_tuple<wall_clock, cpu_clock, data_tracker_unsigned>;
//--------------------------------------------------------------------------------//
// THINGS TO TRY:
// - add "cpu_util" to one or both of the aliases above
// - add "user_global_bundle" to one or both of the alias above
// - set environment variable TIMEMORY_GLOBAL_COMPONENTS to "system_clock,
// user_clock"
// and run the application
// - user_global_bundle::configure<system_clock>()
//--------------------------------------------------------------------------------//
// standard container for storing each iteration instance
std::vector<lbundle_t> _ldata{};
std::vector<cbundle_t> _cdata{};
// number of iterations
int nitr = 5;
if(argc > 1)
nitr = std::stoi(argv[1]);
int ndump = 1;
if(argc > 2)
ndump = std::max<int>(1, std::stoi(argv[2]));
for(int i = 0; i < nitr; ++i)
{
std::cout << "Performing iteration " << i << "..." << std::endl;
// the base label for the measurements
std::string _label = TIMEMORY_JOIN("#", "iteration", i);
// create an instance for this loop and add a custom suffix for the
// different types. When the application terminates, you will see
// the "component_tuple" entries in the final report only.
_ldata.emplace_back(lbundle_t{ _label + "/lightweight_bundle" });
_cdata.emplace_back(cbundle_t{ _label + "/component_tuple" });
//----------------------------------------------------------------------------//
// THINGS TO TRY:
// _ldata.back().push();
//----------------------------------------------------------------------------//
// start the measurements
_ldata.back().start();
_cdata.back().start();
// get a time-stamp
time_point_t _now = std::chrono::system_clock::now();
// store the time-stamp in time_point_tracker
_ldata.back().store(_now);
// convert the time-stamp to the number of seconds since the epoch
size_t _epoch =
std::chrono::duration_cast<std::chrono::seconds>(_now.time_since_epoch())
.count();
// store the epoch value in data_tracker_unsigned
_ldata.back().store(_epoch);
_cdata.back().store(_epoch);
// sleep for one second for half the iteration
if(i < nitr / 2)
std::this_thread::sleep_for(std::chrono::seconds(1));
else // consume cpu-cycles for the other half
while(std::chrono::system_clock::now() <
(_now + std::chrono::milliseconds(1005)))
;
// stop the measurements
_ldata.back().stop();
_cdata.back().stop();
//----------------------------------------------------------------------------//
// THINGS TO TRY:
// _ldata.back().pop();
//----------------------------------------------------------------------------//
// skip based on modulus
if(i % ndump != 0)
continue;
// write a json serialization of the global call-stack
std::string outfname = tim::settings::compose_output_filename(
TIMEMORY_JOIN('-', "iteration", i, "results"), ".json");
std::ofstream ofs(outfname);
if(ofs)
{
std::cout << "--> Writing " << outfname << "..." << std::endl;
ofs << tim::manager::get_storage<tim::available_types_t>::serialize();
}
}
// write a json serialization of the global call-stack
if(nitr % ndump == 0)
{
// basename for the file that will be written
std::string _bname = TIMEMORY_JOIN('-', "iteration", nitr, "results");
// compose_output_filename appends the output_path and output_prefix
// in settings to the first parameter, appends a time-stamped directory
// if that is enabled, and so on. Then, it just tacks on the extension provided.
std::string outfname = tim::settings::compose_output_filename(_bname, ".json");
std::ofstream ofs(outfname);
if(ofs)
{
std::cout << "--> Writing " << outfname << "..." << std::endl;
// tim::available_types_t is just a tim::type_list<...> of all the
// components that were not marked as unavailable (and thus
// definitions are provided for them). This manager function opens a
// cereal::JSONOutputArchive on a stringstream and does a
// compile-time loop over all of those types. For each type, it
// checks if the storage has any data and if it does, it serializes
// the call-graph for that type into the archive. Once the loop
// is completed, it converts the sstream into a string and returns
// the string.
ofs << tim::manager::get_storage<tim::available_types_t>::serialize();
//------------------------------------------------------------------------//
// THINGS TO TRY:
// - replace available_types_t with specific components, e.g. wall_clock
//------------------------------------------------------------------------//
}
}
// use the stream operator of the bundle to print out data
size_t _cnt = 0;
std::cout << "\nLIGHTWEIGHT BUNDLE DATA:" << std::endl;
for(auto& itr : _ldata)
std::cout << _cnt++ << " : " << itr << std::endl;
_cnt = 0;
std::cout << "\nCOMPONENT TUPLE DATA:" << std::endl;
for(auto& itr : _cdata)
std::cout << _cnt++ << " : " << itr << std::endl;
//--------------------------------------------------------------------------------//
// THINGS TO TRY:
// - add this to top of file after the include statement:
//
// TIMEMORY_DECLARE_COMPONENT(fake_component)
// TIMEMORY_DEFINE_CONCRETE_TRAIT(is_available, component::fake_component,
// false_type)
//
// - add fake_component to the bundles and then compile and run.
// - you will not see anything related to fake_component
// - a component marked as unavailable is never instantiated in the template -> no
// definition of the type is ever needed
//
// - change "false_type" to "true_type" and then try to compile (hint: will fail)
// - when a component is available, a definition will be needed
//--------------------------------------------------------------------------------//
// demonstrate data access
_cnt = 0;
std::cout << "\nLIGHTWEIGHT BUNDLE DATA ACCESS:" << std::endl;
for(auto& itr : _ldata)
{
// get the instance of a wall-clock component in the bundle
wall_clock* _wc = itr.get<wall_clock>();
// since wall-clock was part of a "tuple" type, this will always
// be a valid pointer as long as tim::is_available<wall_clock>::value
// is true at compile-time. If _ldata was a "list" bundle, this
// would be a null pointer unless you previously called
// _ldata.initialize<wall_clock>()
if(!_wc)
continue;
// the "get()" member function returns the accumulated measurement value
double _elapsed = _wc->get();
// units are static properties of type so convert this measurement to milliseconds
_elapsed *= tim::units::msec / wall_clock::unit();
// get the instance of the unsigned data-tracker component
time_point_tracker* _tp = itr.get<time_point_tracker>();
if(!_tp)
continue;
// get the time-point value
time_point_t _point = _tp->get();
// convert to time_t
std::time_t _today = std::chrono::system_clock::to_time_t(_point);
std::string _ctime = std::ctime(&_today);
// report when the measurement was taken and how long it took.
std::cout << _cnt++ << " : " << itr.key() << " started on "
<< _ctime.substr(0, _ctime.find('\n')) << " and took " << _elapsed
<< " milliseconds" << std::endl;
}
puts("\nFinalizing...\n");
// generate final reports, cleanup memory
tim::timemory_finalize();
//----------------------------------------------------------------------------//
// THINGS TO TRY:
// - view the text files
// - enable json output and import the data into python
//----------------------------------------------------------------------------//
puts("Done!");
}
// TODO: remove the need for this
// Below is due to a bug in data-tracker
namespace std
{
time_point_t
operator/(const time_point_t& _tp, const int64_t&)
{
return _tp;
}
//
std::ostream&
operator<<(std::ostream& _os, const time_point_t&)
{
return _os;
}
} // namespace std
$ ./ex_array_of_bundles 10 2
Performing iteration 0...
--> Writing timemory-ex-array-of-bundles-output/iteration-0-results.json...
Performing iteration 1...
Performing iteration 2...
--> Writing timemory-ex-array-of-bundles-output/iteration-2-results.json...
Performing iteration 3...
Performing iteration 4...
--> Writing timemory-ex-array-of-bundles-output/iteration-4-results.json...
Performing iteration 5...
Performing iteration 6...
--> Writing timemory-ex-array-of-bundles-output/iteration-6-results.json...
Performing iteration 7...
Performing iteration 8...
--> Writing timemory-ex-array-of-bundles-output/iteration-8-results.json...
Performing iteration 9...
--> Writing timemory-ex-array-of-bundles-output/iteration-10-results.json...
LIGHTWEIGHT BUNDLE DATA:
0 : >>> iteration#0/lightweight_bundle : 1.002 sec wall, 0.000 sec cpu, 1605762965 data_unsigned, [laps: 1]
1 : >>> iteration#1/lightweight_bundle : 1.001 sec wall, 0.000 sec cpu, 1605762966 data_unsigned, [laps: 1]
2 : >>> iteration#2/lightweight_bundle : 1.001 sec wall, 0.000 sec cpu, 1605762967 data_unsigned, [laps: 1]
3 : >>> iteration#3/lightweight_bundle : 1.004 sec wall, 0.000 sec cpu, 1605762968 data_unsigned, [laps: 1]
4 : >>> iteration#4/lightweight_bundle : 1.000 sec wall, 0.000 sec cpu, 1605762969 data_unsigned, [laps: 1]
5 : >>> iteration#5/lightweight_bundle : 1.005 sec wall, 1.020 sec cpu, 1605762970 data_unsigned, [laps: 1]
6 : >>> iteration#6/lightweight_bundle : 1.005 sec wall, 1.000 sec cpu, 1605762971 data_unsigned, [laps: 1]
7 : >>> iteration#7/lightweight_bundle : 1.005 sec wall, 1.000 sec cpu, 1605762972 data_unsigned, [laps: 1]
8 : >>> iteration#8/lightweight_bundle : 1.005 sec wall, 1.000 sec cpu, 1605762973 data_unsigned, [laps: 1]
9 : >>> iteration#9/lightweight_bundle : 1.005 sec wall, 1.000 sec cpu, 1605762974 data_unsigned, [laps: 1]
COMPONENT TUPLE DATA:
0 : >>> iteration#0/component_tuple : 1.001 sec wall, 0.000 sec cpu, 1605762965 data_unsigned [laps: 1]
1 : >>> iteration#1/component_tuple : 1.001 sec wall, 0.000 sec cpu, 1605762966 data_unsigned [laps: 1]
2 : >>> iteration#2/component_tuple : 1.001 sec wall, 0.000 sec cpu, 1605762967 data_unsigned [laps: 1]
3 : >>> iteration#3/component_tuple : 1.004 sec wall, 0.000 sec cpu, 1605762968 data_unsigned [laps: 1]
4 : >>> iteration#4/component_tuple : 1.000 sec wall, 0.000 sec cpu, 1605762969 data_unsigned [laps: 1]
5 : >>> iteration#5/component_tuple : 1.005 sec wall, 1.020 sec cpu, 1605762970 data_unsigned [laps: 1]
6 : >>> iteration#6/component_tuple : 1.005 sec wall, 1.000 sec cpu, 1605762971 data_unsigned [laps: 1]
7 : >>> iteration#7/component_tuple : 1.005 sec wall, 1.000 sec cpu, 1605762972 data_unsigned [laps: 1]
8 : >>> iteration#8/component_tuple : 1.005 sec wall, 1.000 sec cpu, 1605762973 data_unsigned [laps: 1]
9 : >>> iteration#9/component_tuple : 1.005 sec wall, 1.000 sec cpu, 1605762974 data_unsigned [laps: 1]
LIGHTWEIGHT BUNDLE DATA ACCESS:
0 : iteration#0/lightweight_bundle started on Wed Nov 18 21:16:05 2020 and took 1001.63 milliseconds
1 : iteration#1/lightweight_bundle started on Wed Nov 18 21:16:06 2020 and took 1000.82 milliseconds
2 : iteration#2/lightweight_bundle started on Wed Nov 18 21:16:07 2020 and took 1001.08 milliseconds
3 : iteration#3/lightweight_bundle started on Wed Nov 18 21:16:08 2020 and took 1004.05 milliseconds
4 : iteration#4/lightweight_bundle started on Wed Nov 18 21:16:09 2020 and took 1000.31 milliseconds
5 : iteration#5/lightweight_bundle started on Wed Nov 18 21:16:10 2020 and took 1005.07 milliseconds
6 : iteration#6/lightweight_bundle started on Wed Nov 18 21:16:11 2020 and took 1005.03 milliseconds
7 : iteration#7/lightweight_bundle started on Wed Nov 18 21:16:12 2020 and took 1005.03 milliseconds
8 : iteration#8/lightweight_bundle started on Wed Nov 18 21:16:13 2020 and took 1005.04 milliseconds
9 : iteration#9/lightweight_bundle started on Wed Nov 18 21:16:14 2020 and took 1005.04 milliseconds
Finalizing...
[cpu]|0> Outputting 'timemory-ex-array-of-bundles-output/cpu.txt'...
|---------------------------------------------------------------------------------------------------------------------------|
| TOTAL CPU TIME SPENT IN BOTH USER- AND KERNEL-MODE |
|---------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | STDDEV | % SELF |
|---------------------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| >>> iteration#0/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#1/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#2/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#3/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#4/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#5/component_tuple | 1 | 0 | cpu | sec | 1.020 | 1.020 | 1.020 | 1.020 | 0.000 | 100.0 |
| >>> iteration#6/component_tuple | 1 | 0 | cpu | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
| >>> iteration#7/component_tuple | 1 | 0 | cpu | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
| >>> iteration#8/component_tuple | 1 | 0 | cpu | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
| >>> iteration#9/component_tuple | 1 | 0 | cpu | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
|---------------------------------------------------------------------------------------------------------------------------|
[wall]|0> Outputting 'timemory-ex-array-of-bundles-output/wall.txt'...
|---------------------------------------------------------------------------------------------------------------------------|
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|---------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | STDDEV | % SELF |
|---------------------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| >>> iteration#0/component_tuple | 1 | 0 | wall | sec | 1.001 | 1.001 | 1.001 | 1.001 | 0.000 | 100.0 |
| >>> iteration#1/component_tuple | 1 | 0 | wall | sec | 1.001 | 1.001 | 1.001 | 1.001 | 0.000 | 100.0 |
| >>> iteration#2/component_tuple | 1 | 0 | wall | sec | 1.001 | 1.001 | 1.001 | 1.001 | 0.000 | 100.0 |
| >>> iteration#3/component_tuple | 1 | 0 | wall | sec | 1.004 | 1.004 | 1.004 | 1.004 | 0.000 | 100.0 |
| >>> iteration#4/component_tuple | 1 | 0 | wall | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
| >>> iteration#5/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
| >>> iteration#6/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
| >>> iteration#7/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
| >>> iteration#8/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
| >>> iteration#9/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
|---------------------------------------------------------------------------------------------------------------------------|
[data_unsigned]|0> Outputting 'timemory-ex-array-of-bundles-output/data_unsigned.txt'...
|--------------------------------------------------------------------------------------------------------------------------------------------------|
| STORES UNSIGNED INTEGER DATA W.R.T. CALL-GRAPH |
|--------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | STDDEV | % SELF |
|---------------------------------|--------|--------|---------------|--------|------------|------------|------------|------------|--------|--------|
| >>> iteration#0/component_tuple | 1 | 0 | data_unsigned | | 1605762965 | 1605762965 | 1605762965 | 1605762965 | 0 | 100 |
| >>> iteration#1/component_tuple | 1 | 0 | data_unsigned | | 1605762966 | 1605762966 | 1605762966 | 1605762966 | 0 | 100 |
| >>> iteration#2/component_tuple | 1 | 0 | data_unsigned | | 1605762967 | 1605762967 | 1605762967 | 1605762967 | 0 | 100 |
| >>> iteration#3/component_tuple | 1 | 0 | data_unsigned | | 1605762968 | 1605762968 | 1605762968 | 1605762968 | 0 | 100 |
| >>> iteration#4/component_tuple | 1 | 0 | data_unsigned | | 1605762969 | 1605762969 | 1605762969 | 1605762969 | 0 | 100 |
| >>> iteration#5/component_tuple | 1 | 0 | data_unsigned | | 1605762970 | 1605762970 | 1605762970 | 1605762970 | 0 | 100 |
| >>> iteration#6/component_tuple | 1 | 0 | data_unsigned | | 1605762971 | 1605762971 | 1605762971 | 1605762971 | 0 | 100 |
| >>> iteration#7/component_tuple | 1 | 0 | data_unsigned | | 1605762972 | 1605762972 | 1605762972 | 1605762972 | 0 | 100 |
| >>> iteration#8/component_tuple | 1 | 0 | data_unsigned | | 1605762973 | 1605762973 | 1605762973 | 1605762973 | 0 | 100 |
| >>> iteration#9/component_tuple | 1 | 0 | data_unsigned | | 1605762974 | 1605762974 | 1605762974 | 1605762974 | 0 | 100 |
|--------------------------------------------------------------------------------------------------------------------------------------------------|
Done!
Side note: all of the bundles support serialization and operator+=, operator-=, etc., so with regard to the "how does timing change over the duration of a long-running process?" question, you could actually load a previously saved output in your application and just do std::cout << "DIFFERENCE: " << (new - old) << std::endl;
@pwm1234-sri If you tried to compile this with MSVC, you may have noticed it does not compile because of "missing" overloads to tim::math::minus(...), etc. This is a strange issue with MSVC because it should not be instantiating the functions that call them, since they are unused in the application. Per cppreference.com:
When code refers to a template in context that requires a completely defined type, or when the completeness of the type affects the code, and this particular type has not been explicitly instantiated, implicit instantiation occurs. For example, when an object of this type is constructed, but not when a pointer to this type is constructed.
This applies to the members of the class template: unless the member is used in the program, it is not instantiated, and does not require a definition.
So the solution is to change this:
// alias to a time-point type
using time_point_t = std::chrono::time_point<std::chrono::system_clock>;
// alias the templated data-tracker component for storing time-stamps
using time_point_tracker = data_tracker<time_point_t>;
to this:
// alias to a time-point type
using time_point_t = std::chrono::time_point<std::chrono::system_clock>;
// alias the templated data-tracker component for storing time-stamps
using time_point_tracker = data_tracker<time_point_t>;
namespace tim
{
namespace trait
{
// disable value-based call-stack tracking for time_point_tracker since time_point_t
// does not have overloads for math operations and because MSVC is instantiating
// templates which use them (even though they are unused in this application and therefore
// should not be instantiated)
template <>
struct uses_value_storage<time_point_tracker, time_point_t> : std::false_type
{};
} // namespace trait
} // namespace tim
(At least, it should work: this trait should disable the code paths where the MSVC instantiations appear to originate, but I won't know for certain until all the testing finishes.)
Thanks for the help. I have not tried this yet. I hope to move to the new head early next week and then experiment with all of the great advice you have given me. I will certainly give you feedback as to how things go.
Now that I have a working windows build of timemory, I have a few usage questions.
CV_TRACE_REGION_NEXT? See the Tracing macro list for details; it is simply a way to easily use regions for different steps.

So far I have only used the library API functions timemory_push_region and timemory_pop_region, but I believe this is more easily done using the timemory C++ API; I am not sure how to do this or where to look for guidance.

Another question related to how to transition from the OpenCV tracing macros to timemory: OpenCV makes a distinction between tracing a function and tracing a region. Is such a distinction helpful? If so, how do I make this distinction with timemory? (If it matters, my primary backend profiling/analysis tool is VTune.)
How do I use timemory to count user-defined events? For example, with an image processing application, I would like to count the number of images read, written, processed, displayed, copied, etc. Can timemory help me keep track of these events?
Can I use timemory to profile how timing changes over the duration of a long-running process? Again, for an image processing application that runs for 7 days, how do I get a time series of profiling measurements so I can see how performance changes over time? For example, do image reads take longer on day 7, hour 24 than they did on day 1, hour 1?
Documentation suggestion:
For my usage, I need to run multiple processes simultaneously, and each process needs a unique output path. Of course, this is possible with timemory, but it took me a little while to figure out how. Here is my summary code snippet. Is there a place in the documentation to put a tip like this? I would like to contribute if it is useful and I know where to put it.
Thanks again for your help!