Closed pwm1234-sri closed 2 years ago
Well, normally, adding a new component for another marker API is pretty straight-forward, e.g. the ones that already exist for Caliper, TAU, VTune, NVTX, etc., but the way OpenCV implements it makes it more difficult than normal. cv::utils::trace::details::Region has an extremely restricted interface: Region(const LocationStaticStorage& location), and that is it. LocationStaticStorage is quite complicated. See CV__TRACE_DEFINE_LOCATION_ and CV__TRACE_DEFINE_LOCATION_FN. Currently, doing this will require a heap allocation until I add a new bundle variant for handling components with this sort of non-default-ctor restriction.
Basically the component would look like this:
TIMEMORY_DECLARE_COMPONENT(opencv_trace)

#if !defined(USE_OPENCV)
TIMEMORY_DEFINE_CONCRETE_TRAIT(is_available, component::opencv_trace, false_type)
#endif

#if defined(USE_OPENCV)
namespace tim
{
namespace component
{
struct opencv_trace : public base<opencv_trace, void>
{
    using region_t   = CV_TRACE_NS::details::Region;
    using location_t = region_t::LocationStaticStorage;

    opencv_trace()                    = default;
    ~opencv_trace()                   = default;
    opencv_trace(const opencv_trace&) = delete;
    opencv_trace(opencv_trace&&)      = default;
    opencv_trace& operator=(const opencv_trace&) = delete;
    opencv_trace& operator=(opencv_trace&&) = default;

    static std::string label() { return "opencv_trace"; }
    static std::string description() { return "Provides OpenCV tracing API"; }

    void start(const location_t& _loc)
    {
        if(!m_data)
            m_data.reset(new region_t{ _loc });
    }
    void stop() { m_data.reset(); }

private:
    std::unique_ptr<region_t> m_data = {};
};
}  // namespace component
}  // namespace tim
#endif
And now you can pass a LocationStaticStorage to the start(...) function of the bundle. But you'll need something to reproduce the construction of LocationStaticStorage, like this:
namespace  // ensure instantiation is unique to each compiled file
{
template <size_t, char>  // template on line number and first char of name
const auto&
get_cv_trace_define_location_fn(const char _name[], const char _file[], int _line,
                                int _flags)
{
    // from CV__TRACE_DEFINE_LOCATION_
    static CV_TRACE_NS::details::Region::LocationExtraData* _data = nullptr;
    static const CV_TRACE_NS::details::Region::LocationStaticStorage _inst = {
        &(_data), _name, _file, _line, _flags
    };
    return _inst;
}
}  // namespace

#define CV__TRACE_INLINE_LOCATION_FN(name, flags)                                        \
    get_cv_trace_define_location_fn<__LINE__, name[0]>(                                  \
        name, CV_TRACE_FILENAME, __LINE__,                                               \
        (flags | CV_TRACE_NS::details::REGION_FLAG_FUNCTION))
and finally, usage would resemble this (which could be hidden behind a macro):
tim::component_tuple<opencv_trace> _obj{ "opencv_trace_example" };
_obj.start(CV__TRACE_INLINE_LOCATION_FN("opencv_trace_example", 0));
If you look at #define CV__TRACE_REGION(...) vs. #define CV__TRACE_OPENCV_FUNCTION() in core/utils/trace.hpp, you will see that CV__TRACE_OPENCV_FUNCTION is essentially just CV__TRACE_REGION(__FUNCSIG__).
There is a templated tim::component::data_tracker<Tp, Tag> where Tp is the data type you want to track and Tag is just an arbitrary struct to differentiate component X that tracks ints from component Y that tracks ints. You basically put it in a bundle and call the store(...) function.
It can be quite useful to create a dedicated auto_tuple bundle for just the components which track data, because you can append to the history in a single line: the auto_* bundles call start() and stop() at construction/destruction.
This is probably best demonstrated with an example.
using namespace tim::component;

// component_tuple requires explicit start; stop is optional (it will call
// stop if started when it gets destroyed)
using bundle_t = tim::component_tuple<opencv_trace, data_tracker_integer>;

// auto_tuple automatically calls start and stop
using tracker_t = tim::auto_tuple<data_tracker_integer>;

// auto_* bundles will write to stdout when destroyed
tim::settings::destructor_report() = true;

bundle_t _obj{ "opencv_test" };
_obj.start();
for(int i = 0; i < 10; ++i)
{
    long ans = fibonacci(10) + fibonacci(10 + (i % 3));
    // store a single iteration
    tracker_t{ TIMEMORY_JOIN("", _obj.key(), "#", i % 3) }.store(ans);
    // the join macro is like Python's "#".join(...)
    // accumulate into parent bundle and demonstrate using a lambda
    // specifying how to update the variable
    _obj.store([](long cur, long upd) { return cur + upd; }, ans);
}
The result:
>>> opencv_test#0 : 110 data_integer [laps: 1]
>>> opencv_test#1 : 144 data_integer [laps: 1]
>>> opencv_test#2 : 199 data_integer [laps: 1]
>>> opencv_test#0 : 110 data_integer [laps: 1]
>>> opencv_test#1 : 144 data_integer [laps: 1]
>>> opencv_test#2 : 199 data_integer [laps: 1]
>>> opencv_test#0 : 110 data_integer [laps: 1]
>>> opencv_test#1 : 144 data_integer [laps: 1]
>>> opencv_test#2 : 199 data_integer [laps: 1]
>>> opencv_test#0 : 110 data_integer [laps: 1]
[data_integer]|0> Outputting 'timemory-ex-derived-output/data_integer.json'...
[data_integer]|0> Outputting 'timemory-ex-derived-output/data_integer.tree.json'...
[data_integer]|0> Outputting 'timemory-ex-derived-output/data_integer.txt'...
|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| STORES SIGNED INTEGER DATA W.R.T. CALL-GRAPH |
|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | STDDEV | % SELF |
|---------------------|------------|------------|--------------|------------|------------|------------|------------|------------|------------|------------|
| >>> opencv_test | 1 | 0 | data_integer | | 1469 | 1469 | 1469 | 1469 | 0 | 0 |
| >>> |_opencv_test#0 | 4 | 1 | data_integer | | 440 | 110 | 110 | 110 | 0 | 100 |
| >>> |_opencv_test#1 | 3 | 1 | data_integer | | 432 | 144 | 144 | 144 | 0 | 100 |
| >>> |_opencv_test#2 | 3 | 1 | data_integer | | 597 | 199 | 199 | 199 | 0 | 100 |
|---------------------------------------------------------------------------------------------------------------------------------------------------------|
So I am currently working on getting support for timemory into a python project called Hatchet which essentially takes the *.tree.json
output files and converts them into a Pandas dataframe for extended analysis. Also, in the longer term, we will be developing a framework for performance unit testing which will automate historical comparisons but, at present, the roll-your-own-solution would be to just import multiple jsons into Python and do a comparison there. The regular .json
files (w/o .tree
) provide the results in a flat JSON array so it is relatively easy to traverse (the json trees require recursion to process).
Also, you can specify TIMEMORY_INPUT_PATH=/path/to/some/folder/of/results
+ TIMEMORY_DIFF_OUTPUT=ON
in the environment (or the corresponding tim::settings::input_path()
, etc.) and when the results are finalized, timemory will search that input folder, try to find a corresponding input, and then generate additional *.diff.*
files in the output folder with the difference between any matches it can find.
One option is to prefix the files with some key (such as the PID):
tim::settings::output_prefix() = std::to_string(tim::process::get_id()) + "-";
Another option is to specify TIMEMORY_TIME_OUTPUT=ON in the env (tim::settings::time_output()) and that will create subdirectories which are time-stamped. This time-stamp gets fixed for the process the first time settings::get_global_output_prefix() is called while time-output is enabled. Also, there is a time_format() in settings which allows you to customize the time-stamping format using strftime format specifiers.
You might find it helpful to peruse some of the unit-tests, such as the one for data_tracker.
Thank you for your detailed answer.
Regarding (2), counting user events, I will look at data_tracker and the unit test. I will also look over some of the other unit tests and examples. Are there other examples or tutorials describing how to use the timemory api to provide finer control over the instrumentation?
Regarding (3), profiling a long-running process: simplistically, I think all I am looking for is to periodically restart the timemory system within the running process. For example, rather than waiting for the end of a 24-hour-long process, I call finalize to output the current results and then re-initialize to start a new collection? Is it possible to call finalize and init multiple times within a single process? Of course, I will have to change the output path so the previous results are not overwritten. Is there a better way to restart timemory?
Regarding (1), I think you missed the point of my question about OpenCV's tracing api. (Thank you. You did give me a lot of helpful information, and I will study it carefully as I learn more about timemory.) I am not trying to use timemory to call opencv's tracing function, I simply want to use timemory push/pop region functionality in the same, easy way that CV_TRACE_REGION_NEXT
works. That is, rather than having to do the two step (1) pop the current region and (2) push a new region, trace region next does the two steps in one.
My guess is that this would be more easily achieved using the timemory C++ API than its library API calls to timemory_push_region and timemory_pop_region, but I am not familiar with the timemory C++ API. What is a good example to study? The region-next functionality needs only a little more than a simplistic RAII class that calls push-region in the constructor and pop-region in the destructor; but it additionally needs a next method that calls pop and then push.
At this point, the C++ template API doesn't have an official tutorial. There are some docs on the API, and the examples and tests, but that's currently all that's available. Unfortunately, right as we were introducing timemory via its first presentation at the International Supercomputing Conference in Germany, the pandemic hit, and it's quite difficult to give hands-on tutorials virtually (which is when those materials would be created). Once travel resumes, I expect quite a bit of material will be created for this, but without the feedback these events provide on what people want/need to know, creating those materials is a bit of a guessing-game.
First off, do not call timemory_finalize() and then try to use timemory after that point. That routine is designed to delete a lot of things and segfaults are basically a certainty. However, you can pretty much dump the current state of the call-stack storage at any point for any thread on any process any time you want, as long as you are fine with the side-effects (these can be suppressed via settings::stack_clearing() = false). And using this approach, you can have full control of where the files get written. The side-effects are:

- storage<Tp>::instance()->get() on the master thread will merge in the child threads and clear their storage
- storage<Tp>::instance()->dmp_get() (distributed memory parallelism, i.e. MPI or UPC++) on rank zero will return the storage from all processes

The best references for how to dump outputs are in the source/timemory/operations/types/finalize folder. The forward declarations (and a few definitions) are here. You can see examples of how they are invoked in source/timemory/storage/definition.hpp (and further down: here from 753 to 926)
If you aren't used to template meta-programming, you may find it difficult to create a generic function which does some operation on all components -- if that happens, just let me know and I can provide you with something very easily. The python bindings probably do what you are looking for, e.g.:
import timemory

def write_json(fname, specific_types=[], hierarchy=False):
    data = timemory.get(hierarchy=hierarchy, components=specific_types)
    with open(fname, "w") as f:
        f.write(data)
And those bindings are here from 99 to 183 and would be invoked like this (minus the stuff specific to python in _as_json
):
auto _as_json_classic = [](const std::set<int>& _types) {
    using tuple_type = tim::convert_t<tim::available_types_t, std::tuple<>>;
    auto json_str    = manager_t::get_storage<tuple_type>::serialize(_types);
    return json_str;
};

auto _as_json_hierarchy = [](const std::set<int>& _types) {
    std::stringstream ss;
    {
        using policy_type = tim::policy::output_archive_t<tim::manager>;
        auto oa           = policy_type::get(ss);
        oa->setNextName("timemory");
        oa->startNode();
        get_json(*oa, _types, std::make_index_sequence<TIMEMORY_COMPONENTS_END>{});
        oa->finishNode();
    }
    return ss.str();
};

auto _as_json = [_as_json_classic, _as_json_hierarchy](bool hierarchy,
                                                       py::list pytypes) {
    auto types = pytim::get_enum_set(pytypes);  // convert py::list into std::set<int>
    auto json_str = (hierarchy) ? _as_json_hierarchy(types) : _as_json_classic(types);
    auto json_module = py::module::import("json");
    return json_module.attr("loads")(json_str);
};
Ah yes indeed I did miss the point of (1). My gut feels like providing this functionality would be problematic in terms of introducing unnecessary overhead. Let me think about it more and get back to you.
I understand that conferences and papers are a big driver of generating documentation. If you want a reviewer (or trial attendee), I will volunteer!
For what it is worth, I would like to see a timemory version of what is in the opencv tracing example, https://github.com/alalek/opencv/blob/trace/samples/cpp/application_trace.cpp. I would be happy to contribute. And from the perspective of this demo, I think the benefits of the simplicity of the trace-next capability far outweigh the performance overhead. Of course I understand that you do not want to have a deliberately inefficient api call; but you also have to balance that with the maxim of making simple things simple.
I think this application_trace example would also be a good demonstration of writing the output while processing. If there are a million frames, then one might want to write the output every hundred thousand frames to make sure the processing performs as expected over the range of frames. (And, so you get intermediate results in case something happens and the application code crashes before finalizing - a very real possibility in many cases.)
Have you seen the examples in ex-minimal? I probably do need to address the examples -- even I look at them sometimes and find that some of them are unnecessarily complex.
In early versions, they were also used for testing; e.g. ex-cxx-basic, ex-cxx-overhead, and ex-cuda-event still have the old RUN_TEST(...) stuff that was hacked together before I migrated to Google-Test.
In the unit-tests, I frequently dump the storage mid-run to verify things are being collected as expected, so there isn't any real concern about the functionality working. However, I have done a poor job of making it clear which examples/tests showcase which features, and I expect my former intern @mhaseeb123 (who generated almost all of the README.md files in the examples) probably had that issue too.
W.r.t. the trace-next thing, I am not sure I quite see how introducing a new macro is better than the unnecessary curly braces comment.
The OpenCV implementation is much more simplistic than timemory and doesn't have to deal with things like this:
using foo_t = auto_tuple<wall_clock, cpu_clock>;
using bar_t = auto_tuple<wall_clock, peak_rss>;

TIMEMORY_BLANK_MARKER(foo_t, "A");
// ... something ...
TIMEMORY_BLANK_MARKER_NEXT(bar_t, "B");
// ... something ...
would the theoretical TIMEMORY_BLANK_MARKER_NEXT close the scope of the wall_clock and cpu_clock components in foo_t? Or just close wall_clock but not close the scope of the cpu_clock component?
Also, their implementation only uses that single Region class, so it is relatively trivial for them to just keep a Region* "parent" record (or however they do it), whereas the modularity of timemory makes parent-child relationships more complex to track. The only place it is really tracked is in the storage class, and storage is optional: you can use wall_clock directly or lightweight_tuple<wall_clock> and there is absolutely zero implicit interaction with storage. Further complicating the parent-instance issue is that timemory fully supports asynchronous usage, and the "*_NEXT" things could cause all sorts of issues with usage like this:
TIMEMORY_BLANK_MARKER(foo_t, "A");
auto _func = [=]()
{
TIMEMORY_BLANK_MARKER_NEXT(foo_t, "B");
// ... etc. ...
};
auto fut_0 = std::async(std::launch::async, _func);
auto fut_1 = std::async(std::launch::deferred, _func);
TIMEMORY_BLANK_MARKER_NEXT(foo_t, "C");
fut_0.wait();
fut_1.wait();
In some instances, "C" will close "A", but the first async might close "A" if it is launched immediately. Or the first async might be launched in another thread and would then be closing the scope of something in another thread. Or you might have a data race to close "A", etc.
I think OpenCV doesn't have to worry about this the way timemory does because they have much more complete control over how threads are used internally, and the framework itself probably imposes restrictions about running OpenCV within an external threading system (e.g. set OpenCV to serial within OpenMP, etc.).
Thanks again for your detailed explanation. I see that the simplistic opencv approach may not be appropriate for the general-purpose timemory api. But for simple application code, I usually have a top-level function very much like the main function in the [opencv trace example](https://github.com/alalek/opencv/blob/006966e6295f44e48ade066d4eb56826439680fb/samples/cpp/application_trace.cpp#L22). Most of the time that function is:
which very closely matches the opencv example. Putting trace next inside the main processing loop and the function of pipeline processing calls is appealing. Avoiding the unnecessary curly braces just makes it easy to patch in performance instrumentation calls without having to change anything else. Using curly braces for scope between steps means you have to restructure any local variables that may be shared between steps. Or having to manually add pop/push pairs (which is not a big deal, just not as simple as trace-next).
And I am belaboring this point (for this last time) because we often use opencv and may have code where people will use the opencv tracing capability. I want to replace or re-implement their usage of opencv tracing with timemory. For simple instrumentation needs, it would be nice to simply redefine the opencv macros so they are implemented using timemory. But I will now drop this topic, as you have helped me greatly to better understand how timemory differs from opencv tracing. I hope that I have at least helped contribute to future documentation or conference tutorial material.
I have looked at ex-minimal, but I have not studied it enough to reasonably understand it. I think my next timemory steps are to:
If you have a suggestion for how I can at least contribute some documentation material, I will certainly try to help at least keep track of all the great feedback you have given to me. I do appreciate your time!
I think the solution for the opencv-tracing would be create a new component bundle type. Theoretically, it would look like this (this is mostly for me while the idea is in my head):
namespace quirk
{
struct stop_last_instance
{};
}  // namespace quirk

namespace internal
{
template <typename Tp>
auto*&
last_instance()
{
    static thread_local Tp* _instance = nullptr;
    return _instance;
}
}  // namespace internal

template <typename... Types>
struct history_bundle : component_bundle<project::timemory, Types...>
{
    static constexpr bool has_stop_last_instance =
        is_one_of<quirk::stop_last_instance, type_list<Types...>>::value;
    static constexpr bool has_explicit_start =
        is_one_of<quirk::explicit_start, type_list<Types...>>::value;

    using base_type = component_bundle<project::timemory, Types...>;
    using this_type = history_bundle<Types...>;
    using next_type = conditional_t<has_stop_last_instance, this_type,
                                    history_bundle<Types..., quirk::stop_last_instance>>;
    using auto_type = this_type;

    template <typename... Args>
    history_bundle(Args&&... args)
    : base_type(std::forward<Args>(args)...)
    {
        auto*& _last = internal::last_instance<next_type>();
        IF_CONSTEXPR(has_stop_last_instance)
        {
            if(_last)
                _last->stop();
        }
        _last = this;
        IF_CONSTEXPR(!has_explicit_start) { this->start(); }
    }

    ~history_bundle()
    {
        auto*& _last = internal::last_instance<next_type>();
        if(_last == this)
            _last = nullptr;
    }
};

// todo: add a safer way to check there is a 'next_type' alias
// and use a static_assert with a clear error message
#define TIMEMORY_NEXT_TYPE(TYPE) typename TYPE::next_type

#define TIMEMORY_BLANK_MARKER_NEXT(TYPE, ...)                                            \
    _TIM_STATIC_SRC_LOCATION(blank, __VA_ARGS__);                                        \
    TIMEMORY_NEXT_TYPE(TYPE)                                                             \
    _TIM_VARIABLE(__LINE__)(TIMEMORY_CAPTURE_ARGS(__VA_ARGS__))
Take a gander at the macros in source/timemory/variadic/macros.hpp and let me know which ones you would like to have _NEXT variants for, just in case I don't want to make this available for all of them.
> If you have a suggestion for how I can at least contribute some documentation material
I was probably going to add most/all of these questions + responses to FAQ Wiki but I think if you could just keep a list of notable examples of features in the examples/tests that you come across would be tremendously helpful in creating a "Examples of Features Cross-Reference" page in the wiki.
E.g.
storage<T>::instance()->mpi_get()
...
At this time I do not believe I understand the API well enough to give you an informed opinion on which macros I would like to have _NEXT variants for.
Given the subtleties of the richer API of timemory over the opencv tracing api, I am going to defer the next topic and concentrate on the topics of 1) writing output while running, and 2) data tracker. These are more relevant to my current needs.
I wrote an example for you that I will include in the next PR which:

- data_tracker_unsigned (i.e. data_tracker<size_t>)
- std::chrono::time_point<std::chrono::system_clock> (i.e. data_tracker<time_point>)
- ... += itself to that address.

The storage class is just implementation sugar -- components don't rely on it for any functionality whatsoever. As you will see in the example, you can create a bundle, store each bundle instance in a regular STL vector and then manipulate it later -- all the storage class really does is provide a graph instead of a vector.

cmake_minimum_required(VERSION 3.11 FATAL_ERROR)
project(timemory-Array-of-Bundles-Example LANGUAGES CXX)
# set the name of the target you want timemory to add include-directories,
# libraries, etc. to. Just link this library to your target and it will
# include all the necessary build data
set(timemory_FIND_COMPONENTS_INTERFACE timemory-array-of-bundles-example)
# find the timemory package and specify you want the C++ library (cxx)
find_package(timemory REQUIRED
COMPONENTS cxx)
# create your executable
add_executable(ex_array_of_bundles
ex_array_of_bundles.cpp)
# link your executable to the interface target created by timemory
target_link_libraries(ex_array_of_bundles
PRIVATE timemory-array-of-bundles-example)
# optionally install this executable
install(TARGETS ex_array_of_bundles DESTINATION bin OPTIONAL)
#include "timemory/timemory.hpp"
#include <chrono>
#include <iostream>
// Example demonstrating:
// 1. Differences between bundles
// 2. How to control the output
// 3. Creating custom data trackers
using tim::component::cpu_clock;
using tim::component::data_tracker;
using tim::component::data_tracker_unsigned;
using tim::component::wall_clock;
// alias to a time-point type
using time_point_t = std::chrono::time_point<std::chrono::system_clock>;
// alias the templated data-tracker component for storing time-stamps
using time_point_tracker = data_tracker<time_point_t>;
// TODO: remove the need for this
// Below is due to a bug in data-tracker
namespace std
{
time_point_t
operator/(const time_point_t& _tp, const int64_t&);
//
ostream&
operator<<(ostream& _os, const time_point_t&);
} // namespace std
int
main(int argc, char** argv)
{
// configure these default settings before timemory_init to allow for
// environment overrides
tim::settings::cout_output() = true; // print to stdout
tim::settings::text_output() = true; // write text file
tim::settings::json_output() = false; // don't write flat json files
tim::settings::tree_output() = false; // don't write hierarchical json files
tim::settings::time_output() = false; // time-stamp the output folder
tim::settings::flamegraph_output() = false; // don't write flamegraph outputs
//--------------------------------------------------------------------------------//
// THINGS TO TRY:
// - Use timemory-avail command-line tool with -S option to view available settings
// - Add -d option to above to see descriptions of the settings
//--------------------------------------------------------------------------------//
// initialize timemory
tim::timemory_init(argc, argv);
//---------------------------------------------------------------------//
// Demonstrate the difference b/t these two bundles
//---------------------------------------------------------------------//
// this bundle does not implicitly store persistent data. All data will
// be local to the components in this bundle unless "push()" and "pop()"
// are explicitly called. Thus, this bundle is useful if you want to
// handle your own data-storage or do one initial push, and then take
// several short measurements before doing one final pop which updates
// the global call-stack stack with the final value.
using lbundle_t = tim::lightweight_tuple<wall_clock, cpu_clock, data_tracker_unsigned,
time_point_tracker>;
// note: time_point_tracker can be used in lightweight_tuple as long as
// it does not interact with storage. Interacting with storage requires
// some overloads and specializations due to the fact that time_point_t
// doesn't support operator such as +=, -=, <<, and so on. It is
// safe to remove from above if you want to experiment with pushing and
// popping (nothing else below needs to be modified beyond that)
// this bundle will implicitly "push" to persistent storage when start is
// called. A "push" operation makes a bookmark in the global call-stack.
// When stop is called, this bundle will implicitly "pop" itself off the
// global call-stack. As part of the "pop" operation, this component will
// add its internal measurement data to the component instance in the global
// call-stack and then reset its values to zero.
using cbundle_t = tim::component_tuple<wall_clock, cpu_clock, data_tracker_unsigned>;
//--------------------------------------------------------------------------------//
// THINGS TO TRY:
// - add "cpu_util" to one or both of the aliases above
// - add "user_global_bundle" to one or both of the alias above
// - set environment variable TIMEMORY_GLOBAL_COMPONENTS to "system_clock,
// user_clock"
// and run the application
// - user_global_bundle::configure<system_clock>()
//--------------------------------------------------------------------------------//
// standard container for storing each iteration instance
std::vector<lbundle_t> _ldata{};
std::vector<cbundle_t> _cdata{};
// number of iterations
int nitr = 5;
if(argc > 1)
nitr = std::stoi(argv[1]);
int ndump = 1;
if(argc > 2)
ndump = std::max<int>(1, std::stoi(argv[2]));
for(int i = 0; i < nitr; ++i)
{
std::cout << "Performing iteration " << i << "..." << std::endl;
// the base label for the measurements
std::string _label = TIMEMORY_JOIN("#", "iteration", i);
// create an instance for this loop and add a custom suffix for the
// different types. When the application terminates, you will see
// the "component_tuple" entries in the final report only.
_ldata.emplace_back(lbundle_t{ _label + "/lightweight_bundle" });
_cdata.emplace_back(cbundle_t{ _label + "/component_tuple" });
//----------------------------------------------------------------------------//
// THINGS TO TRY:
// _ldata.back().push();
//----------------------------------------------------------------------------//
// start the measurements
_ldata.back().start();
_cdata.back().start();
// get a time-stamp
time_point_t _now = std::chrono::system_clock::now();
// store the time-stamp in time_point_tracker
_ldata.back().store(_now);
// convert the time-stamp to the number of seconds since the epoch
size_t _epoch =
std::chrono::duration_cast<std::chrono::seconds>(_now.time_since_epoch())
.count();
// store the epoch value in data_tracker_unsigned
_ldata.back().store(_epoch);
_cdata.back().store(_epoch);
// sleep for one second for half the iteration
if(i < nitr / 2)
std::this_thread::sleep_for(std::chrono::seconds(1));
else // consume cpu-cycles for the other half
while(std::chrono::system_clock::now() <
(_now + std::chrono::milliseconds(1005)))
;
// stop the measurements
_ldata.back().stop();
_cdata.back().stop();
//----------------------------------------------------------------------------//
// THINGS TO TRY:
// _ldata.back().pop();
//----------------------------------------------------------------------------//
// skip based on modulus
if(i % ndump != 0)
continue;
// write a json serialization of the global call-stack
std::string outfname = tim::settings::compose_output_filename(
TIMEMORY_JOIN('-', "iteration", i, "results"), ".json");
std::ofstream ofs(outfname);
if(ofs)
{
std::cout << "--> Writing " << outfname << "..." << std::endl;
ofs << tim::manager::get_storage<tim::available_types_t>::serialize();
}
}
// write a json serialization of the global call-stack
if(nitr % ndump == 0)
{
// basename for the file that will be written
std::string _bname = TIMEMORY_JOIN('-', "iteration", nitr, "results");
// compose_output_filename appends the output_path and output_prefix
// in settings to the first parameter, appends a time-stamped directory
// if that is enabled, and so on. Then, it just tacks on the extension provided.
std::string outfname = tim::settings::compose_output_filename(_bname, ".json");
std::ofstream ofs(outfname);
if(ofs)
{
std::cout << "--> Writing " << outfname << "..." << std::endl;
// tim::available_types_t is just a tim::type_list<...> of all the
// components that were not marked as unavailable (and thus
// definitions are provided for them). This manager function opens a
// cereal::JSONOutputArchive on a stringstream and does a
// compile-time loop over all of those types. For each type, it
// checks if the storage has any data and if it does, it serializes
// the call-graph for that type into the archive. Once the loop
// is completed, it converts the sstream into a string and returns
// the string.
ofs << tim::manager::get_storage<tim::available_types_t>::serialize();
//------------------------------------------------------------------------//
// THINGS TO TRY:
// - replace available_types_t with specific components, e.g. wall_clock
//------------------------------------------------------------------------//
}
}
// use the stream operator of the bundle to print out data
size_t _cnt = 0;
std::cout << "\nLIGHTWEIGHT BUNDLE DATA:" << std::endl;
for(auto& itr : _ldata)
std::cout << _cnt++ << " : " << itr << std::endl;
_cnt = 0;
std::cout << "\nCOMPONENT TUPLE DATA:" << std::endl;
for(auto& itr : _cdata)
std::cout << _cnt++ << " : " << itr << std::endl;
//--------------------------------------------------------------------------------//
// THINGS TO TRY:
// - add this to top of file after the include statement:
//
// TIMEMORY_DECLARE_COMPONENT(fake_component)
// TIMEMORY_DEFINE_CONCRETE_TRAIT(is_available, component::fake_component,
// false_type)
//
// - add fake_component to the bundles and then compile and run.
// - you will not see anything related to fake_component
// - a component marked as unavailable is never instantiated in the template -> no
// definition of the type is ever needed
//
// - change "false_type" to "true_type" and then try to compile (hint: will fail)
// - when a component is available, a definition will be needed
//--------------------------------------------------------------------------------//
// demonstrate data access
_cnt = 0;
std::cout << "\nLIGHTWEIGHT BUNDLE DATA ACCESS:" << std::endl;
for(auto& itr : _ldata)
{
// get the instance of a wall-clock component in the bundle
wall_clock* _wc = itr.get<wall_clock>();
// since wall-clock was part of a "tuple" type, this will always
// be a valid pointer as long as tim::is_available<wall_clock>::value
// is true at compile-time. If _ldata was a "list" bundle, this
// would be a null pointer unless you previously called
// _ldata.initialize<wall_clock>()
if(!_wc)
continue;
// the "get()" member function returns the accumulated measurement value
double _elapsed = _wc->get();
// units are static properties of type so convert this measurement to milliseconds
_elapsed *= tim::units::msec / wall_clock::unit();
// get the instance of the unsigned data-tracker component
time_point_tracker* _tp = itr.get<time_point_tracker>();
if(!_tp)
continue;
// get the time-point value
time_point_t _point = _tp->get();
// convert to time_t
std::time_t _today = std::chrono::system_clock::to_time_t(_point);
std::string _ctime = std::ctime(&_today);
// report when the measurement was taken and how long it took.
std::cout << _cnt++ << " : " << itr.key() << " started on "
<< _ctime.substr(0, _ctime.find('\n')) << " and took " << _elapsed
<< " milliseconds" << std::endl;
}
puts("\nFinalizing...\n");
// generate final reports, cleanup memory
tim::timemory_finalize();
//----------------------------------------------------------------------------//
// THINGS TO TRY:
// - view the text files
// - enable json output and import the data into python
//----------------------------------------------------------------------------//
puts("Done!");
}
// TODO: remove the need for this
// Below is due to a bug in data-tracker
namespace std
{
time_point_t
operator/(const time_point_t& _tp, const int64_t&)
{
return _tp;
}
//
std::ostream&
operator<<(std::ostream& _os, const time_point_t&)
{
return _os;
}
} // namespace std
$ ./ex_array_of_bundles 10 2
Performing iteration 0...
--> Writing timemory-ex-array-of-bundles-output/iteration-0-results.json...
Performing iteration 1...
Performing iteration 2...
--> Writing timemory-ex-array-of-bundles-output/iteration-2-results.json...
Performing iteration 3...
Performing iteration 4...
--> Writing timemory-ex-array-of-bundles-output/iteration-4-results.json...
Performing iteration 5...
Performing iteration 6...
--> Writing timemory-ex-array-of-bundles-output/iteration-6-results.json...
Performing iteration 7...
Performing iteration 8...
--> Writing timemory-ex-array-of-bundles-output/iteration-8-results.json...
Performing iteration 9...
--> Writing timemory-ex-array-of-bundles-output/iteration-10-results.json...
LIGHTWEIGHT BUNDLE DATA:
0 : >>> iteration#0/lightweight_bundle : 1.002 sec wall, 0.000 sec cpu, 1605762965 data_unsigned, [laps: 1]
1 : >>> iteration#1/lightweight_bundle : 1.001 sec wall, 0.000 sec cpu, 1605762966 data_unsigned, [laps: 1]
2 : >>> iteration#2/lightweight_bundle : 1.001 sec wall, 0.000 sec cpu, 1605762967 data_unsigned, [laps: 1]
3 : >>> iteration#3/lightweight_bundle : 1.004 sec wall, 0.000 sec cpu, 1605762968 data_unsigned, [laps: 1]
4 : >>> iteration#4/lightweight_bundle : 1.000 sec wall, 0.000 sec cpu, 1605762969 data_unsigned, [laps: 1]
5 : >>> iteration#5/lightweight_bundle : 1.005 sec wall, 1.020 sec cpu, 1605762970 data_unsigned, [laps: 1]
6 : >>> iteration#6/lightweight_bundle : 1.005 sec wall, 1.000 sec cpu, 1605762971 data_unsigned, [laps: 1]
7 : >>> iteration#7/lightweight_bundle : 1.005 sec wall, 1.000 sec cpu, 1605762972 data_unsigned, [laps: 1]
8 : >>> iteration#8/lightweight_bundle : 1.005 sec wall, 1.000 sec cpu, 1605762973 data_unsigned, [laps: 1]
9 : >>> iteration#9/lightweight_bundle : 1.005 sec wall, 1.000 sec cpu, 1605762974 data_unsigned, [laps: 1]
COMPONENT TUPLE DATA:
0 : >>> iteration#0/component_tuple : 1.001 sec wall, 0.000 sec cpu, 1605762965 data_unsigned [laps: 1]
1 : >>> iteration#1/component_tuple : 1.001 sec wall, 0.000 sec cpu, 1605762966 data_unsigned [laps: 1]
2 : >>> iteration#2/component_tuple : 1.001 sec wall, 0.000 sec cpu, 1605762967 data_unsigned [laps: 1]
3 : >>> iteration#3/component_tuple : 1.004 sec wall, 0.000 sec cpu, 1605762968 data_unsigned [laps: 1]
4 : >>> iteration#4/component_tuple : 1.000 sec wall, 0.000 sec cpu, 1605762969 data_unsigned [laps: 1]
5 : >>> iteration#5/component_tuple : 1.005 sec wall, 1.020 sec cpu, 1605762970 data_unsigned [laps: 1]
6 : >>> iteration#6/component_tuple : 1.005 sec wall, 1.000 sec cpu, 1605762971 data_unsigned [laps: 1]
7 : >>> iteration#7/component_tuple : 1.005 sec wall, 1.000 sec cpu, 1605762972 data_unsigned [laps: 1]
8 : >>> iteration#8/component_tuple : 1.005 sec wall, 1.000 sec cpu, 1605762973 data_unsigned [laps: 1]
9 : >>> iteration#9/component_tuple : 1.005 sec wall, 1.000 sec cpu, 1605762974 data_unsigned [laps: 1]
LIGHTWEIGHT BUNDLE DATA ACCESS:
0 : iteration#0/lightweight_bundle started on Wed Nov 18 21:16:05 2020 and took 1001.63 milliseconds
1 : iteration#1/lightweight_bundle started on Wed Nov 18 21:16:06 2020 and took 1000.82 milliseconds
2 : iteration#2/lightweight_bundle started on Wed Nov 18 21:16:07 2020 and took 1001.08 milliseconds
3 : iteration#3/lightweight_bundle started on Wed Nov 18 21:16:08 2020 and took 1004.05 milliseconds
4 : iteration#4/lightweight_bundle started on Wed Nov 18 21:16:09 2020 and took 1000.31 milliseconds
5 : iteration#5/lightweight_bundle started on Wed Nov 18 21:16:10 2020 and took 1005.07 milliseconds
6 : iteration#6/lightweight_bundle started on Wed Nov 18 21:16:11 2020 and took 1005.03 milliseconds
7 : iteration#7/lightweight_bundle started on Wed Nov 18 21:16:12 2020 and took 1005.03 milliseconds
8 : iteration#8/lightweight_bundle started on Wed Nov 18 21:16:13 2020 and took 1005.04 milliseconds
9 : iteration#9/lightweight_bundle started on Wed Nov 18 21:16:14 2020 and took 1005.04 milliseconds
Finalizing...
[cpu]|0> Outputting 'timemory-ex-array-of-bundles-output/cpu.txt'...
|---------------------------------------------------------------------------------------------------------------------------|
| TOTAL CPU TIME SPENT IN BOTH USER- AND KERNEL-MODE |
|---------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | STDDEV | % SELF |
|---------------------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| >>> iteration#0/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#1/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#2/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#3/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#4/component_tuple | 1 | 0 | cpu | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 |
| >>> iteration#5/component_tuple | 1 | 0 | cpu | sec | 1.020 | 1.020 | 1.020 | 1.020 | 0.000 | 100.0 |
| >>> iteration#6/component_tuple | 1 | 0 | cpu | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
| >>> iteration#7/component_tuple | 1 | 0 | cpu | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
| >>> iteration#8/component_tuple | 1 | 0 | cpu | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
| >>> iteration#9/component_tuple | 1 | 0 | cpu | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
|---------------------------------------------------------------------------------------------------------------------------|
[wall]|0> Outputting 'timemory-ex-array-of-bundles-output/wall.txt'...
|---------------------------------------------------------------------------------------------------------------------------|
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|---------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | STDDEV | % SELF |
|---------------------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| >>> iteration#0/component_tuple | 1 | 0 | wall | sec | 1.001 | 1.001 | 1.001 | 1.001 | 0.000 | 100.0 |
| >>> iteration#1/component_tuple | 1 | 0 | wall | sec | 1.001 | 1.001 | 1.001 | 1.001 | 0.000 | 100.0 |
| >>> iteration#2/component_tuple | 1 | 0 | wall | sec | 1.001 | 1.001 | 1.001 | 1.001 | 0.000 | 100.0 |
| >>> iteration#3/component_tuple | 1 | 0 | wall | sec | 1.004 | 1.004 | 1.004 | 1.004 | 0.000 | 100.0 |
| >>> iteration#4/component_tuple | 1 | 0 | wall | sec | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 100.0 |
| >>> iteration#5/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
| >>> iteration#6/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
| >>> iteration#7/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
| >>> iteration#8/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
| >>> iteration#9/component_tuple | 1 | 0 | wall | sec | 1.005 | 1.005 | 1.005 | 1.005 | 0.000 | 100.0 |
|---------------------------------------------------------------------------------------------------------------------------|
[data_unsigned]|0> Outputting 'timemory-ex-array-of-bundles-output/data_unsigned.txt'...
|--------------------------------------------------------------------------------------------------------------------------------------------------|
| STORES UNSIGNED INTEGER DATA W.R.T. CALL-GRAPH |
|--------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | STDDEV | % SELF |
|---------------------------------|--------|--------|---------------|--------|------------|------------|------------|------------|--------|--------|
| >>> iteration#0/component_tuple | 1 | 0 | data_unsigned | | 1605762965 | 1605762965 | 1605762965 | 1605762965 | 0 | 100 |
| >>> iteration#1/component_tuple | 1 | 0 | data_unsigned | | 1605762966 | 1605762966 | 1605762966 | 1605762966 | 0 | 100 |
| >>> iteration#2/component_tuple | 1 | 0 | data_unsigned | | 1605762967 | 1605762967 | 1605762967 | 1605762967 | 0 | 100 |
| >>> iteration#3/component_tuple | 1 | 0 | data_unsigned | | 1605762968 | 1605762968 | 1605762968 | 1605762968 | 0 | 100 |
| >>> iteration#4/component_tuple | 1 | 0 | data_unsigned | | 1605762969 | 1605762969 | 1605762969 | 1605762969 | 0 | 100 |
| >>> iteration#5/component_tuple | 1 | 0 | data_unsigned | | 1605762970 | 1605762970 | 1605762970 | 1605762970 | 0 | 100 |
| >>> iteration#6/component_tuple | 1 | 0 | data_unsigned | | 1605762971 | 1605762971 | 1605762971 | 1605762971 | 0 | 100 |
| >>> iteration#7/component_tuple | 1 | 0 | data_unsigned | | 1605762972 | 1605762972 | 1605762972 | 1605762972 | 0 | 100 |
| >>> iteration#8/component_tuple | 1 | 0 | data_unsigned | | 1605762973 | 1605762973 | 1605762973 | 1605762973 | 0 | 100 |
| >>> iteration#9/component_tuple | 1 | 0 | data_unsigned | | 1605762974 | 1605762974 | 1605762974 | 1605762974 | 0 | 100 |
|--------------------------------------------------------------------------------------------------------------------------------------------------|
Done!
Side note: all of the bundles support serialization and operator+=, operator-=, etc., so with regard to the "how does timing change over the duration of a long-running process?" question, you could actually load a previously saved output in your application and just do std::cout << "DIFFERENCE: " << (new - old) << std::endl;
@pwm1234-sri If you tried to compile this with MSVC, you may have noticed it does not compile because of "missing" overloads to tim::math::minus(...), etc. This is a strange issue with MSVC because it should not be instantiating the functions that call them, since they are unused in the application. Per cppreference.com:
When code refers to a template in context that requires a completely defined type, or when the completeness of the type affects the code, and this particular type has not been explicitly instantiated, implicit instantiation occurs. For example, when an object of this type is constructed, but not when a pointer to this type is constructed.
This applies to the members of the class template: unless the member is used in the program, it is not instantiated, and does not require a definition.
So the solution is to change this:
// alias to a time-point type
using time_point_t = std::chrono::time_point<std::chrono::system_clock>;
// alias the templated data-tracker component for storing time-stamps
using time_point_tracker = data_tracker<time_point_t>;
to this:
// alias to a time-point type
using time_point_t = std::chrono::time_point<std::chrono::system_clock>;
// alias the templated data-tracker component for storing time-stamps
using time_point_tracker = data_tracker<time_point_t>;
namespace tim
{
namespace trait
{
// disable value-based call-stack tracking for time_point_tracker since time_point_t
// does not have overloads for math operations and because MSVC is instantiating
// templates which use them (even though they are unused in this application and therefore
// should not be instantiated)
template <>
struct uses_value_storage<time_point_tracker, time_point_t> : std::false_type
{};
} // namespace trait
} // namespace tim
(At least, it should work: this trait should disable the code paths where the MSVC instantiations appear to originate, but I won't know for certain until all the testing finishes.)
Thanks for the help. I have not tried this yet. I hope to move to the new head early next week and then experiment with all of the great advice you have given me. I will certainly give you feedback as to how things go.
Now that I have a working windows build of timemory, I have a few usage questions.
CV_TRACE_REGION_NEXT? See the Tracing macro list for details; it is simply a way to easily use regions for different steps.

So far I have only used the library API functions timemory_push_region and timemory_pop_region, but I believe this is more easily done using the timemory C++ API; I am not sure how to do this or where to look for guidance.

Another question related to how to transition from the OpenCV tracing macros to timemory: OpenCV makes a distinction between tracing a function and tracing a region. Is such a distinction helpful? If so, how do I make this distinction with timemory? (If it matters, my primary backend profiling/analysis tool is VTune.)
How do I use timemory to count user-defined events? For example, with an image processing application, I would like to count the number of images read, written, processed, displayed, copied, etc. Can timemory help me keep track of these events?
Can I use timemory to profile how timing changes over the duration of a long-running process? Again, for an image processing application that runs for 7 days, how do I get a time series of profiling measurements so I can see how performance changes over time? For example, do image reads take longer on day 7, hour 24 than they did on day 1, hour 1?
Documentation suggestion:
For my usage, I need to run multiple processes simultaneously, and each process needs a unique output path. Of course, this is possible with timemory, but it took me a little while to figure out how. Here is my summary code snippet. Is there a place in the documentation to put a tip like this? I would like to contribute if it is useful and I know where to put it.
Thanks again for your help!