DARMA-tasking / vt

DARMA/vt => Virtual Transport

Create LDMS stream publish for phase data #2183

Open lifflander opened 1 year ago

lifflander commented 1 year ago

https://ovis-hpcreadthedocs.readthedocs.io/en/latest/ldms-streams.html#how-to-make-a-data-connector

You'll need to include the following files: ldms.h, ldmsd_stream.h, and util.h. For example, in C++ code, add the following:

#include <ldms/ldms.h> 
#include <ldms/ldmsd_stream.h>
#include <ovis_util/util.h>

You'll also need the object: ldms_t* ldms

The function you'll need to add for publishing messages is:

ldmsd_stream_publish( (*ldms), <NAME_OF_SCHEMA>, <TYPE_OF_MSG>, <MSG_OBJECT>)

So, for example, if someone wanted to send Kokkos data as a JSON to their database, the function would look like this:

ldmsd_stream_publish( (*ldms), "kokkos-perf-data", LDMSD_STREAM_JSON,

PhilMiller commented 1 year ago

Title should read LDMS?

PhilMiller commented 1 year ago

Also, vt's internal diagnostics seem to be perfect for feeding out to LDMS.

Snell1224 commented 1 year ago

The URL for the LDMS documentation has recently been updated to: https://ovis-hpc.readthedocs.io/en/latest/ldms/ldms-streams.html#how-to-make-a-data-connector

lifflander commented 1 year ago

I'm getting this compile-time error when I just include the three files listed above:

root@b86898199925:/build/vt# ninja
[1/2] Building CXX object examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o
FAILED: examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o
/usr/bin/ccache /usr/lib/ccache/g++ -DJSON_USE_IMPLICIT_CONVERSIONS=1 -I/vt/lib/CLI -I/vt/lib/json/include -I/vt/lib/brotli/c/include -I/vt/lib/libfort/lib -I/build/vt/release -I/vt/src -I/build/vt/lib/checkpoint/src -I/vt/lib/checkpoint/src -isystem /vt/lib/fmt/include -isystem /vt/lib/EngFormat-Cpp/include -O3 -DNDEBUG -fdiagnostics-color=always -Wall -pedantic -Wshadow -Wno-unknown-pragmas -Wsign-compare -ftemplate-backtrace-limit=100 -Werror -std=c++17 -MD -MT examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o -MF examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o.d -o examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o -c /vt/examples/hello_world/hello_world.cc
In file included from /usr/local/include/ldms/ldmsd_stream.h:6,
                 from /vt/examples/hello_world/hello_world.cc:47:
/usr/local/include/ldms/ldms_xprt.h:401:26: error: declaration of 'void (* ldms_xprt::app_ctxt_free_fn)(void*)' changes meaning of 'app_ctxt_free_fn' [-fpermissive]
  401 |         app_ctxt_free_fn app_ctxt_free_fn;
      |                          ^~~~~~~~~~~~~~~~
In file included from /vt/examples/hello_world/hello_world.cc:46:
/usr/local/include/ldms/ldms.h:649:16: note: 'app_ctxt_free_fn' declared here as 'typedef void (* app_ctxt_free_fn)(void*)'
  649 | typedef void (*app_ctxt_free_fn)(void *ctxt);
      |                ^~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.

This is the script I used to install LDMS in the container:

https://github.com/DARMA-tasking/vt/blob/2183-create-ldma-stream-publish-for-phase-data/ci/deps/ldms.sh

Snell1224 commented 1 year ago

Can you please try to run the following and see if the issue still occurs?

cd ovis
./autogen.sh
./packaging/make-all-top.sh

I've never encountered this kind of error and usually use the "make-all-top.sh" to build LDMS. This script automatically configures LDMS with the common flags that our team uses (build is under .../ovis/LDMS_install).

In the meantime, I'm going to reach out to others who are more experienced with this kind of error.

Snell1224 commented 1 year ago

UPDATE: What version of LDMS is being installed, and what is the output of g++ --version in the container?

JacobDomagala commented 11 months ago

I was able to successfully build that LDMS and test it with vt (locally). Next I'll try to do the same within our Docker containers.

JacobDomagala commented 11 months ago

I get the same error with LDMS version 4.3.11 (or older). The issue is no longer present when building from the OVIS-4 branch source code.
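For context, the diagnostic above is what g++ emits when a class member reuses a name that was already looked up as a type inside the same class. The following is a minimal sketch that reproduces the pattern without LDMS and shows one possible workaround. The struct `xprt_like` and the helper `release_ctxt` are hypothetical names for illustration, and this is not necessarily how the OVIS-4 branch resolved it; a change like this would have to be made in the header itself.

```cpp
// Namespace-scope typedef, analogous to the one the compiler note points at.
typedef void (*app_ctxt_free_fn)(void *ctxt);

// Writing the member as
//     app_ctxt_free_fn app_ctxt_free_fn;
// is accepted by a C compiler, but g++ rejects it: the first token is looked
// up as the typedef, and the member declared right after it gives the same
// name a different meaning inside the class.

// Possible workaround (an assumption, not the LDMS fix): qualify the type
// name so it always refers to the namespace-scope typedef.
struct xprt_like {
  void *app_ctxt;
  ::app_ctxt_free_fn app_ctxt_free_fn;
};

// Release the context through the stored callback, if one is set.
inline void release_ctxt(xprt_like &x) {
  if (x.app_ctxt_free_fn != nullptr) {
    x.app_ctxt_free_fn(x.app_ctxt);
    x.app_ctxt = nullptr;
  }
}
```

Renaming either the typedef or the member would avoid the clash as well.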

Snell1224 commented 11 months ago

@JacobDomagala Thank you for catching this and letting me know. The LDMS team and I will look into it. Feel free to reach out if you come across any more issues!

lifflander commented 10 months ago

@Snell1224 @vsurjadidjaja

Here is a screenshot of the form the data will take from our current JSON statistics file. This data will be incrementally submitted phase-by-phase as the data is computed. A phase is roughly equivalent to a timestep in an application. After a phase runs, the load balancer might be run depending on the configuration. Thus, we always have pre-LB statistics and we might have a migration count and post-LB statistics depending on whether it ran or not.

[Image: Screenshot 2023-11-08 at 12 45 07]

So after each phase completes, we will submit this:

{
  "id": 4,    // A unique phase ID
  "ts": 40.0, // The timestamp
  "migration count": 1, // number of migrations [optional]
  "pre-LB": {
     "Object_comm": { },
     "Object_load_modeled": { },
     "Object_load_raw": { },
     "Rank_comm": { },
     "Rank_load_modeled": { },
     "Rank_load_raw": { }
  },
  "post-LB": { // [optional]
     // Same as pre-LB
  }
}

Each one of the keys (Object_comm, Object_load_modeled, ...) in pre- and post-LB will include the following statistics:

"avg": 7190.222222222223, // mean
"car": 9.0, // cardinality
"imb": 0.2739522808752626, // imbalance (max/avg-1)
"kur": -1.7815486080524885, // kurtosis
"max": 9160.0, // maximum 
"min": 5880.0, // minimum
"npr": 9.0,
"skw": 0.515228148796637, // skewness
"std": 1310.1329733467003, // standard deviation
"sum": 64712.0, // sum
"var": 1716448.4078502655 // variance
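As a rough illustration of how several of these fields relate to each other, here is a minimal sketch that computes a subset of them from a list of values. The names `Stats` and `computeStats` are hypothetical, the variance is assumed to be the population variance (the thread does not say which form vt uses), and `kur`, `skw`, and `npr` are left out because their exact definitions are not given here.

```cpp
#include <algorithm>
#include <cmath>
#include <stdexcept>
#include <vector>

// Subset of the per-key statistics (hypothetical struct, not vt's own type).
struct Stats {
  double avg;    // mean
  double car;    // cardinality
  double imb;    // imbalance (max/avg - 1)
  double max;    // maximum
  double min;    // minimum
  double stdev;  // standard deviation ("std" in the JSON)
  double sum;    // sum
  double var;    // variance (assumed: population variance)
};

Stats computeStats(const std::vector<double> &values) {
  if (values.empty()) {
    throw std::invalid_argument("computeStats: no values");
  }
  Stats s{};
  s.car = static_cast<double>(values.size());
  for (double v : values) {
    s.sum += v;
  }
  s.avg = s.sum / s.car;
  s.max = *std::max_element(values.begin(), values.end());
  s.min = *std::min_element(values.begin(), values.end());
  // Report zero imbalance when the mean is zero to avoid dividing by zero
  // (a choice made for this sketch).
  s.imb = (s.avg != 0.0) ? s.max / s.avg - 1.0 : 0.0;
  double squared_dev = 0.0;
  for (double v : values) {
    squared_dev += (v - s.avg) * (v - s.avg);
  }
  s.var = squared_dev / s.car;
  s.stdev = std::sqrt(s.var);
  return s;
}
```

The sample values shown above are consistent with these relations: 64712.0 / 9.0 gives the listed mean, and 9160.0 divided by that mean, minus 1, gives the listed imbalance.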

For the stream publish key, I propose "vtLBStats".
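A minimal sketch of how the per-phase message could be assembled and handed to a publisher, assuming the layout described above. `PhaseMessage`, `toJson`, `Publisher`, and `publishPhase` are hypothetical names, and the publisher is an injected callback so the sketch does not depend on LDMS; in a real integration that callback would presumably wrap the `ldmsd_stream_publish` call with `LDMSD_STREAM_JSON` mentioned at the top of this issue.

```cpp
#include <functional>
#include <iomanip>
#include <sstream>
#include <string>

// One phase's worth of data (hypothetical container for this sketch).
struct PhaseMessage {
  int id;               // unique phase ID
  double ts;            // timestamp
  std::string pre_lb;   // already-serialized "pre-LB" JSON object
  bool lb_ran;          // whether the load balancer ran after this phase
  int migration_count;  // only emitted when lb_ran is true
  std::string post_lb;  // only emitted when lb_ran is true
};

// Serialize the message, emitting the optional fields only if the LB ran.
std::string toJson(const PhaseMessage &m) {
  std::ostringstream out;
  out << std::fixed << std::setprecision(6);
  out << "{\"id\":" << m.id << ",\"ts\":" << m.ts;
  if (m.lb_ran) {
    out << ",\"migration count\":" << m.migration_count;
  }
  out << ",\"pre-LB\":" << m.pre_lb;
  if (m.lb_ran) {
    out << ",\"post-LB\":" << m.post_lb;
  }
  out << "}";
  return out.str();
}

// Callback taking (stream name, JSON body) and returning a status code.
using Publisher =
    std::function<int(const std::string &, const std::string &)>;

// Publish one phase under the proposed stream key.
int publishPhase(const PhaseMessage &m, const Publisher &publish) {
  return publish("vtLBStats", toJson(m));
}
```

Keeping the publisher as a callback also makes it possible to test the message format without an LDMS daemon running.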

lifflander commented 10 months ago

@Snell1224 @vsurjadidjaja I'm a little confused as to how I should convert the output of gettime() to be consistent with what you need.

Snell1224 commented 10 months ago

We use epoch time for analyzing streams data so we send this in the JSON message. As for when to record/get the time, that's more of a preference thing. I'm not too familiar with VT but if you don't need to monitor the start/duration/end time of each phase, then getting the time whenever you send the JSON message will work.

The example below shows what we do for Darshan and how we collect the end time of an I/O event (again this is just a preference):

static inline struct timespec abs_timespec(void)
{
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);  // wall-clock time since the epoch
    return tp;
}

struct timespec tspec_start, tspec_end;

tspec_start = abs_timespec();
// IO stuff happening here
tspec_end = abs_timespec();
// Do other stuff and send message

// Nanoseconds to microseconds (declaration added here; type chosen to match %lu)
unsigned long micro_s = tspec_end.tv_nsec / 1000;
sprintf(jb11,"{.....,\"timestamp\":%lu.%.6lu}]}", ....., tspec_end.tv_sec, micro_s);
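On the C++ side, an equivalent epoch timestamp could be produced with `std::chrono`. This is a sketch only: `formatEpoch` and `epochNow` are hypothetical helper names, and it assumes `std::chrono::system_clock` counts from the Unix epoch, which is guaranteed from C++20 on and holds in practice on common platforms before that.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Format seconds plus nanoseconds as "<seconds>.<6-digit microseconds>".
std::string formatEpoch(long long sec, long nsec) {
  char buf[48];
  std::snprintf(buf, sizeof(buf), "%lld.%06ld", sec, nsec / 1000);
  return buf;
}

// Current wall-clock time in the same "seconds.microseconds" form.
std::string epochNow() {
  using namespace std::chrono;
  const auto since_epoch = system_clock::now().time_since_epoch();
  const auto sec = duration_cast<seconds>(since_epoch);
  const auto nsec = duration_cast<nanoseconds>(since_epoch - sec);
  return formatEpoch(static_cast<long long>(sec.count()),
                     static_cast<long>(nsec.count()));
}
```

Calling `epochNow()` at the point where the JSON message is sent would match the "get the time whenever you send the message" option described above.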