GitBubble / hi-benchmarks

A lightweight benchmark platform for evaluating the health of distributed ROS systems.

investigate netdata framework #1

Open GitBubble opened 6 years ago

GitBubble commented 6 years ago

DFX REPORT (PART 1)

Netdata investigation report -> how to monitor our ROS system

netdata is a great tool for server monitoring, and we want to learn from its architecture. Here is a brief report to guide our plans for the near future.

1, Points of interest

  1. netdata collects data on each host with only about one second of lag behind the real state. This is what they mean by real-time.

  2. netdata can mirror collected data to a designated server. They call this real-time data replication and mirroring.

  3. netdata uses a plugin language to add plugins that collect more metrics, and it can visualize the data as charts as long as the item has countable data.

  4. netdata collects all of its metrics locally.

  5. netdata has alarms with pre-configured events and thresholds.

  6. netdata can pipe messages to syslog and various other tools, such as:

(email addresses, Slack channels, Discord channels, IRC channels, Pushover, Pushbullet, telegram.org, PagerDuty, Twilio, MessageBird, Alerta, Flock, Kavenegar)
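As a rough illustration of item 5, a threshold alarm can be sketched in Python as below. This is not netdata's health configuration format or its health engine; the `Alarm` class, its field names and the example thresholds are all hypothetical, chosen only to show the idea of warning and critical levels.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """Hypothetical threshold alarm; not netdata's actual health engine."""
    name: str
    warning: float   # a value at or above this raises WARNING
    critical: float  # a value at or above this raises CRITICAL

    def evaluate(self, value: float) -> str:
        # Check the most severe level first.
        if value >= self.critical:
            return "CRITICAL"
        if value >= self.warning:
            return "WARNING"
        return "CLEAR"


if __name__ == "__main__":
    cpu = Alarm(name="cpu_usage_percent", warning=75.0, critical=90.0)
    for sample in (10.0, 80.0, 95.0):
        print(cpu.name, sample, cpu.evaluate(sample))
```

For hi-benchmarks the same shape could be applied to ROS metrics (for example the number of live nodes), with the status string handed to whatever notification channel we pick.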

2, Monitoring and health alerts are great

Monitoring is an interesting feature of netdata. We are especially interested in the following two areas:

  - monitoring systemd services
  - monitoring IPMI tools

https://github.com/firehol/netdata/wiki/health-monitoring

3, Runtime cost on different architectures

macOS High Sierra

netdata no longer seems to be a lightweight monitor. So much for the claim!

50 threads, 14 shared libraries...

(screenshots: 2018-07-05 00 47 06, 2018-07-05 01 20 11)

Ubuntu 16.04 on a virtual machine, x86_64:

18 threads, 10 shared libraries:

(screenshot: 2018-07-05 01 18 30)

Ubuntu 18.04, ARM64:

17 threads, 11 shared libraries

(gdb)
  Id   Target Id         Frame
* 1    Thread 0xffff8e520660 (LWP 76563) "netdata" 0x0000ffff8e420624 in __libc_pause () at ../sysdeps/unix/sysv/linux/pause.c:32
  2    Thread 0xffff8e10f130 (LWP 76586) "netdata" 0x0000ffff8e420728 in __GI___nanosleep (requested_time=0xffff8e10e3b8, remaining=0xffff8e10e3c8)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  3    Thread 0xffff8d90e130 (LWP 76587) "netdata" 0x0000ffff8e420728 in __GI___nanosleep (requested_time=0xffff8d90d528, remaining=0xffff8d90d538)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  4    Thread 0xffff8d10d130 (LWP 76588) "netdata" 0x0000ffff8e420728 in __GI___nanosleep (requested_time=0xffff8d10c198, remaining=0xffff8d10c1a8)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  5    Thread 0xffff8c90c130 (LWP 76589) "netdata" 0x0000ffff8e440ae4 in __GI___libc_read (fd=<optimized out>, buf=0xaaaaccdb30c0, nbytes=4096)
    at ../sysdeps/unix/sysv/linux/read.c:27
  6    Thread 0xffff8c10b130 (LWP 76590) "netdata" 0x0000ffff8e420728 in __GI___nanosleep (requested_time=0xffff8c10a5c8, remaining=0xffff8c10a5d8)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  7    Thread 0xffff8b109130 (LWP 76592) "netdata" 0x0000ffff8e420728 in __GI___nanosleep (requested_time=0xffff8b1084b8, remaining=0xffff8b1084c8)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  8    Thread 0xffff8a908130 (LWP 76593) "netdata" 0x0000ffff8e420728 in __GI___nanosleep (requested_time=0xffff8a907198, remaining=0xffff8a907198)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  9    Thread 0xffff8a107130 (LWP 76594) "netdata" 0x0000ffff8e445048 in __GI___poll (fds=0xaaaabfd9f740, nfds=187650274949376, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  10   Thread 0xffff89906130 (LWP 76596) "netdata" 0x0000ffff8e420728 in __GI___nanosleep (requested_time=0xffff89904158, remaining=0xffff89904168)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  11   Thread 0xffff88904130 (LWP 76601) "netdata" 0x0000ffff8e445048 in __GI___poll (fds=0xaaaabfe33d90, nfds=187650274949376, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  12   Thread 0xffff88103130 (LWP 76604) "netdata" 0x0000ffff8e445048 in __GI___poll (fds=0xaaaabfe31fd0, nfds=187650274949376, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  13   Thread 0xffff86900130 (LWP 76608) "netdata" 0x0000ffff8e445048 in __GI___poll (fds=0xaaaabfe25f30, nfds=187650274949376, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  14   Thread 0xffff8b90a130 (LWP 76612) "netdata" 0x0000ffff8e445048 in __GI___poll (fds=0xaaaabfe4a290, nfds=187650274949376, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  15   Thread 0xffff858fe130 (LWP 76614) "netdata" 0x0000ffff8e445048 in __GI___poll (fds=0xaaaabfe27bb0, nfds=187650274949376, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  16   Thread 0xffff850fd130 (LWP 76615) "netdata" 0x0000ffff8e440ae4 in __GI___libc_read (fd=<optimized out>, buf=0xaaaabfe29670, nbytes=4096)
    at ../sysdeps/unix/sysv/linux/read.c:27
  17   Thread 0xffff848fc130 (LWP 76618) "netdata" 0x0000ffff8e445048 in __GI___poll (fds=0xaaaabfe45a30, nfds=1, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x0000ffff8e39e380  0x0000ffff8e48eba8  Yes         /lib/aarch64-linux-gnu/libc.so.6
0x0000ffff8e2cef40  0x0000ffff8e324d48  Yes         /lib/aarch64-linux-gnu/libm.so.6
0x0000ffff8e299020  0x0000ffff8e2ade94  Yes (*)     /lib/aarch64-linux-gnu/libz.so.1
0x0000ffff8e2816b0  0x0000ffff8e285018  Yes (*)     /lib/aarch64-linux-gnu/libuuid.so.1
0x0000ffff8e259690  0x0000ffff8e267a0c  Yes         /lib/aarch64-linux-gnu/libpthread.so.0
0x0000ffff8e4fb040  0x0000ffff8e511e48  Yes         /lib/ld-linux-aarch64.so.1
0x0000ffff8e23d350  0x0000ffff8e241f0c  Yes         /lib/aarch64-linux-gnu/libnss_compat.so.2
0x0000ffff8e223120  0x0000ffff8e229d28  Yes         /lib/aarch64-linux-gnu/libnss_nis.so.2
0x0000ffff8e2001c0  0x0000ffff8e20b81c  Yes         /lib/aarch64-linux-gnu/libnsl.so.1
0x0000ffff8e1dd400  0x0000ffff8e1e3b54  Yes         /lib/aarch64-linux-gnu/libnss_files.so.2
0x0000ffff8e115fd0  0x0000ffff8e13cc3c  Yes (*)     /lib/aarch64-linux-gnu/libnss_systemd.so.2
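The thread and library counts above were taken with gdb. On Linux, roughly the same numbers can be read from /proc without attaching a debugger. Below is a minimal Python sketch, assuming a Linux /proc layout. The function names are my own, and the library count may differ slightly from gdb's because libraries are recognised only by a `.so` file name pattern.

```python
import os
import re

# Matches mapped files such as /lib/aarch64-linux-gnu/libc.so.6 or libfoo.so
_SO_PATTERN = re.compile(r"\.so(\.[0-9]+)*$")


def shared_libraries(maps_text: str) -> set:
    """Return the distinct shared-object paths found in /proc/<pid>/maps text."""
    libs = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [path]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if _SO_PATTERN.search(path):
                libs.add(path)
    return libs


def count_threads(pid: int) -> int:
    """Each thread of a process shows up as a directory in /proc/<pid>/task."""
    return len(os.listdir(f"/proc/{pid}/task"))


def report(pid: int) -> str:
    with open(f"/proc/{pid}/maps") as f:
        libs = shared_libraries(f.read())
    return f"{count_threads(pid)} threads, {len(libs)} shared libraries"


if __name__ == "__main__":
    # Demo on this Python process itself; pass netdata's PID in practice.
    pid = os.getpid()
    if os.path.isdir(f"/proc/{pid}/task"):
        print(report(pid))
```

Reading another user's process (such as a netdata daemon) may need root, just like attaching gdb does.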

4, Summary:

The sources are written in a very clear style, but the build system is sophisticated: it includes autoconf files as well as CMake, with well-debugged scripts. The flexible architecture behind its design philosophy is a plugin system. It creates a number of static threads to run plugins, which take advantage of bash/python/systemd/etc. utilities to collect statistics locally. It is fast, but it is also linked against many libraries that we will not need in a ROS system.

5, Conclusion:

We can adopt the plugin system and some parts of its HTTP server, but it carries a lot of baggage we don't need. We need to trim the source and tailor it to the ROS scenario. In particular, the monitoring and alert system is a really good wheel that we can reuse to create hi-benchmarks.

GitBubble commented 6 years ago

DFX REPORT (PART 2)

Netdata investigation report -> learning how netdata reinvents the wheel

After reading the code for an hour or so, I realized that netdata is a truly carefully written project.

It builds its sources from scratch out of small utility functions, whereas a normal open source project would use existing ones. You may ask WHY?

netdata aims to run on every piece of cheap hardware and every platform, a goal it shares with hi-benchmarks. Different systems and architectures may behave differently when using the Linux headers, and some are not even POSIX. To adapt to these platforms, netdata rewrote some system-level functions.

Reinventing the wheel does help the project: it improves portability and performance. I think it is one of the most important reasons why netdata is a popular project on GitHub.

For example, in src/inlined.h:

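// Copies at most n bytes from src, then always writes the terminator,
// so dst must have room for n + 1 bytes.
static inline char *strncpyz(char *dst, const char *src, size_t n) {
    char *p = dst;

    while (*src && n--)
        *dst++ = *src++;
    *dst = '\0';

    return p;
}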

It's a nearly standard implementation. The project assumes that the underlying environment may not even have the glibc header files.

Compare it with the glibc implementation ( https://github.com/lattera/glibc/blob/master/string/strncpy.c ), reproduced here. It gives a clue that little tricks like these, taken together, make netdata a better performance monitoring tool that can run in every scenario.

#ifndef STRNCPY
#define STRNCPY strncpy
#endif

char *
STRNCPY (char *s1, const char *s2, size_t n)
{
  char c;
  char *s = s1;

  --s1;

  if (n >= 4)
    {
      size_t n4 = n >> 2;

      for (;;)
        {
          c = *s2++;
          *++s1 = c;
          if (c == '\0')
            break;
          c = *s2++;
          *++s1 = c;
          if (c == '\0')
            break;
          c = *s2++;
          *++s1 = c;
          if (c == '\0')
            break;
          c = *s2++;
          *++s1 = c;
          if (c == '\0')
            break;
          if (--n4 == 0)
            goto last_chars;
        }
      n = n - (s1 - s) - 1;
      if (n == 0)
        return s;
      goto zero_fill;
    }

 last_chars:
  n &= 3;
  if (n == 0)
    return s;

  do
    {
      c = *s2++;
      *++s1 = c;
      if (--n == 0)
        return s;
    }
  while (c != '\0');
  while (c != '\0');

 zero_fill:
  do
    *++s1 = '\0';
  while (--n > 0);

  return s;
}
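To make the behavioural difference between the two functions concrete, here is a small Python model of both, operating on byte buffers. It is only a model under my own assumptions: Python bytes have no terminator, so the source is treated as ending at its length or at its first zero byte, and the Python function names simply mirror the C ones.

```python
def strncpyz(dst: bytearray, src: bytes, n: int) -> None:
    """Model of netdata's strncpyz: copy at most n bytes, always terminate.

    dst must have room for n + 1 bytes.
    """
    i = 0
    while i < n and i < len(src) and src[i] != 0:
        dst[i] = src[i]
        i += 1
    dst[i] = 0  # the terminator is written unconditionally


def strncpy(dst: bytearray, src: bytes, n: int) -> None:
    """Model of standard strncpy: write exactly n bytes, zero-padding the tail.

    If src has n or more bytes before its terminator, dst is NOT terminated.
    """
    i = 0
    while i < n and i < len(src) and src[i] != 0:
        dst[i] = src[i]
        i += 1
    while i < n:  # pad the rest of the n bytes with zeros
        dst[i] = 0
        i += 1


if __name__ == "__main__":
    a = bytearray(b"########")
    strncpyz(a, b"netdata", 3)
    print(a)  # bytearray(b'net\x00####')  -> truncated but terminated

    b = bytearray(b"########")
    strncpy(b, b"netdata", 3)
    print(b)  # bytearray(b'net#####')     -> truncated, no terminator

    c = bytearray(b"########")
    strncpy(c, b"ab", 5)
    print(c)  # bytearray(b'ab\x00\x00\x00###') -> short source is zero-padded
```

So the netdata version is not a drop-in copy of strncpy: it trades the zero padding for a guaranteed terminator, which is usually what a caller filling a fixed-size buffer wants.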

One thing that confused me, though, is that it declares every function with extern. For function declarations extern is already the default: the compiler assumes that a definition exists somewhere and that the function is visible globally, even if you don't write the keyword.

GitBubble commented 6 years ago

DFX REPORT (PART 3)

Netdata investigation report -> learning its excellent architecture

Apart from the database section, the other mechanisms are fairly easy to grasp after reading the code. Fortunately, the author is very responsive to issues on GitHub.

He describes the general idea of netdata:

  1. netdata spawns threads and processes for data collection.
  2. these threads collect information from files in /proc, by connecting to other processes APIs, by executing third party commands, sending queries to databases, application servers, etc. There is no limit on what netdata can do to collect data. We support whatever is needed. There are several APIs involved in this process (including a few netdata specific).
  3. the collected metrics are processed (normalized, interpolated, etc) and stored in a round-robin time-series database with a fixed step.
  4. web dashboards, query this database to retrieve data, which then are passed to javascript visualization libraries to render them on screen.

Like this:

(image)
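Steps 3 and 4 of the description can be sketched as a ring buffer with a fixed step. This is a simplified Python illustration, not netdata's actual storage code: the class and method names are hypothetical, and missing samples are stored as gaps (None) instead of being interpolated.

```python
from typing import List, Optional


class RoundRobinSeries:
    """Fixed-size, fixed-step series: new samples overwrite the oldest ones."""

    def __init__(self, slots: int, step: int = 1) -> None:
        self.step = step  # seconds covered by one slot
        self.values: List[Optional[float]] = [None] * slots
        self.newest: Optional[int] = None  # absolute index of the newest slot

    def store(self, timestamp: int, value: float) -> None:
        slot = timestamp // self.step
        size = len(self.values)
        if self.newest is not None:
            if slot <= self.newest:
                return  # this sketch simply ignores late or duplicate samples
            # Mark slots that were skipped since the last sample as gaps.
            for missed in range(max(self.newest + 1, slot - size + 1), slot):
                self.values[missed % size] = None
        self.values[slot % size] = value
        self.newest = slot

    def query(self, after: int, before: int) -> List[Optional[float]]:
        """Values for the slots in [after, before] still retained, oldest first."""
        if self.newest is None:
            return []
        size = len(self.values)
        first = max(after // self.step, self.newest - size + 1)
        last = min(before // self.step, self.newest)
        return [self.values[s % size] for s in range(first, last + 1)]


if __name__ == "__main__":
    rr = RoundRobinSeries(slots=3, step=1)
    for t, v in [(100, 1.0), (101, 2.0), (102, 3.0), (103, 4.0)]:
        rr.store(t, v)
    print(rr.query(100, 103))  # [2.0, 3.0, 4.0]  (t=100 was overwritten)
    rr.store(105, 6.0)         # t=104 was never collected
    print(rr.query(103, 105))  # [4.0, None, 6.0]
```

The attraction for hi-benchmarks is that memory use is fixed up front: the number of slots times the number of metrics, no matter how long the monitor runs.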

above and below the database there are 2 APIs:

above, is the web API. This gets requests, examines the database and responds in various formats. It is also capable of reducing the data (calculating sums, averages, etc). The whole idea is to produce output in the format expected by each visualization library, so many formats are supported.
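The data reduction mentioned here can be pictured as grouping consecutive samples and collapsing each group into one value. A minimal Python sketch of that idea follows; the function name `reduce_points` and its parameters are my own and are not part of netdata's web API.

```python
from statistics import mean
from typing import Callable, List, Sequence


def reduce_points(values: Sequence[float], group: int,
                  method: Callable[[Sequence[float]], float] = mean) -> List[float]:
    """Collapse every `group` consecutive samples into one value.

    `method` decides how a group is reduced (mean, sum, max, ...).
    The last group may be shorter if len(values) is not a multiple of group.
    """
    if group < 1:
        raise ValueError("group must be >= 1")
    return [method(values[i:i + group]) for i in range(0, len(values), group)]


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6]
    print(reduce_points(samples, 2))       # [1.5, 3.5, 5.5]
    print(reduce_points(samples, 3, sum))  # [6, 15]
    print(reduce_points(samples, 4, max))  # [4, 6]
```

A dashboard that can only draw a few hundred points would ask for a group size large enough to fit its chart width, instead of pulling every one-second sample.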

below, is the data collection API. There is a low-level C API (the internal netdata plugins use this) and a high level external plugins API (this is text based and uses internally the low-level C API).
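For the text-based external plugins API, a plugin is just a program that prints lines such as CHART, DIMENSION, BEGIN, SET and END to its standard output. Here is a minimal Python sketch of producing those lines. The chart id `ros.nodes`, the dimension `count` and the helper names are hypothetical, and only a small subset of the optional CHART and DIMENSION parameters is used.

```python
import sys
import time
from typing import Callable, Dict, List


def chart_lines(chart: str, title: str, units: str, dims: List[str]) -> List[str]:
    """Lines that declare a chart and its dimensions (sent once at start-up)."""
    lines = [f"CHART {chart} '' '{title}' '{units}'"]
    lines += [f"DIMENSION {d} '' absolute 1 1" for d in dims]
    return lines


def update_lines(chart: str, values: Dict[str, int]) -> List[str]:
    """Lines that report one collected sample for the chart."""
    lines = [f"BEGIN {chart}"]
    lines += [f"SET {d} = {v}" for d, v in values.items()]
    lines.append("END")
    return lines


def run(collect: Callable[[], Dict[str, int]], iterations: int,
        interval: float = 1.0, out=sys.stdout) -> None:
    """Declare the chart, then emit one update per interval.

    A real plugin would loop forever; `iterations` exists only for the demo.
    """
    print("\n".join(chart_lines("ros.nodes", "ROS nodes", "nodes", ["count"])), file=out)
    for _ in range(iterations):
        print("\n".join(update_lines("ros.nodes", collect())), file=out)
        out.flush()  # netdata reads the pipe line by line, so do not buffer
        time.sleep(interval)


if __name__ == "__main__":
    # Placeholder collector; a real one would query the ROS master.
    run(lambda: {"count": 3}, iterations=2, interval=0.1)
```

Because the protocol is plain text on a pipe, the same plugin could be written in bash or any other language available on the target board.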

You can link multiple netdata together, streaming data from one to another. On the sender side the low level C API spawns threads to stream data to another netdata. On the receiver side, the web API spawns threads connecting the sockets from the sending netdata to the external plugins API. Using this mechanism, netdata are able to mirror their data, support proxies, etc.

GitBubble commented 6 years ago

Tasks to be assigned:

1, write a Python plugin to retrieve the node list -> to expose all functions of the rosnode subcommands
2, write a Python plugin to retrieve the topic list -> to expose all functions of the rostopic subcommands
3, write a bash/python plugin to retrieve anything useful for monitoring ROS nodes, even graphs, messages, etc.
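As a starting point for tasks 1 and 2, the collection side could look like the Python sketch below. It assumes the ROS 1 command-line tools `rosnode list` and `rostopic list`, which print one name per line, and a running ROS master. The helper names are hypothetical, and the demo uses canned output so it runs without ROS installed.

```python
import subprocess
from typing import Callable, Dict, List


def parse_name_list(output: str) -> List[str]:
    """`rosnode list` and `rostopic list` print one name per line."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def run_ros_command(args: List[str], timeout: float = 5.0) -> str:
    """Run a ROS command-line tool and return its stdout.

    Raises if the tool is missing, fails (e.g. no ROS master), or times out.
    """
    result = subprocess.run(args, capture_output=True, text=True,
                            timeout=timeout, check=True)
    return result.stdout


def collect_counts(runner: Callable[[List[str]], str] = run_ros_command) -> Dict[str, int]:
    """Count live nodes and topics; `runner` is injectable for testing."""
    nodes = parse_name_list(runner(["rosnode", "list"]))
    topics = parse_name_list(runner(["rostopic", "list"]))
    return {"nodes": len(nodes), "topics": len(topics)}


if __name__ == "__main__":
    # Canned output standing in for a real ROS system.
    fake = {
        "rosnode": "/rosout\n/talker\n/listener\n",
        "rostopic": "/chatter\n/rosout\n/rosout_agg\n",
    }
    print(collect_counts(lambda args: fake[args[0]]))  # {'nodes': 3, 'topics': 3}
```

Shelling out once per second is probably too heavy for small boards, so a later version might talk to the ROS master directly; that is left open here.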

foodtooth commented 6 years ago

Hi, I was digging into the netdata code (trying my best, because I'm still a newbie in the C world). I found that the essential parts should be the web server, the plugin system and the DB storage (considering PostgreSQL, MongoDB, or Redis). The mini-base-netdata for our hi-benchmarks should be ready in a day or two.