jay-sridharan opened 1 year ago
I tried increasing BL_TIME_QSAMPLES to 200, which significantly reduces the frequency of the error, but it still fails sometimes.
jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
first: 1
last: 199
00000101620.275182859[error]Hello malc, testing 1, 2, 3.0
jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
first: 42
last: 199
00000101623.368342374[error]Hello malc, testing 1, 2, 3.0
jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
first: 50
last: 199
00000101629.213781522[error]Hello malc, testing 1, 2, 3.0
jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
terminate called after throwing an instance of 'malcpp::exception'
what(): Unable to create malc instance
Aborted (core dumped)
My system seems to have a higher "warm-up time" to bring the latency of clock_gettime down.
https://stackoverflow.com/questions/53252050/why-does-the-call-latency-on-clock-gettimeclock-realtime-vary-so-much
Perhaps the library can just be configured to take more samples - maybe 1000?
Thanks for debugging.
I don't have time now to check, but that code seems to be doing some heuristics to detect clock latencies. On a quick test I couldn't reproduce it.
What seems obvious is that some of the heuristics there are flaky on your system.
This is old code, so I don't even remember myself what it does :)
From a quick read of the code, it seems that it first tries to calculate the "clock_gettime" latency.
Then, with those same measurements, it tries to calculate the time of one tick of the "rdtsc" instruction.
The most common method I have seen is to sleep for an amount of time big enough that it is orders of magnitude bigger than both "clock_gettime"'s latency and a context switch, and then to measure how long the sleep took both in monotonic time and in CPU clocks (rdtsc). This has the drawback that the amount of time to sleep is usually in the seconds range.
To avoid that, the code here tries to get two measurements (reused from the clock_gettime latency calculation, to be faster) that are separated as much as possible and that are not suspected of having a context switch between the two CPU timer measurements (rdtsc).
The criterion used is that anything more than "BL_TIME_CONTEXT_SWITCH_FACTOR" (5) times bigger than "timept64_latency" is considered to have had a context switch between both CPU clock measurements, so you are probably onto something with the clock_gettime variability link.
You could probably look at the variability you get on your machine.
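Something along these lines (my quick sketch, not the library's actual code) could show that variability:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main (void)
{
  enum { samples = 200 }; /* analogous role to BL_TIME_QSAMPLES */
  int64_t min = INT64_MAX, max = 0;
  for (int i = 0; i < samples; ++i) {
    struct timespec a, b;
    clock_gettime (CLOCK_MONOTONIC, &a);
    clock_gettime (CLOCK_MONOTONIC, &b);
    int64_t ns = (int64_t) (b.tv_sec - a.tv_sec) * 1000000000ll
      + (b.tv_nsec - a.tv_nsec);
    if (ns < min) { min = ns; }
    if (ns > max) { max = ns; }
  }
  /* roughly the criterion described above: a gap several times bigger than
     the typical latency is suspected of containing a context switch */
  printf ("clock_gettime latency: min %lld ns, max %lld ns\n",
          (long long) min, (long long) max);
  return 0;
}

If the max/min spread is routinely above the factor, raising it (or the sample count) should help.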
Notice that you should be able to override all these preprocessor variables from the command line without changing the sources:
https://mesonbuild.com/Builtin-options.html#compiler-options
So this can probably work (untested):
meson -Dc_args="-DBL_TIME_CONTEXT_SWITCH_FACTOR=10" -Dcpp_args="-DBL_TIME_CONTEXT_SWITCH_FACTOR=10" ...
I have to document these variables properly if I find some time. Ideas for less flaky and fast methods to do this are welcome.
Interesting! I've been looking deeper into this timing code... Specifically, this section in malc.c:
err = deserializer_execute(
&l->ds,
((u8*) n) + sizeof *n,
((u8*) n) + (slots * l->mem.cfg.slot_size),
n->info.has_timestamp,
l->alloc
);
if (!err.own) {
log_entry le = deserializer_get_log_entry (&l->ds);
malc_log_strings strs;
bl_err entry_err = entry_parser_get_log_strings (&l->ep, &le, &strs);
It looks like the timestamp for each log entry gets set either in serializer_init or in deserializer_execute (depending on whether timestamping happens on the producer or consumer side). Both of those functions end up calling bl_fast_timept_get_fast, which on my system calls the __rdtsc intrinsic. Everything makes sense so far.
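(As a quick aside, a trivial standalone snippet of mine, just to illustrate that the intrinsic returns a raw 64-bit tick count with no nanosecond scaling:)

#include <stdio.h>
#include <x86intrin.h>

int main (void)
{
  unsigned long long t0 = __rdtsc();
  unsigned long long t1 = __rdtsc();
  /* units here are TSC ticks, not nanoseconds */
  printf ("elapsed: %llu ticks\n", t1 - t0);
  return 0;
}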
Later, back in the malc.c snippet above, deserializer_get_log_entry is called and copies the result of __rdtsc from the deserializer object (l->ds) to the log_entry object (le), which gets passed to entry_parser_get_log_strings, where the following lines are executed:
snprintf(
ep->timestamp,
TSTAMP_INTEGER + 2,
"%0" bl_pp_to_str (TSTAMP_INTEGER) FMT_U64 ".",
e->timestamp / bl_nsec_in_sec
);
snprintf(
&ep->timestamp[TSTAMP_INTEGER + 1],
TSTAMP_DECIMAL + 1,
"%0" bl_pp_to_str (TSTAMP_DECIMAL) FMT_U64,
e->timestamp % bl_nsec_in_sec
);
It seems like the current code treats this timestamp (read from __rdtsc) as if it were in units of nanoseconds, but that is not the case. For Intel architectures, the manual states that the nominal TSC frequency is the core crystal clock frequency times the TSC/crystal-clock ratio reported by CPUID leaf 0x15, which I have confirmed on my system both by doing that calculation and by grepping the kernel logs to find the same.
jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c (master)$ cpuid -1 | grep -A2 0x15
Time Stamp Counter/Core Crystal Clock Information (0x15):
TSC/clock ratio = 166/2
nominal core crystal clock = 38400000 Hz
38400000 hertz * (166 / 2) = 3.1872 gigahertz
jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c (master)$ sudo journalctl --boot | grep 'kernel: tsc:' -i | cut -d' ' -f5-
kernel: tsc: Detected 3200.000 MHz processor
kernel: tsc: Detected 3187.200 MHz TSC
This is to say, I think that code should be taking the TSC frequency into account.
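For illustration, here is a small standalone sketch of mine (assuming a GCC/Clang toolchain on x86, using its <cpuid.h>) that derives the nominal TSC frequency from CPUID leaf 0x15, matching the calculation above; raw ticks would then convert as ns = ticks * 1000000000 / tsc_hz:

#include <stdio.h>
#include <stdint.h>
#include <cpuid.h>

int main (void)
{
  unsigned eax, ebx, ecx, edx;
  /* leaf 0x15: EAX = ratio denominator, EBX = ratio numerator,
     ECX = nominal core crystal clock in Hz (0 if not enumerated) */
  if (!__get_cpuid_count (0x15, 0, &eax, &ebx, &ecx, &edx)
    || !eax || !ebx || !ecx) {
    puts ("TSC frequency not enumerated by CPUID leaf 0x15");
    return 1;
  }
  uint64_t tsc_hz = (uint64_t) ecx * ebx / eax;
  /* e.g. 38400000 * 166 / 2 = 3187200000 Hz on the machine above */
  printf ("nominal TSC frequency: %llu Hz\n", (unsigned long long) tsc_hz);
  return 0;
}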
================
There is also the question of whether it is even valid to use the TSC to measure time in the first place. On older architectures the TSC was simply incremented on each tick (which can be impacted by CPUs going into sleep states or stopping altogether), but newer processors come with an Invariant TSC, described in the Intel manual.
This can also be checked with /proc/cpuinfo and cpuid: the presence of the constant_tsc and nonstop_tsc flags implies an invariant TSC. This also assumes that the TSCs of all cores are synchronized at boot, which can be checked in /sys/devices/system/clocksource.
On my platform these assumptions hold true, but it may be a good idea to validate them in the library as well. This link has a nice overview of the gotchas around using rdtsc.
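A minimal Linux-only sketch of mine of how such a validation could look (reading the flags line from /proc/cpuinfo; names are just illustrative):

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/* true if /proc/cpuinfo advertises both constant_tsc and nonstop_tsc */
static bool tsc_looks_invariant (void)
{
  FILE* f = fopen ("/proc/cpuinfo", "r");
  if (!f) {
    return false;
  }
  bool constant = false, nonstop = false;
  char line[8192];
  while (fgets (line, sizeof line, f)) {
    if (strncmp (line, "flags", 5) == 0) {
      constant = strstr (line, " constant_tsc") != NULL;
      nonstop  = strstr (line, " nonstop_tsc") != NULL;
      break;
    }
  }
  fclose (f);
  return constant && nonstop;
}

int main (void)
{
  printf ("invariant TSC: %s\n", tsc_looks_invariant() ? "yes" : "no");
  return 0;
}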
I haven't dug into how this might be best resolved. I see that time.h declares a few functions like bl_timept_get_freq, which are called in the functions declared in timept64_funcs_arbitrary_base.h.
Perhaps the timepoint_posix.h implementation needs to be based off ARBITRARY_BASE instead of NANOSECOND_BASE? Would really appreciate a pointer here!
Notice the "bl_fast_timept_to_nsec" call on "src/malc/serialization.c:510". It is translated afterwards.
If my memory serves me right, on all modern enough Intel processors the clock runs at a constant frequency and is synchronized. If a C library for using the CPU counters existed it would definitely be much better to use it, akin to what Windows' QueryPerformanceCounter/Frequency does. Writing it myself would have been too much of an effort, especially for testing (QEMU could probably have helped though).
Notice that, to allow fixing drift (if it's problematic) and translating monotonic to calendar time at postprocessing time, each rotated log file contains the epoch and the value of the CPU clock in its file name.
I see! I missed that section after the #ifdef - thank you! This makes more sense. I am not sure if drift will be an issue... I don't know how much the TSC will drift over time.
I was looking to patch the library so that the timestamps would automatically be in wall-clock time, to remove the need for further post processing (I don't need that much performance).
I ran the test code below and noticed that the conversion provided by bl_cpu_timept_to_sysclock64_diff_ns was not quite right. The conversion from TSC showed my time about 20 seconds in the future.
/* includes added for completeness; also needs the relevant bl headers
   (e.g. bl/time_extras/time_extras.h) for bl_timept64, bl_fast_timept_get,
   bl_fast_timept_to_nsec, bl_cpu_timept_to_sysclock64_diff_ns and
   bl_nsec_in_sec */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct timespec ts;
char buffer[256];
size_t nbytes;
struct tm* t;
struct tm* t_tsc;

printf("Date Command:\n");
system("date");

bl_timept64 tsc_ns = bl_fast_timept_to_nsec(bl_fast_timept_get());
clock_gettime (CLOCK_REALTIME, &ts);

printf("SysClock Conversion:\n");
t = localtime(&ts.tv_sec);
nbytes = strftime(buffer, 256, "%a %b %e %I:%M:%S.", t);
snprintf(buffer + nbytes, 256 - nbytes, "%.9ld", ts.tv_nsec);
printf("%s\n", buffer);

printf("TSC Conversion:\n");
bl_timept64 adj_time = tsc_ns + bl_cpu_timept_to_sysclock64_diff_ns();
time_t epoch = adj_time / bl_nsec_in_sec;
time_t nanos = adj_time % bl_nsec_in_sec;
t_tsc = localtime(&epoch);
nbytes = strftime(buffer, 256, "%a %b %e %I:%M:%S.", t_tsc);
snprintf(buffer + nbytes, 256 - nbytes, "%.9ld", nanos);
printf("%s\n", buffer);
I needed to modify bl_cpu_timept_to_sysclock64_diff_ns to subtract state.syncdata.tcpu_to_timept64_ns instead of adding it for it to work correctly. Though, since clock_gettime(MONOTONIC) can have an arbitrary base, perhaps a more robust check on which clock is greater is necessary, so that the difference can be properly added or subtracted.
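Something like this (hypothetical names, just to illustrate the idea) would sidestep the add/subtract question by keeping the offset signed:

#include <stdint.h>

/* hypothetical: store the CPU-clock-to-realtime offset as a signed value
   computed once at calibration, so it can be applied with a plain addition
   regardless of which clock happens to be ahead */
static int64_t cpu_ns_to_sysclock_offset_ns;

static inline uint64_t cpu_ns_to_sysclock_ns (uint64_t cpu_ns)
{
  return (uint64_t) ((int64_t) cpu_ns + cpu_ns_to_sysclock_offset_ns);
}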
I am attaching a patch file that adds a test to show the discrepancy, and also splits up the bl_get_timept64_to_sysclock64_diff_ns function so that the calibration routine is not executed on every call. Let me know what you think! There are some other minor formatting changes I made for myself as well, but they are not relevant to this discussion - please ignore them.
The way I'd go about this if I were to code it myself would be adding a meson option. I have this project in maintenance-only mode, so I don't code new features, even though I accept MRs.
You can modify it like the diff below and pass -DBL_NO_CPU_TIMEPT to the compilation, I guess. That will make the "fast" clock fall back on the monotonic clock instead of rdtsc (untested).
diff --git a/include/bl/time_extras/time_extras.h b/include/bl/time_extras/time_extras.h
index a89874a..72d977e 100644
--- a/include/bl/time_extras/time_extras.h
+++ b/include/bl/time_extras/time_extras.h
@@ -10,8 +10,12 @@
 #define BL_CPU_TIMEPT_UNAVAILABLE 0
 #define BL_CPU_TIMEPT_RDTSC 1
 /*----------------------------------------------------------------------------*/
-#if BL_ARCH_IS_X64_X86_IA64
-  #define BL_CPU_TIMEPT BL_CPU_TIMEPT_RDTSC
+#ifndef BL_NO_CPU_TIMEPT
+  #if BL_ARCH_IS_X64_X86_IA64
+    #define BL_CPU_TIMEPT BL_CPU_TIMEPT_RDTSC
+  #else
+    #define BL_CPU_TIMEPT BL_CPU_TIMEPT_UNAVAILABLE
+  #endif
 #else
   #define BL_CPU_TIMEPT BL_CPU_TIMEPT_UNAVAILABLE
 #endif
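Untested, but mirroring the earlier command, something like this should enable it:
meson -Dc_args="-DBL_NO_CPU_TIMEPT" -Dcpp_args="-DBL_NO_CPU_TIMEPT" ...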
Getting this error when I build and run malcpp/hello-malcpp. The error does not happen every time - it seems to be somewhat random when running directly on the terminal, but it doesn't ever seem to show up when running in GDB.
Program Output
Stack trace from the core dump:
Inspecting Frame 9:
Adding some debugging code, the issue seems to be here in time_extras.c
Things are running okay when running with GDB. Here are the variable values:
But if I print the same from the executable itself, when it fails it prints:
So for some reason, these first and last variables are not conforming to the logic when not running through GDB. I am not entirely sure where to proceed with debugging here - I am not fully understanding the intention of this section of code anyway. Any direction is appreciated!
Jay