dsharlet opened 1 week ago
I'm not sure this really makes sense. Here's a profile of a trivial loop of mutex lock/unlock on a single thread:
```
56.53% benchmark [vdso]                [.] __vdso_clock_gettime
19.92% benchmark libc.so.6             [.] __memmove_avx_unaligned_erms
 5.22% benchmark pthread_trace.so      [.] (anonymous namespace)::thread_state::write_end
 4.54% benchmark libstdc++.so.6.0.30   [.] std::chrono::_V2::system_clock::now
 2.80% benchmark pthread_trace.so      [.] (anonymous namespace)::thread_state::write_begin_with_delta<2ul, (
 2.17% benchmark pthread_trace.so      [.] (anonymous namespace)::thread_state::write_begin<(anonymous namesp
 1.96% benchmark ld-linux-x86-64.so.2  [.] __tls_get_addr
 1.41% benchmark libc.so.6             [.] pthread_mutex_lock@@GLIBC_2.2.5
 0.69% benchmark libc.so.6             [.] clock_gettime@@GLIBC_2.17
 0.67% benchmark libc.so.6             [.] pthread_mutex_unlock@@GLIBC_2.2.5
 0.67% benchmark pthread_trace.so      [.] pthread_mutex_unlock
 0.64% benchmark libstdc++.so.6.0.30   [.] 0x000000000009eb10
 0.58% benchmark pthread_trace.so      [.] pthread_mutex_lock
```
So it seems like at most a ~30% improvement is on the table: the memmove plus the write_begin/write_end helpers and the pthread_trace wrappers add up to roughly 30% of the samples, while the rest is dominated by clock_gettime, which this change wouldn't touch. That's probably not worth a lot of added complexity...
Instead of encoding a protobuf directly into the thread-local buffer, we could just record a simple struct of events and generate the protobuf when flushing to the file.
This would reduce overhead in the tracing functions, but would make flushes slower instead. There are pros and cons to this, roughly as sketched below.
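A minimal sketch of what I mean, not the actual pthread_trace.so internals: append fixed-size event structs to a thread-local buffer on the hot path, and only do the encoding when flushing. Names like `raw_event`, `record`, and `flush` are hypothetical, and the flush here writes text instead of protobuf just to keep the example self-contained.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

namespace {

enum class event_type : uint8_t { begin, end };

// Plain struct recorded on the hot path; no varint/protobuf encoding here.
struct raw_event {
  uint64_t timestamp_ns;
  event_type type;
  uint8_t name_id;  // index into an interned name table
};

struct thread_events {
  std::vector<raw_event> events;

  thread_events() { events.reserve(1 << 16); }

  // Hot path: just append a struct. Encoding cost is deferred to flush().
  void record(event_type type, uint8_t name_id) {
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    uint64_t ns = static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(now).count());
    events.push_back({ns, type, name_id});
  }

  // Cold path: convert the raw events to the trace format (protobuf in the
  // real tracer; plain text here) and write them out.
  void flush(std::FILE* f) {
    for (const raw_event& e : events) {
      std::fprintf(f, "%llu %d %d\n",
                   static_cast<unsigned long long>(e.timestamp_ns),
                   static_cast<int>(e.type), e.name_id);
    }
    events.clear();
  }
};

thread_local thread_events tls_events;

}  // namespace
```

Even in this form, the timestamp call stays on the hot path, which is consistent with the profile above showing clock_gettime as the dominant cost.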