Store events in thread local buffer in simple format, convert to proto when flushing

I'm not sure this really makes sense. Here's a profile of a trivial loop of mutex lock/unlock on a single thread:

  56.53%  benchmark  [vdso]                [.] __vdso_clock_gettime                                              
  19.92%  benchmark  libc.so.6             [.] __memmove_avx_unaligned_erms                                      
   5.22%  benchmark  pthread_trace.so      [.] (anonymous namespace)::thread_state::write_end                    
   4.54%  benchmark  libstdc++.so.6.0.30   [.] std::chrono::_V2::system_clock::now                               
   2.80%  benchmark  pthread_trace.so      [.] (anonymous namespace)::thread_state::write_begin_with_delta<2ul, (
   2.17%  benchmark  pthread_trace.so      [.] (anonymous namespace)::thread_state::write_begin<(anonymous namesp
   1.96%  benchmark  ld-linux-x86-64.so.2  [.] __tls_get_addr                                                    
   1.41%  benchmark  libc.so.6             [.] pthread_mutex_lock@@GLIBC_2.2.5                                   
   0.69%  benchmark  libc.so.6             [.] clock_gettime@@GLIBC_2.17                                         
   0.67%  benchmark  libc.so.6             [.] pthread_mutex_unlock@@GLIBC_2.2.5                                 
   0.67%  benchmark  pthread_trace.so      [.] pthread_mutex_unlock                                              
   0.64%  benchmark  libstdc++.so.6.0.30   [.] 0x000000000009eb10                                                
   0.58%  benchmark  pthread_trace.so      [.] pthread_mutex_lock
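
For reference, a loop of the kind profiled above could look roughly like the sketch below. This is not the actual benchmark source; the function name `lock_unlock_ns_per_iter` and the timing harness are assumptions made for illustration.

```cpp
#include <pthread.h>

#include <chrono>

// Hypothetical stand-in for the benchmark: lock and unlock one mutex
// `iterations` times on the calling thread and return the average cost
// of one lock/unlock pair in nanoseconds.
double lock_unlock_ns_per_iter(long iterations) {
  pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

  auto start = std::chrono::steady_clock::now();
  for (long i = 0; i < iterations; ++i) {
    // With the tracer loaded, each of these calls also records an event.
    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
  }
  auto stop = std::chrono::steady_clock::now();

  pthread_mutex_destroy(&mutex);

  std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

Because the mutex is never contended, almost all of the measured time is tracing overhead, which is why the clock read and the buffer copies dominate the profile.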

I don't think there's much room to improve reading the clock.
The memcpy (`__memmove_avx_unaligned_erms` in the profile) is mostly copying from the thread-local buffer to the global circular buffer.
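
To make the idea in the title concrete, here is a minimal sketch of storing fixed-size events in a thread-local buffer and encoding them only when flushing to a shared circular buffer. All names (`raw_event`, `thread_buffer`, `circular_buffer`, `put_varint`) are hypothetical and not taken from pthread_trace. The encoding is bare base-128 varints rather than real proto messages, and the shared buffer uses a mutex for simplicity where a real tracer would more likely use atomics.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Simple in-memory event format (hypothetical layout).
struct raw_event {
  std::uint64_t timestamp_ns;
  std::uint32_t type;      // e.g. 0 = lock begin, 1 = lock end
  std::uint32_t mutex_id;
};

// Write v as a base-128 varint (the integer encoding protobuf uses) and
// return a pointer just past the last byte written.
inline std::uint8_t* put_varint(std::uint8_t* p, std::uint64_t v) {
  while (v >= 0x80) {
    *p++ = static_cast<std::uint8_t>((v & 0x7f) | 0x80);
    v >>= 7;
  }
  *p++ = static_cast<std::uint8_t>(v);
  return p;
}

// Shared byte ring; the newest data overwrites the oldest. size must be > 0.
class circular_buffer {
 public:
  explicit circular_buffer(std::size_t size) : data_(size) {}

  void write(const std::uint8_t* src, std::size_t n) {
    std::lock_guard<std::mutex> lock(mutex_);
    total_ += n;
    while (n > 0) {
      // Copy up to the end of the ring, then wrap around.
      std::size_t chunk = std::min(n, data_.size() - head_);
      std::memcpy(data_.data() + head_, src, chunk);
      head_ = (head_ + chunk) % data_.size();
      src += chunk;
      n -= chunk;
    }
  }

  std::size_t total_written() const { return total_; }

 private:
  std::mutex mutex_;
  std::vector<std::uint8_t> data_;
  std::size_t head_ = 0;
  std::size_t total_ = 0;
};

// Per-thread staging buffer: recording is a plain struct store, and the
// encoding cost is paid once per flush.
class thread_buffer {
 public:
  explicit thread_buffer(circular_buffer& sink) : sink_(sink) {}
  ~thread_buffer() { flush(); }

  void record(const raw_event& e) {
    if (count_ == capacity) flush();
    events_[count_++] = e;
  }

  void flush() {
    // Worst case per event: 10 + 5 + 5 varint bytes.
    std::array<std::uint8_t, capacity * 20> scratch;
    std::uint8_t* p = scratch.data();
    for (std::size_t i = 0; i < count_; ++i) {
      p = put_varint(p, events_[i].timestamp_ns);
      p = put_varint(p, events_[i].type);
      p = put_varint(p, events_[i].mutex_id);
    }
    if (p != scratch.data()) {
      sink_.write(scratch.data(), static_cast<std::size_t>(p - scratch.data()));
    }
    count_ = 0;
  }

 private:
  static constexpr std::size_t capacity = 64;
  circular_buffer& sink_;
  std::array<raw_event, capacity> events_{};
  std::size_t count_ = 0;
};
```

In a tracer, each thread would presumably own one `thread_buffer` (for example as a `thread_local` object) and call `record` from the interposed lock and unlock functions. Note that this arrangement still copies the encoded bytes into the shared ring on every flush, so it does not remove the copy seen in the profile.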

So it seems like at most a ~30% improvement is on the table (the memmove plus the `thread_state::write_*` entries add up to about 30% of samples). That's probably not worth a lot of added complexity...

dsharlet/pthread_trace #3