alx01011 / matsa

MaTSa (JDK17)

incomplete (or unavailable) previous stack traces #23

Open alx01011 opened 4 weeks ago

alx01011 commented 4 weeks ago

MaTSa currently relies on a per-thread history buffer that records each field access and function enter/exit event. The buffer size is configurable via the environment variable MATSA_HISTORY.

Unfortunately, the buffer can fill up very quickly, especially when a field is accessed in a loop with many iterations. Once it overflows, the recorded history is invalidated, so in some benchmarks the previous stack trace was missing.

The solution I propose is keeping a ring buffer of history buffers, plus an index indicating the currently active buffer. Each history buffer has a configurable size and an epoch. On overflow, we increment the index of the active buffer in the larger ring buffer, which moves us to the next history buffer; there we overwrite whatever was previously stored and increment that buffer's epoch.
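For illustration, here is a minimal sketch of the proposed structure, assuming a per-thread ring; the names (`HistoryRing`, `HistoryBuffer`, `record`, `Event`) are placeholders and not MaTSa's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Event { uint64_t data; };          // one func enter/exit record

struct HistoryBuffer {
    std::vector<Event> events;            // fixed capacity (MATSA_HISTORY-sized)
    uint16_t epoch = 0;                   // bumped whenever the buffer is reused
    size_t next = 0;                      // next free slot
};

struct HistoryRing {
    std::vector<HistoryBuffer> buffers;   // the ring of history buffers
    size_t current = 0;                   // index of the active buffer

    HistoryRing(size_t nbuffers, size_t nslots) : buffers(nbuffers) {
        for (HistoryBuffer& b : buffers) b.events.resize(nslots);
    }

    void record(Event e) {
        if (buffers[current].next == buffers[current].events.size()) {
            // Overflow: advance to the next buffer in the ring, overwrite
            // its old contents, and bump its epoch so that stale shadow
            // references to it can later be detected.
            current = (current + 1) % buffers.size();
            buffers[current].epoch++;
            buffers[current].next = 0;
        }
        HistoryBuffer& hb = buffers[current];
        hb.events[hb.next++] = e;
    }
};
```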

On each access I propose storing 8 additional bytes in shadow memory; the layout could look like this (the exact layout could be made configurable):

16 bits : BCI
16 bits : Ring buffer index
16 bits : History buffer epoch
16 bits : History buffer index

Where:

BCI : the bytecode index of the access
Ring buffer index : which history buffer in the ring was active at the time of the access
History buffer epoch : that buffer's epoch at the time of the access; a mismatch with the buffer's current epoch means the history has since been overwritten
History buffer index : the position inside the history buffer at the time of the access
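As a sketch, the 64-bit shadow word could be packed and unpacked like this (the struct and helper names are illustrative, not MaTSa's actual API):

```cpp
#include <cstdint>

struct ShadowHistoryRef {
    uint16_t bci;         // bytecode index of the access
    uint16_t ring_index;  // which history buffer in the ring was active
    uint16_t epoch;       // that buffer's epoch at the time of the access
    uint16_t slot_index;  // position inside the history buffer
};

static inline uint64_t pack(ShadowHistoryRef r) {
    return  (uint64_t)r.bci
          | ((uint64_t)r.ring_index << 16)
          | ((uint64_t)r.epoch      << 32)
          | ((uint64_t)r.slot_index << 48);
}

static inline ShadowHistoryRef unpack(uint64_t w) {
    return { (uint16_t)w,         (uint16_t)(w >> 16),
             (uint16_t)(w >> 32), (uint16_t)(w >> 48) };
}
```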

Why not use a single large history buffer instead? Mainly because when the ring buffer wraps around, it takes us back to the beginning, so younger accesses (in later buffers) survive a little longer. Another argument is that each buffer stays smaller, making it faster to traverse when formulating the pseudo-stack.

Future work:

alx01011 commented 4 weeks ago

@zakkak Any thoughts on this? Or anything better to propose?

zakkak commented 3 weeks ago

Do I get it right that the ring buffer will be per thread as well?

If so, why not use a single circular buffer?

alx01011 commented 3 weeks ago

> Do I get it right that the ring buffer will be per thread as well?
>
> If so, why not use a single circular buffer?

@zakkak Yes, you are right. The problem with a single circular buffer is accuracy. From the "Previous Stack Traces" section of the paper:

> If the buffer reaches maximum capacity, we set the current index to zero, practically invalidating the whole buffer to avoid inaccurate traces. We cannot use it as a circular buffer because newer function enter events may appear before older ones, making the trace inaccurate.

Also, what I failed to mention is that when we move to the next history buffer we copy the current thread's stack into it (so we get all the function enter events up to that point).
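A minimal sketch of that switch step (hypothetical names; `record` stands for whatever appends an event to the fresh buffer):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct EnterEvent { uint64_t method_id; };  // synthetic "function enter" record

// When the ring advances to a fresh history buffer, replay the thread's
// live call stack into it first, so the pseudo-stack can be rebuilt from
// that buffer alone.
void on_history_buffer_switch(const std::vector<EnterEvent>& live_stack,
                              const std::function<void(EnterEvent)>& record) {
    // Oldest frame first, so the replayed enter events nest correctly.
    for (const EnterEvent& frame : live_stack) {
        record(frame);
    }
}
```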

zakkak commented 3 weeks ago

> We cannot use it as a circular buffer because newer function enter events may appear before older ones, making the trace inaccurate.

I still don't get it. Newer events only appear before older ones if you don't treat the buffer as circular.

To give an example, let's assume a buffer of size 10. When you start tracing events you begin with the index at 0 and increase it each time an event occurs. Currently, when you reach 9 you reset the whole buffer and the index. What I suggest is that when you reach 9 you reset the index to 0 but treat anything after it as a past event. So after some more events, when the index is at 6 and you want to trace an event at 3, you treat anything between 7 and 3 (wrapping around at 9) as the past, and anything between 4 and 6 as the future (in relation to the event at 3).
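A minimal sketch of that ordering rule, using a hypothetical helper (`is_older`) rather than anything in MaTSa:

```cpp
#include <cstddef>

constexpr std::size_t N = 10;  // buffer size from the example above

// With `head` being the next slot to be written, each slot's "age" is its
// backwards distance from the most recent write; nothing is reset on
// wrap-around, and ordering is recovered purely from the ages.
bool is_older(std::size_t a, std::size_t b, std::size_t head) {
    std::size_t age_a = (head + N - 1 - a) % N;  // 0 = most recently written
    std::size_t age_b = (head + N - 1 - b) % N;
    return age_a > age_b;
}

// In the example above (slot 6 just written, so head == 7):
//   is_older(9, 3, 7) == true   // slot 9 is in the past of the event at 3
//   is_older(5, 3, 7) == false  // slot 5 is in its future
```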

alx01011 commented 3 weeks ago

We could still miss function entry events if they are at the beginning of the buffer (e.g., thread.start() or something similar). So there is no accuracy loss, but some loss of information. I am not sure which approach we should follow. In addition, with the ring of buffers, newer accesses may still preserve their traces on overflow, since we move to a history buffer that comes before the current one. (maybe a figure would be better here)

zakkak commented 3 weeks ago

> (maybe a figure would be better here)

Yes, a figure would help a lot, but before putting in the effort to draw it, it might be better to have a quick call first. Let's try tomorrow morning, if you are available.

alx01011 commented 3 weeks ago

> (maybe a figure would be better here)
>
> Yes, a figure would help a lot, but before putting in the effort to draw it, it might be better to have a quick call first. Let's try tomorrow morning, if you are available.

Sure, let's talk at 10 or later, whatever works best for you 🙂

alx01011 commented 3 weeks ago

So I just finished a draft of my proposal, along with your suggestion of the extra pointer for the current stack. I have tested some of the DaCapo benchmarks. For some I had missing traces, but managed to resolve that by increasing the size of the ring buffer. The sweet spot seems to be 2^12 (4096) buffers with 2^16 (65536) slots each (for func enter/exit events). Currently there is some overhead at startup due to the allocation. Overall, the overhead seems to be around 5%, especially in large applications.

zakkak commented 3 weeks ago

> The sweet spot seems to be 2^12 (4096) buffers with 2^16 (65536) slots each

Wow, is that 2 GB per thread??? 4096 × 65536 × 8 bytes (the slot size). That feels too high (especially considering it does not include the stack traces).

It would be interesting to keep track of how often the ring buffer becomes full. Does it happen every second? Every 5 seconds? Sooner than that?

Doing some rough guesstimates: the ring buffer is able to track ~270 million events. Assuming 1/10th of the instructions are call/ret/get/put (since we are in an interpreter, we probably pay quite a lot of bloat around each instruction), filling the ring buffer requires executing ~2.7 billion instructions. Assuming a 2.7 GHz CPU clock with IPC 1 (although the instructions here are not machine instructions but JVM opcodes), that would mean you are able to track a time window of ~1 second.
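Spelled out, that estimate is:

```
capacity     = 4096 buffers × 65536 slots             ≈ 2.7 × 10^8 events
instructions = 2.7 × 10^8 events × 10 instrs/event    ≈ 2.7 × 10^9 instructions
time window  = 2.7 × 10^9 instrs ÷ 2.7 × 10^9 instr/s ≈ 1 second
```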

> (for func enter/exit events)

and get/put events, no?

> [!NOTE]
> It is also worth keeping in the back of your mind that when you enable JIT compilation:
>
> 1. all these operations will happen quite a bit faster, meaning the ring buffer will fill faster (and GC will probably trigger more often);
> 2. the compiler will be able to eliminate some heap allocations, so you will have to track fewer get/puts.

alx01011 commented 3 weeks ago

> Wow, is that 2 GB per thread??? 4096 × 65536 × 8 bytes (the slot size). That feels too high (especially considering it does not include the stack traces).
>
> [...]
>
> and get/put events, no?

Get/put events are not stored in history, only func enter/exit.

Your math is correct, apart from the fact that the buffers are actually allocated with mmap (memory is reserved but not fully committed), which means 2 GB per thread is the maximum, not the typical, footprint. I was using the graphchi benchmark, for which we previously missed every previous stack trace. The peak memory usage I measured was around 10 GB, which is a bit higher than TSan's (around 8 GB).
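For illustration, a minimal sketch of that reserve-without-committing pattern (not MaTSa's actual allocation code), where physical pages are only faulted in when a buffer page is first touched:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Reserve the full address range (e.g. 4096 * 65536 * 8 bytes) up front;
// MAP_NORESERVE with an anonymous mapping means no physical memory or swap
// is committed until the pages are actually written.
void* reserve_history(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}
```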

I wonder if the buffers can be compressed to reduce the footprint even further.

We could possibly reduce the ring buffer size to 2048 (I am testing this now).

Keep in mind that the issue with missing info only appears in 1-2 benchmarks. For some others, even a ring size of 8 was sufficient. I think I should provide an extra command-line argument to configure the size and stick to a relatively small default (8 seems okay).

UPDATE:

There was a bug in the code: the ring buffer size and the event buffer size were increasing at the same rate, so the two were always equal. Having fixed that, I managed to run every benchmark with a ring buffer size of at most 2^8 (256), with 2^16 (65536) slots each.

alx01011 commented 2 weeks ago

@zakkak Since this is mostly resolved, any chance you have some time this week to talk about your suggestion regarding putting the shadow cells inside each oop (object)?

zakkak commented 2 weeks ago

> @zakkak Since this is mostly resolved, any chance you have some time this week to talk about your suggestion regarding putting the shadow cells inside each oop (object)?

Let's try next week. This week was a bit busy.