Open alx01011 opened 4 weeks ago
@zakkak Any thoughts on this? Or anything better to propose?
Do I get it right that the ring buffer will be per thread as well?
If so, why not use a single circular buffer?
> Do I get it right that the ring buffer will be per thread as well?
> If so, why not use a single circular buffer?
@zakkak Yes, you are right. The problem with a single circular buffer is accuracy. From the "Previous Stack Traces" section of the paper:
> If the buffer reaches maximum capacity, we set the current index to zero, practically invalidating the whole buffer to avoid inaccurate traces. We cannot use it as a circular buffer because newer function enter events may appear before older ones, making the trace inaccurate.
Also, what I failed to mention is that when we move to the next history buffer we copy the current thread's stack (so we get all the function enter events up until now).
> We cannot use it as a circular buffer because newer function enter events may appear before older ones, making the trace inaccurate.
I still don't get it. Newer events would only appear before older ones if you don't treat it as a circular buffer.
To give an example, let's assume a buffer of size 10. When you start tracing events you begin the index at 0 and increase it each time an event occurs. When you reach 9, what you do is reset the whole buffer and the index. What I suggest is that when you reach 9 you reset the index to 0 but treat anything after 0 as a past event. So after some more events, when the index is at 6 and you want to trace an event at 3, you treat anything between 7 and 3 (wrapping around at 9) as the past, and anything between 4 and 6 as the future (in relation to the event at 3).
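A minimal sketch of this wrap-around ordering, assuming a fixed-size array and a head index pointing at the next free slot; all names here are illustrative, not MaTSa's actual code:

```cpp
#include <cstddef>
#include <cstdint>

class CircularHistory {
  static const size_t CAPACITY = 10;  // matches the size-10 example above
  uint64_t buf_[CAPACITY];
  size_t head_ = 0;                   // next free slot; once the buffer has
                                      // wrapped, also the oldest surviving entry

public:
  void push(uint64_t event) {
    buf_[head_] = event;
    head_ = (head_ + 1) % CAPACITY;   // wrap instead of resetting everything
  }

  // True if slot i was written before slot j (assumes the buffer has wrapped):
  // an entry's age is its distance from head_, counted in write order.
  bool is_past(size_t i, size_t j) const {
    size_t age_i = (i + CAPACITY - head_) % CAPACITY;
    size_t age_j = (j + CAPACITY - head_) % CAPACITY;
    return age_i < age_j;
  }
};
```

With `head_` at 7 (the last write landed in slot 6), slot 3 has age 6, so slots 7 through 2 (ages 0-5) count as its past and slots 4 through 6 (ages 7-9) as its future, matching the example above.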
We could still miss function entry events if they are at the beginning of the buffer (e.g. thread.start() or something similar). So there is no accuracy loss, but some loss of information. I am not sure which approach we should follow. In addition, newer accesses may still preserve their traces in case of overflow while we move to a history buffer that comes before the current one. (maybe a figure will be better here)
> (maybe a figure will be better here)
Yes, a figure would help a lot, but before putting the effort into drawing it, it might be better to have a quick call first. Let's try tomorrow morning if you are available.
Sure, let's talk at 10 or later, whatever works best for you 🙂
So I just finished a draft of my proposal, along with your suggestion for the extra pointer to the current stack. I have tested some of the DaCapo benchmarks. For some I had missing traces, but I managed to resolve that by increasing the size of the ring buffer. The sweet spot seems to be 2^12 (4096) buffers with 2^16 (65536) slots each (for func enter/exit events). Currently there is some overhead at startup due to the allocation. Overall, the overhead seems to be around 5%, especially in large applications.
> The sweet spot seems to be 2^12 (4096) buffers with 2^16 (65536) slots each
Wow, is that 2GB per thread??? 4096 × 65536 × 8 bytes (the slot size). It feels too high (especially considering it's not including the stack traces).
It would be interesting to keep track of how often the ring buffer becomes full, does it happen every 1 second? every 5 seconds? Sooner than that?
Doing some rough guesstimates: the ring buffer is able to track ~270 million events. Assuming 1/10th of the instructions are call/ret/get/put (since we are in an interpreter we probably pay quite a lot of bloat around each instruction), to fill the ring buffer you need to execute ~2.7 billion instructions. Assuming a 2.7GHz CPU clock with IPC 1 (although the instructions here are not machine instructions but JVM opcodes), that would mean you are able to track a time window of ~1 second.
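Spelling the guesstimate out (the 1-in-10 event ratio and the 2.7GHz/IPC-1 clock are the assumptions stated above):

$$
2^{12} \times 2^{16} \approx 2.7 \times 10^{8}\ \text{events}, \qquad
2.7 \times 10^{8} \times 10 = 2.7 \times 10^{9}\ \text{instructions}, \qquad
\frac{2.7 \times 10^{9}}{2.7 \times 10^{9}\ \text{instr/s}} \approx 1\ \text{s}
$$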
> (for func enter/exit events)
and get/put events, no?
> [!Note]
> It is also worth keeping in the back of your mind that when you enable JIT compilation:
> - all these operations will be happening quite a bit faster (meaning the ring buffer will fill faster, and GC will probably trigger more often);
> - the compiler will be able to eliminate some heap allocations, so you will have to track fewer get/puts.
Get/put events are not stored in history, only func enter/exit.
Your math is correct, apart from the fact that the buffers are actually allocated with mmap (memory is reserved but not entirely committed), so the maximum size would be around 2GB per thread. I was using the graphchi benchmark, for which in the past we missed every previous stack trace. The peak memory usage I measured was around 10GB, which is a bit higher than TSan's (around 8GB).
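For reference, a minimal sketch of the reserve-now/commit-lazily pattern described above, using POSIX mmap/mprotect (the function names are hypothetical, not MaTSa's actual code):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Reserve address space without committing physical memory: PROT_NONE +
// MAP_NORESERVE pages cost no RAM until they are made accessible and touched.
void* reserve_history(size_t bytes) {
  void* p = mmap(nullptr, bytes, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return (p == MAP_FAILED) ? nullptr : p;
}

// Commit a sub-range on first use, e.g. when the ring advances to a buffer.
bool commit(void* addr, size_t bytes) {
  return mprotect(addr, bytes, PROT_READ | PROT_WRITE) == 0;
}
```

Under this scheme the ~2GB per-thread figure is an address-space reservation; resident memory grows only with the buffers a thread actually touches.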
I wonder if the buffers can be compressed to reduce the footprint even further.
We could possibly reduce the ring buffer size to 2048 (I am testing this now).
Keep in mind that the issue with missing info only appears in 1-2 benchmarks. For some others, even a size of 8 was sufficient. I think I should provide an extra command line argument to configure the size and stick to a relatively small number as the default (8 seems okay).
UPDATE:
There was a bug in the code: the ring buffer size and the event buffer size were increasing at the same rate, meaning their sizes were always equal. Having fixed that, I managed to run every benchmark with a maximum ring buffer size of 2^8, with 2^16 slots each.
@zakkak Since this is mostly resolved, any chance you have some time this week to talk about your suggestion regarding putting the shadow cells inside each oop (object)?
> @zakkak Since this is mostly resolved, any chance you have some time this week to talk about your suggestion regarding putting the shadow cells inside each oop (object)?
Let's try next week. This week was a bit busy.
Currently, MaTSa relies on a per-thread history buffer that records each field access and function enter/exit event. Its size is configurable via the environment variable MATSA_HISTORY.
Unfortunately the buffer can fill up really fast, especially when a field is accessed in a loop with many iterations. This became a problem in some benchmarks because the previous trace was missing.
The solution I propose involves keeping a ring buffer of history buffers (and an index to indicate the currently used buffer). Each history buffer will have a configurable size and an epoch. In case of overflow, we increment the index of the current active buffer in the larger ring buffer. This moves us to the next history buffer, in which case we have to overwrite what was previously stored and increment its epoch.
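A minimal sketch of that scheme, with illustrative names and fixed sizes (in practice the buffers would be mmap-reserved as discussed above, and the sizes configurable):

```cpp
#include <cstddef>
#include <cstdint>

// One history buffer: a bump-allocated event array plus an epoch that is
// incremented every time the buffer is recycled, invalidating stale references.
struct HistoryBuffer {
  uint64_t epoch = 0;
  size_t   index = 0;        // next free slot
  uint64_t events[1 << 16];
};

// Per-thread ring of history buffers with an index to the active one.
struct ThreadHistory {
  static const size_t RING_SIZE = 1 << 8;
  HistoryBuffer buffers[RING_SIZE];
  size_t current = 0;

  void record(uint64_t event) {
    HistoryBuffer* hb = &buffers[current];
    if (hb->index == (1 << 16)) {          // overflow: advance the ring,
      current = (current + 1) % RING_SIZE;
      hb = &buffers[current];
      hb->index = 0;                       // overwriting old contents
      hb->epoch++;                         // and bumping the epoch
      // (the actual proposal also copies the thread's current stack into
      //  the fresh buffer, so function enter events are not lost)
    }
    hb->events[hb->index++] = event;
  }
};
```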
On each access I propose storing 8 additional buffers in shadow; the layout could look like this (can be configurable):
Where:
Why not use a single large history buffer instead? Well, mainly because when the ring buffer wraps around, it takes us back to the beginning, meaning younger accesses (in later buffers) will remain for a little longer. Another argument is that we keep each buffer smaller, so it is faster to traverse and to formulate the pseudo-stack.
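For illustration, a hedged sketch of what traversing one buffer to formulate the pseudo-stack could look like, assuming events are tagged enter/exit records (the encoding and names are hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Kind : uint8_t { ENTER, EXIT };

struct Event {
  Kind     kind;
  uint64_t method_id;  // identifies the entered/exited method
};

// Replay func enter/exit events up to position `upto` to recover the
// call stack as it looked at that point in the buffer.
std::vector<uint64_t> pseudo_stack(const std::vector<Event>& events,
                                   std::size_t upto) {
  std::vector<uint64_t> stack;
  for (std::size_t i = 0; i < upto && i < events.size(); i++) {
    if (events[i].kind == Kind::ENTER)
      stack.push_back(events[i].method_id);
    else if (!stack.empty())
      stack.pop_back();
  }
  return stack;
}
```

Smaller buffers keep this replay short, which is the traversal-speed argument above.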
Future work: