Avoid calling runtime.GoroutineProfile() twice for every profiling
sample by reusing a dynamically growing slice.
On my machine this reduces the average time the world is stopped from 50
µsec to 25 µsec per sample. Whether this translates into a 2x overhead
reduction in the real world will require further testing.
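
A minimal sketch of the slice-reuse idea, not the patch itself. The
profiler type and its sample method are hypothetical names; the only
real API used is runtime.GoroutineProfile, which returns ok == false
together with the required length when the slice passed in is too small.

```go
package main

import (
	"fmt"
	"runtime"
)

// profiler is a hypothetical type that keeps its record slice alive
// between samples so that most samples need no allocation.
type profiler struct {
	records []runtime.StackRecord
}

// sample returns the stacks of all current goroutines. In the common
// case it calls runtime.GoroutineProfile exactly once; it retries only
// when the reused slice turned out to be too small.
func (p *profiler) sample() []runtime.StackRecord {
	for {
		n, ok := runtime.GoroutineProfile(p.records)
		if ok {
			// Return a view of the filled part, but keep p.records at
			// full length: GoroutineProfile looks at len, not cap.
			return p.records[:n]
		}
		// Too small: grow with about 10% headroom (an arbitrary choice
		// for this sketch) so a few new goroutines do not force
		// another retry on the next sample.
		p.records = make([]runtime.StackRecord, int(float64(n)*1.1)+1)
	}
}

func main() {
	p := &profiler{}
	first := p.sample()  // grows the slice, then fills it
	second := p.sample() // normally a single call, no allocation
	fmt.Println(len(first) > 0, len(second) > 0)
}
```

Only the first sample, or a sample taken after the goroutine count has
outgrown the slice, pays for a second call.
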
Additionally, this patch uses a roughly 100x more efficient approach for
counting the stacks. The old approach took 8 µsec per aggregation; the
new one takes about 90 ns.