Significantly reduce profiling overhead

Avoid calling runtime.GoroutineProfile() twice for every profiling sample by reusing a dynamically growing slice.
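For illustration, here is a minimal sketch of the buffer-reuse idea. It is not the code from this patch: the `profiler` type, the `goroutineProfile` method, and the growth factor are hypothetical. Only `runtime.GoroutineProfile`, `runtime.NumGoroutine`, and `runtime.StackRecord` are real standard library names.

```go
package main

import (
	"fmt"
	"runtime"
)

// profiler keeps a stack record buffer alive between samples.
// (Hypothetical type, used here only for illustration.)
type profiler struct {
	stacks []runtime.StackRecord
}

// goroutineProfile fills the reusable buffer with the current goroutine
// stacks. In the common case this needs a single call to
// runtime.GoroutineProfile; the buffer is only reallocated when the
// number of goroutines has outgrown it.
func (p *profiler) goroutineProfile() []runtime.StackRecord {
	if p.stacks == nil {
		// Initial size with some headroom (the headroom is an assumption).
		p.stacks = make([]runtime.StackRecord, runtime.NumGoroutine()+10)
	}
	for {
		n, ok := runtime.GoroutineProfile(p.stacks)
		if ok {
			return p.stacks[:n]
		}
		// Buffer too small: grow it by roughly 10% over the reported
		// count and try again.
		p.stacks = make([]runtime.StackRecord, int(float64(n)*1.1)+1)
	}
}

func main() {
	p := &profiler{}
	for i := 0; i < 3; i++ {
		records := p.goroutineProfile()
		fmt.Printf("sample %d: %d goroutines, buffer len %d\n", i, len(records), len(p.stacks))
	}
}
```

The returned slice aliases the internal buffer, so in this sketch a caller has to finish processing one sample before taking the next.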

On my machine, this reduces the average time the world is stopped from 50 µsec to 25 µsec per sample. Whether this translates into a 2x overhead reduction in the real world will require further testing.

Additionally, this patch uses a roughly 100x more efficient approach for counting the stacks. The old approach took 8 µsec per aggregation; the new one takes about 90 ns.

felixge / fgprof

Significantly reduce profiling overhead #6