brendangregg / FlameGraph

Stack trace visualizer
http://www.brendangregg.com/flamegraphs.html

Sort Data in-place and consume Node on the fly to use less memory #317

Open Krinkle opened 1 year ago

Krinkle commented 1 year ago

See https://phabricator.wikimedia.org/T315056

Use in-place sort for @Data and consume $Node on the fly to use less memory.

This helped stave off out-of-memory errors in cron scripts that build SVGs from large log files.

Re-creates PR https://github.com/brendangregg/FlameGraph/pull/298/ by @AaronSchulz.

Ref https://github.com/brendangregg/FlameGraph/issues/58.
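For illustration, here is a minimal sketch of the pattern in Python; the actual patch is to flamegraph.pl (Perl), which sorts @Data in place and consumes entries while building the $Node tree, so the names and input lines below are hypothetical. The idea is to sort the existing array rather than allocate a sorted copy, and to release each entry as soon as it has been folded into the tree:

    # Illustration only -- not the actual patch (flamegraph.pl is Perl).
    data = [
        "main;init 2",          # hypothetical collapsed stack lines
        "main;run;parse 5",
        "main;run;render 3",
    ]

    # sorted(data) would briefly hold two full copies of the input;
    # list.sort() reorders the same list object in place instead.
    data.sort()

    for i, line in enumerate(data):
        stack, samples = line.rsplit(" ", 1)
        # ... fold `stack` into the frame tree here ...
        data[i] = None  # drop the reference so the line can be freed early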

HarrySidhuz0008 commented 1 year ago

Sorting data in-place while consuming nodes on the fly is a common approach to efficiently manage large datasets in situations where you want to minimize memory usage. This is often used in scenarios like external sorting (sorting data that doesn't fit entirely in memory) and processing large streams of data. To achieve this, you can use techniques like merge sort or other in-place sorting algorithms. Below is a simplified example using Python to demonstrate this concept:

In-place Merge Sort

    def merge_sort(arr):
        if len(arr) > 1:
            mid = len(arr) // 2
            left_half = arr[:mid]
            right_half = arr[mid:]

            merge_sort(left_half)
            merge_sort(right_half)

            i = j = k = 0

            while i < len(left_half) and j < len(right_half):
                if left_half[i] < right_half[j]:
                    arr[k] = left_half[i]
                    i += 1
                else:
                    arr[k] = right_half[j]
                    j += 1
                k += 1

            while i < len(left_half):
                arr[k] = left_half[i]
                i += 1
                k += 1

            while j < len(right_half):
                arr[k] = right_half[j]
                j += 1
                k += 1

Example usage

    data = [4, 2, 9, 6, 1, 5, 8, 7, 3]
    merge_sort(data)

    for item in data:
        # Process and consume each sorted element one by one
        print(item)

matthew-olson-intel commented 1 year ago

I measured both the performance and maximum RSS (as reported by GNU time) of this patch against a synthetic dataset, and it seems to reduce maximum RSS by about 2% while increasing runtime by about 5.4%:

config: def
  avg: 5.544 ± 0.043
  max rss: 227.585 ± 0.271
  minor page faults: 36723.600 ± 62893.440
  major page faults: 45.000 ± 0.000
config: free
  avg: 5.848 ± 0.092
  max rss: 223.104 ± 0.310
  minor page faults: 35996.200 ± 175294.560
  major page faults: 45.000 ± 0.000
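For reference, a rough sketch of one way such a runtime and peak-RSS measurement could be reproduced; this is not the harness used for the numbers above (GNU time with -v reports "Maximum resident set size" directly), and the input file name is only an example:

    import resource
    import subprocess
    import time

    start = time.perf_counter()
    with open("out.svg", "w") as out:
        subprocess.run(
            ["perl", "./flamegraph.pl", "stacks.collapsed"],  # example input
            stdout=out,
            check=True,
        )
    elapsed = time.perf_counter() - start

    # On Linux, ru_maxrss for waited-for children is reported in kilobytes.
    max_rss_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print(f"runtime: {elapsed:.3f} s, max RSS: {max_rss_kb / 1024:.1f} MB")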

These ratios may change significantly with different input; I see in your Phabricator posts that you're using significantly larger inputs. Is there a way I can download 2022-08-24.excimer.all.log (or a newer version of it) to try this out?

Krinkle commented 1 year ago

@matthew-olson-intel Yes! The raw data behind the visualisations at https://performance.wikimedia.org/php-profiling/ is published daily at https://performance.wikimedia.org/arclamp/logs/daily/.

The files with the .excimer.all.log suffix are the largest ones. I recommend using the ones from yesterday or earlier, as the current day's file is still incomplete.

matthew-olson-intel commented 1 year ago

@Krinkle OK, thanks -- I ran another round of experiments. These are all means over 10 iterations per config (runtime in seconds, RSS in MB).

config: def
  avg: 152.025 ± 1.992
  max rss: 5974.135 ± 0.377
  minor page faults: 1502767.100 ± 138277.290
  major page faults: 45.000 ± 0.000
config: free
  avg: 163.438 ± 15.326
  max rss: 3134.559 ± 0.167
  minor page faults: 783530.200 ± 1217912.160
  major page faults: 45.000 ± 0.000

Seems to have changed things quite a lot: roughly a 48% reduction in max RSS (5974 MB down to 3135 MB) for a 7.5% hit to runtime. A clear win for this input, IMO.