improve worst-case memory tracking performance

gbtitus commented 6 years ago

Summary of Problem

See #10396, specifically this comment, which points out a case where memory tracking increases execution time by 100X.

The record-keeping for memory tracking adds only a marginal cost to the already expensive operations of allocation and deallocation. But as a side effect of concurrency control on its internal data structures it also effectively serializes those operations within each node. This can slow overall performance by a much greater amount than the actual tracking does. Looking at the code, there are a number of aspects of it that could lead to poor worst-case performance:

- The mutual exclusion also covers operations that don't touch shared data, such as getting table entries from the system allocator, and so don't need such protection.
- The concurrency control is done entirely by mutual exclusion on a single shared data structure. There is no conflict avoidance, for example by maintaining multiple hash tables.
- The cost of getting table entries from the system allocator is fully visible. There is no attempt to amortize it by, for example, bulk allocation or caching.
- The hash table is shrunk much less aggressively than it is grown, but shrinking it may not be a good idea anyway. Doing so saves some space, but not very much in the grand scheme of things, and both the shrink itself and any future re-growth that results are definitely expensive. The space savings don't seem worth the time cost.
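To make the points above concrete, here is a minimal sketch in C of what a less contended tracking table could look like. It is not the Chapel runtime's code: every name in it (`track_init`, `track_add`, `track_remove`, `NUM_STRIPES`, `BUCKETS_PER_STRIPE`) is hypothetical, and the sizes are arbitrary. It assumes entries are keyed by address, splits the table into independently locked stripes, calls `malloc` for a new entry only while the lock is released, caches freed entries on a per-stripe free list, and never shrinks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_STRIPES 16            /* hypothetical: number of independent tables */
#define BUCKETS_PER_STRIPE 1024   /* hypothetical: fixed size, never shrunk */

typedef struct entry {
    void *addr;           /* tracked allocation */
    size_t size;          /* its size in bytes */
    struct entry *next;   /* bucket chain or free-list link */
} entry_t;

typedef struct {
    pthread_mutex_t lock;                 /* protects only this stripe */
    entry_t *buckets[BUCKETS_PER_STRIPE];
    entry_t *free_list;                   /* cached entries for reuse */
} stripe_t;

static stripe_t stripes[NUM_STRIPES];     /* zero-initialized by the C runtime */

static void track_init(void) {
    for (int i = 0; i < NUM_STRIPES; i++)
        pthread_mutex_init(&stripes[i].lock, NULL);
}

static uintptr_t hash_addr(const void *addr) {
    uintptr_t h = (uintptr_t)addr >> 4;   /* drop low alignment bits */
    return h ^ (h >> 16);
}

/* Record an allocation. Only the stripe that owns addr is locked. */
static void track_add(void *addr, size_t size) {
    uintptr_t h = hash_addr(addr);
    stripe_t *s = &stripes[h % NUM_STRIPES];
    entry_t **bucket = &s->buckets[(h / NUM_STRIPES) % BUCKETS_PER_STRIPE];

    pthread_mutex_lock(&s->lock);
    entry_t *e = s->free_list;
    if (e != NULL) {
        s->free_list = e->next;           /* reuse a cached entry */
    } else {
        /* Release the lock while calling the system allocator. */
        pthread_mutex_unlock(&s->lock);
        e = malloc(sizeof *e);
        assert(e != NULL);
        pthread_mutex_lock(&s->lock);
    }
    e->addr = addr;
    e->size = size;
    e->next = *bucket;
    *bucket = e;
    pthread_mutex_unlock(&s->lock);
}

/* Forget an allocation. Returns its recorded size, or 0 if it is unknown. */
static size_t track_remove(void *addr) {
    uintptr_t h = hash_addr(addr);
    stripe_t *s = &stripes[h % NUM_STRIPES];
    entry_t **link = &s->buckets[(h / NUM_STRIPES) % BUCKETS_PER_STRIPE];
    size_t size = 0;

    pthread_mutex_lock(&s->lock);
    while (*link != NULL && (*link)->addr != addr)
        link = &(*link)->next;
    if (*link != NULL) {
        entry_t *e = *link;
        size = e->size;
        *link = e->next;                  /* unlink from the bucket chain */
        e->next = s->free_list;           /* cache it instead of freeing */
        s->free_list = e;
    }
    pthread_mutex_unlock(&s->lock);
    return size;
}
```

A caller would run `track_init()` once, then call `track_add(p, n)` after each allocation and `track_remove(p)` before each free, building with `-pthread`. Tasks whose addresses hash to different stripes never contend with each other, which is the conflict avoidance the list asks for. Whether 16 stripes or a per-stripe free list is the right trade-off for the runtime would need measuring.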

This seems like an area where a few days' work could improve worst-case performance by a lot, which would make memory tracking practical to use in more situations.

Steps to Reproduce

Source Code:

From the Chapel test suite: test/studies/shootout/submitted/binarytrees.chpl

Compile command:

chpl --fast ...

Execution command:

time ./a.out --n=18 on a lightly loaded 24-core Linux system showed about 0.5 sec of user time without --memTrack and over 50 secs with it. Reducing the n value produces faster runs that still show the effect, though less dramatically. For example, with n=16 the user times were about 0.15s and 9s, respectively.

Configuration Information

Output of chpl --version:

chpl version 1.18.0 pre-release (0845395142)

Output of $CHPL_HOME/util/printchplenv --anonymize:

```
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: intrinsics
CHPL_GMP: none
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_AUX_FILESYS: none
```

Back-end compiler and version, e.g. gcc --version or clang --version:
```
gcc (GCC) 6.2.0
```

gbtitus commented 6 years ago

Having created this issue I'm now assigning it to myself, because I've already started doing a little work on it in my spare time.

gbtitus commented 2 years ago

From a practical standpoint https://github.com/chapel-lang/chapel/pull/18465 helps with what users have been experiencing, but as of the date of this comment the specific case of release/examples/benchmarks/shootout/binarytrees still shows a ~100x increase in user time with --memTrack. So, I'm removing the "spike:" prefix from the title and removing myself as an assignee, but I don't think this should be closed yet. (Unless we won't fix it, but I'll let others decide on that.)
