Poor present table performance #45077

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR46107
Status NEW
Importance P enhancement
Reported by Christopher Daley (csdaley@lbl.gov)
Reported on 2020-05-27 13:00:50 -0700
Last modified on 2020-07-16 15:11:26 -0700
Version unspecified
Hardware PC Linux
CC csdaley@lbl.gov, jdoerfert@anl.gov, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments slow-present-table.c (4237 bytes, text/x-csrc)
Blocks
Blocked by
See also
Created attachment 23545
The benchmark that reveals slow present table performance

It takes a long time both to add new entries to the OpenMP present table and
to access pre-existing entries. I have attached a benchmark that captures the
data-management requirements of the HPGMG mini-app. The benchmark shows that
adding new entries with [:0] is slow (see the "Device init" code section) and
that retrieving data is slow (see the "target update from" code section).
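
A minimal sketch of the pattern, with illustrative names only (the attached
slow-present-table.c is the authoritative source): code version #1 maps each
block's payload and then attaches the device pointer with a zero-length
array section, and the "target update from" section copies each block back.

typedef struct { double *data; } block_t;    /* illustrative layout */
typedef struct { block_t *blocks; } level_t;

void device_init_v1(level_t *levels, int num_levels,
                    int num_blocks, int block_size)
{
  /* Map the level and block descriptor arrays so the struct members
     exist on the device. */
  #pragma omp target enter data map(to: levels[:num_levels])
  for (int l = 0; l < num_levels; ++l) {
    #pragma omp target enter data map(to: levels[l].blocks[:num_blocks])
    for (int b = 0; b < num_blocks; ++b) {
      double *buf = levels[l].blocks[b].data;
      /* Map the payload, then attach the device pointer to the device
         copy of blocks[b].data via a zero-length array section.  Each
         [:0] map adds entries to libomptarget's present/shadow-pointer
         tables. */
      #pragma omp target enter data map(to: buf[:block_size])
      #pragma omp target enter data map(to: levels[l].blocks[b].data[:0])
    }
  }
}

void update_from_device(level_t *levels, int l,
                        int num_blocks, int block_size)
{
  /* The "target update from" section pulls each block back to the host. */
  for (int b = 0; b < num_blocks; ++b) {
    double *buf = levels[l].blocks[b].data;
    #pragma omp target update from(buf[:block_size])
  }
}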

I have two code versions: code version #1 uses [:0] to attach device pointers,
while code version #2 manually attaches device pointers inside OpenMP target
regions (a sketch of this workaround follows the measurements below). I have
tested 4 configurations using LLVM/Clang-11 from Apr 9 2020. Configurations 1
and 2 test the case where the present table is small for both code versions:
the effective bandwidth of the "target update from" directive is 3.6 and 3.8
GB/s, respectively. Configurations 3 and 4 test the case where the present
table can be large for both code versions: the effective bandwidth in
configuration 3 is only 0.2 GB/s! The workaround code in configuration 4
achieves 3.8 GB/s; however, the time to initialize the data structure on the
device is more than 20x slower for both configurations 3 and 4, even though
the total problem size is identical for all 4 configurations. The full data is:

+ clang -Wall -Werror -Ofast -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda slow-present-table.c -o slow-present-table

Small present table configuration. Both code versions achieve 3.6 and 3.8 GB/s.
+ srun -n 1 ./slow-present-table 100 100 10000 1
num_levels=100, num_blocks=100, block_size=10000, mem=0.745058 GB, code_version=1
Host init time=0.244111 seconds
Device init time=0.747118 seconds
Device kernel time=0.012566 seconds
Transfers=100, Time=0.208275 seconds, Data=0.745058 GB, Bandwidth=3.577283 GB/s
SUCCESS
+ srun -n 1 ./slow-present-table 100 100 10000 2
num_levels=100, num_blocks=100, block_size=10000, mem=0.745058 GB, code_version=2
Host init time=0.228604 seconds
Device init time=0.568662 seconds
Device kernel time=0.013114 seconds
Transfers=100, Time=0.198643 seconds, Data=0.745058 GB, Bandwidth=3.750740 GB/s
SUCCESS

Large present table configuration. Code version #1 is an order of magnitude
slower than code version #2 for the data transfer!!!
+ srun -n 1 ./slow-present-table 100 10000 100 1
num_levels=100, num_blocks=10000, block_size=100, mem=0.745058 GB, code_version=1
Host init time=0.234948 seconds
Device init time=38.181732 seconds
Device kernel time=0.058576 seconds
Transfers=100, Time=3.147222 seconds, Data=0.745058 GB, Bandwidth=0.236735 GB/s
SUCCESS
+ srun -n 1 ./slow-present-table 100 10000 100 2
num_levels=100, num_blocks=10000, block_size=100, mem=0.745058 GB, code_version=2
Host init time=0.236912 seconds
Device init time=20.666237 seconds
Device kernel time=0.056635 seconds
Transfers=100, Time=0.197634 seconds, Data=0.745058 GB, Bandwidth=3.769888 GB/s
SUCCESS
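
For comparison, a minimal sketch of the code-version-#2 workaround, again with
illustrative names (level_t/block_t as in the sketch above): rather than
relying on [:0] attachment, the device-side pointers are patched by hand
inside a small target region, so, per the discussion below, the shadow-pointer
map is only touched temporarily.

void device_init_v2(level_t *levels, int num_levels,
                    int num_blocks, int block_size)
{
  #pragma omp target enter data map(to: levels[:num_levels])
  for (int l = 0; l < num_levels; ++l) {
    #pragma omp target enter data map(to: levels[l].blocks[:num_blocks])
    for (int b = 0; b < num_blocks; ++b) {
      double *buf = levels[l].blocks[b].data;
      #pragma omp target enter data map(to: buf[:block_size])
      /* Manual attachment: inside the target region, buf is translated
         to its device address (its payload is already present), and that
         address is stored into the device copy of blocks[b].data.
         levels is translated to its device copy via the implicit
         zero-length mapping of referenced pointers. */
      #pragma omp target map(to: buf[:0])
      levels[l].blocks[b].data = buf;
    }
  }
}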

Personal discussion with Johannes Doerfert: "I looked at the code we run in the
slow version and, without profiling, I suspect the problem is that we have 3
entries for each mapped "ptr[:0]" in the Device.ShadowPtrMap. In the other
version we have one entry *temporarily* in there. At some point, I suspect,
this std::map becomes large and dealing with it slows everything down. It is
unclear whether we really need these mappings or not. If we do, we could
potentially investigate a more scalable data structure. I can imagine the init
is slow because the map is built, and the update is slow because we iterate the
map for each update. Maybe there is also some overhead we introduce by going
through a few dynamic libraries trying to allocate and copy 0 bytes of data for
the map with the empty array section."
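
To put the map size in perspective, a back-of-envelope count, assuming one
attached pointer per block and (per the comment above) roughly 3 shadow
entries per mapped "ptr[:0]"; these are illustrative estimates, not measured
map sizes:

#include <stdio.h>

int main(void) {
  /* num_levels * num_blocks attached pointers per configuration. */
  long small = 100L * 100;    /* configurations 1 and 2 */
  long large = 100L * 10000;  /* configurations 3 and 4 */
  printf("small table: ~%ld [:0] maps -> ~%ld shadow entries\n",
         small, 3 * small);
  printf("large table: ~%ld [:0] maps -> ~%ld shadow entries\n",
         large, 3 * large);
  return 0;
}

That is, configurations 3 and 4 put roughly 100x more entries into the map
than configurations 1 and 2 for the same total problem size.
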
Quuxplusone commented 4 years ago

Attached slow-present-table.c (4237 bytes, text/x-csrc): The benchmark that reveals slow present table performance

Quuxplusone commented 4 years ago

Was this fixed/impacted by https://reviews.llvm.org/D82264 at all?

Quuxplusone commented 4 years ago
(In reply to Johannes Doerfert from comment #1)
> Was this fixed/impacted by https://reviews.llvm.org/D82264 at all?

No. I just tested LLVM/Clang from 7 days ago: clang version 11.0.0
(https://github.com/llvm/llvm-project.git
469da663f2df150629786df3f82c217062924f5e).

When the present table is large, the performance of "#pragma omp target update
from" is bad. In the test program attached to the bug report, the configuration
with poor effective bandwidth is "./slow-present-table 100 10000 100 1".

All of the time is lost in std::_Rb_tree_increment. The call path that I see
from my profiler shows "__tgt_target_data_update" ->
"target_data_update(DeviceTy&, int, void**, void**, long*, long*)" ->
"std::_Rb_tree_increment(std::_Rb_tree_node_base*)".

Thanks,
Chris