comgr-related slowdown seen in TensorFlow

Hi,

In ROCm 2.6, we're seeing a significant startup-related slowdown in TensorFlow. For our TF CI, this adds over an hour to each of our build+test cycles.

Below is what we see when running tf_cnn_benchmarks using the 'trivial' model:

- The benchmark's performance metrics (images/sec) are nearly identical in ROCm 2.5 and ROCm 2.6.
- The wall clock times show a startup performance regression in ROCm 2.6.
- CPU profiler data shows additional entries for comgr YAML calls in ROCm 2.6.

ROCm 2.5

wall clock (sec): 19.36
total images/sec: 7213.54

Overhead  Command          Shared Object                                      Symbol                                                                                                                                                                                          
  10.05%  tf_cnn_benchmar  libhip_hcc.so                                      [.] hip_impl::read<__gnu_cxx::__normal_iterator<char*, std::vector<char, std::allocator<char> > > >
   5.48%  tf_cnn_benchmar  [unknown]                                          [k] 0xffffffff8b792dec
   4.20%  tf_cnn_benchmar  [unknown]                                          [k] 0xffffffff8b7929b7
   2.90%  tf_cnn_benchmar  [unknown]                                          [k] 0xffffffff8b8015a0
   2.54%  python3          [unknown]                                          [k] 0xffffffff8ba03003
   2.03%  python3          [unknown]                                          [k] 0xffffffff8ae372ca
   1.46%  python3          libopenblasp-r0-39a31c03.2.18.so                   [.] blas_thread_server
   1.38%  python3          libopenblasp-r0-2ecf47d5.3.7.dev.so                [.] blas_thread_server
   1.34%  tf_cnn_benchmar  libc-2.23.so                                       [.] 0x000000000014dc76
   1.32%  python3          libc-2.23.so                                       [.] __sched_yield

ROCm 2.6

wall clock (sec): 56.83
total images/sec: 7238.64
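To put the two runs side by side, here is a minimal sketch of the arithmetic. It uses only the wall clock and images/sec figures reported in this issue. The `runs` dictionary and the `compare` helper are illustrative names, not part of tf_cnn_benchmarks.

```python
# Figures copied from the two runs reported in this issue.
runs = {
    "ROCm 2.5": {"wall_clock_sec": 19.36, "images_per_sec": 7213.54},
    "ROCm 2.6": {"wall_clock_sec": 56.83, "images_per_sec": 7238.64},
}


def compare(old, new):
    """Return (extra seconds, wall clock ratio, throughput change in %)."""
    extra = new["wall_clock_sec"] - old["wall_clock_sec"]
    ratio = new["wall_clock_sec"] / old["wall_clock_sec"]
    throughput_change_pct = (
        100.0 * (new["images_per_sec"] - old["images_per_sec"]) / old["images_per_sec"]
    )
    return extra, ratio, throughput_change_pct


extra_sec, slowdown, throughput_pct = compare(runs["ROCm 2.5"], runs["ROCm 2.6"])
print(f"extra wall clock: {extra_sec:.2f} s")          # 37.47 s
print(f"wall clock ratio: {slowdown:.2f}x")            # 2.94x
print(f"throughput change: {throughput_pct:+.2f}%")    # +0.35%
```

The steady-state throughput differs by well under one percent, while the total run takes roughly 37 seconds longer. That fits with the extra time being spent outside the measured benchmark loop, at startup.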

Overhead  Command          Shared Object                                      Symbol                                                                                                                                                         
  19.34%  tf_cnn_benchmar  libamd_comgr.so                                    [.] YAML::RegEx::MatchUnchecked<YAML::StreamCharSource>
   4.49%  tf_cnn_benchmar  libhip_hcc.so                                      [.] hip_impl::read<__gnu_cxx::__normal_iterator<char*, std::vector<char, std::allocator<char> > > >
   2.82%  tf_cnn_benchmar  [unknown]                                          [k] 0xffffffff8b792dec
   2.31%  tf_cnn_benchmar  libamd_comgr.so                                    [.] YAML::ScanScalar[abi:cxx11]
   1.94%  tf_cnn_benchmar  [unknown]                                          [k] 0xffffffff8b7929b7
   1.75%  tf_cnn_benchmar  libamd_comgr.so                                    [.] YAML::Stream::StreamInUtf8
   1.73%  tf_cnn_benchmar  libc-2.23.so                                       [.] malloc
   1.67%  tf_cnn_benchmar  libamd_comgr.so                                    [.] YAML::Stream::_ReadAheadTo
   1.60%  tf_cnn_benchmar  [unknown]                                          [k] 0xffffffff8b8015a0
   1.29%  tf_cnn_benchmar  libamd_comgr.so                                    [.] std::_Rb_tree<std::shared_ptr<YAML::detail::node>, std::shared_ptr<YAML::detail::node>, std::_Identity<std::shared_ptr<YAML::detail::node> >, std::less<std
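As a rough way to see how much of the ROCm 2.6 profile is attributable to comgr, the rows above can be summed per shared object. This is a sketch over only the ten rows pasted in this issue, so the totals are lower bounds. `ROCM_26_ROWS` and `overhead_by_object` are illustrative names.

```python
from collections import defaultdict

# Rows copied from the ROCm 2.6 perf report above: (overhead %, shared object).
# Symbol names are omitted because only the per-library total matters here.
ROCM_26_ROWS = [
    (19.34, "libamd_comgr.so"),
    (4.49, "libhip_hcc.so"),
    (2.82, "[unknown]"),
    (2.31, "libamd_comgr.so"),
    (1.94, "[unknown]"),
    (1.75, "libamd_comgr.so"),
    (1.73, "libc-2.23.so"),
    (1.67, "libamd_comgr.so"),
    (1.60, "[unknown]"),
    (1.29, "libamd_comgr.so"),
]


def overhead_by_object(rows):
    """Sum the overhead percentages for each shared object."""
    totals = defaultdict(float)
    for pct, shared_object in rows:
        totals[shared_object] += pct
    return dict(totals)


totals = overhead_by_object(ROCM_26_ROWS)
# Print the libraries from largest to smallest share of samples.
for shared_object, pct in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pct:6.2f}%  {shared_object}")
```

Among the rows shown, libamd_comgr.so accounts for about 26.4% of samples in ROCm 2.6, all in YAML parsing symbols. It does not appear at all in the top ten rows of the ROCm 2.5 profile.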

Reverting to the comgr package from ROCm 2.5 had no effect, which suggests the problem lies in a caller of comgr rather than in comgr itself.

Please help us determine whether this is a comgr issue, a HIP issue, or something else. We don't have a clear picture of which components call into comgr.

Many thanks,

Jeff

ROCm / ROCm-CompilerSupport

comgr-related slowdown seen in TensorFlow #13