argonne-lcf / THAPI

A tracing infrastructure for heterogeneous computing applications.

`xprof` in ruby #162

Closed TApplencourt closed 8 months ago

TApplencourt commented 9 months ago

Feedback is welcome now, before the PR starts getting too big...

TApplencourt commented 9 months ago

Ready for your review @Kerilk. Less code than before, but still large, sorry. It's more or less a 1-to-1 mapping of the old .sh, but in Ruby.

I'm using Ruby's nice Logging capability. Feedback on the usage of Open3 and stuff is appreciated.

We may need to add some teardown to handle cases where the app passed as an argument crashes.
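A minimal sketch of the teardown idea, assuming the wrapper launches the application through `Open3.capture3` and logs through the standard `Logger`. The `run_traced` helper and the `teardown:` callback are hypothetical names for illustration, not code from the PR:

```ruby
require 'open3'
require 'logger'
require 'rbconfig'

LOGGER = Logger.new($stderr)
LOGGER.level = Logger::INFO

# Hypothetical helper: run `cmd` (an array of strings), always call the
# teardown callback afterwards, and return the exit code of the command
# (nil if the command was killed by a signal).
def run_traced(cmd, teardown:)
  LOGGER.info { "Launching: #{cmd.join(' ')}" }
  stdout, stderr, status = Open3.capture3(*cmd)
  LOGGER.debug { stdout } unless stdout.empty?
  LOGGER.warn { stderr } unless stderr.empty?
  LOGGER.error { "Command failed: #{status}" } unless status.success?
  status.exitstatus
ensure
  # Runs whether the command succeeded, failed, or could not be started.
  teardown.call
end

# Usage: the child exits with code 3, and the teardown still runs.
cleaned = false
code = run_traced([RbConfig.ruby, '-e', 'exit 3'], teardown: -> { cleaned = true })
```

Putting the cleanup in `ensure` means it also runs when `Open3.capture3` itself raises, for example when the executable does not exist.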

Kerilk commented 9 months ago

Aren't you missing OpenMP, CUDA, and HIP support?

TApplencourt commented 9 months ago
  h = Hash.new { |h, k| h[k] = [] }
  [%w[opencl cl libOpenCL libTracerOpenCL],
   %w[ze ze libze_loader libTracerZE],
   %w[cuda cuda libcuda libTracerCUDA],
   %w[hip hip libamdhip64 libTracerHIP]].each do |name, bt_name, lib, libtracer|

Should be good! I tested ze and cl. (OMP is handled below as a special case.)

I removed all the *prof because now we can pass --backend to the new iprof to restrict which backends to trace.
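A rough sketch of how such a `--backend` filter could be parsed with the standard `OptionParser`. The backend names come from the table quoted above; the comma-separated list syntax and the `parse_backends` helper are assumptions for illustration, not necessarily what iprof does:

```ruby
require 'optparse'

# Backend names from the table quoted above (omp is a special case and is
# left out of this sketch).
BACKENDS = %w[opencl ze cuda hip].freeze

# Hypothetical helper: return the backends to trace. Defaults to all of
# them when --backend is not given.
def parse_backends(argv)
  selected = BACKENDS
  OptionParser.new do |opts|
    # Assumption: the option takes a comma-separated list, e.g. ze,cuda
    opts.on('--backend LIST', Array, 'Backends to trace') do |list|
      unknown = list - BACKENDS
      raise OptionParser::InvalidArgument, unknown.join(',') unless unknown.empty?

      selected = list
    end
  end.parse!(argv)
  selected
end

only_gpu = parse_backends(%w[--backend ze,cuda])
everything = parse_backends([])
```

Rejecting unknown names up front keeps a typo from silently tracing nothing.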

TApplencourt commented 9 months ago

Oh, for the enable_events_*, yeah, I'm stupid indeed! Thanks!

Kerilk commented 9 months ago

Yeah, I could have been more clear, sorry about that.

TApplencourt commented 9 months ago

Added support in d941. Also implemented a small optimization: events are enabled only for the backends whose libs we found.
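The optimization can be sketched roughly as follows, assuming a backend counts as available when a shared object with the library name from the table quoted earlier in the thread exists in one of the search directories. The `available_backends` helper and the use of `LD_LIBRARY_PATH` as the search path are assumptions, not the PR's actual lookup:

```ruby
# Backend name => library base name, from the table quoted earlier.
BACKEND_LIBS = {
  'opencl' => 'libOpenCL',
  'ze'     => 'libze_loader',
  'cuda'   => 'libcuda',
  'hip'    => 'libamdhip64'
}.freeze

# Hypothetical helper: return the backends whose library has at least one
# matching file (e.g. libcuda.so or libcuda.so.1) in the given directories.
def available_backends(search_dirs)
  BACKEND_LIBS.select do |_name, lib|
    search_dirs.any? { |dir| !Dir.glob(File.join(dir, "#{lib}.so*")).empty? }
  end.keys
end

# Assumption: only LD_LIBRARY_PATH is searched in this sketch.
dirs = ENV.fetch('LD_LIBRARY_PATH', '').split(':').reject(&:empty?)
found = available_backends(dirs)
```

Events would then be enabled only for the names in `found`, so a machine without CUDA would never pay for the CUDA tracepoints.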

TApplencourt commented 9 months ago

Those failures have nothing to do with the PR. Will investigate. It looks like an issue in our testing framework.

The bug was likely triggered by an update to one of the dependencies, but has existed forever. Of course, I cannot reproduce it on my machine...

Kerilk commented 9 months ago

The first step would be archiving the logs after the run, which we already do for standard runs. For distcheck, dist, and check, we would need to find the right folder.

Kerilk commented 9 months ago

This is the error for reference:

  + BINDING_DIR=. DUST_MODELS_DIR=/home/runner/work/THAPI/THAPI/build/cuda/:/home/runner/work/THAPI/THAPI/build/../xprof/ BABELTRACE_PLUGIN_PATH=./.libs/ DUST_TRACE_DIR=/home/runner/work/THAPI/THAPI/build/../cuda/tests:/home/runner/work/THAPI/THAPI/build/cuda/tests ruby ../../utils/bt2.rb -f ./tests/interval_profiling_normal.dust
  /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/trace-ir/field.rb:18: [BUG] Segmentation fault at 0x0000000000000051
  ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-linux-gnu]

  -- Control frame information -----------------------------------------------
  c:0014 p:---- s:0067 e:000066 CFUNC  :bt_field_get_class_type
  c:0013 p:0022 s:0062 e:000060 METHOD /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/trace-ir/field.rb:18
  c:0012 p:0037 s:0055 e:000054 METHOD /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/trace-ir/event.rb:72
  c:0011 p:0055 s:0050 e:000049 BLOCK  /home/runner/work/THAPI/THAPI/utils/bt_plugins/comparator.rb:32 [FINISH]
  c:0010 p:---- s:0044 e:000043 CFUNC  :each
  c:0009 p:0012 s:0040 e:000039 BLOCK  /home/runner/work/THAPI/THAPI/utils/bt_plugins/comparator.rb:29 [FINISH]
  c:0008 p:---- s:0035 e:000034 CFUNC  :each
  c:0007 p:0008 s:0031 e:000030 METHOD /home/runner/work/THAPI/THAPI/utils/bt_plugins/comparator.rb:28 [FINISH]
  c:0006 p:---- s:0026 e:000025 IFUNC
  c:0005 p:0015 s:0023 e:000021 BLOCK  /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/graph/component-class-dev.rb:55 [FINISH]
  c:0004 p:---- s:0017 e:000016 CFUNC  :bt_graph_put_ref
  c:0003 p:---- s:0014 e:000013 CFUNC  :call
  c:0002 p:0019 s:0009 e:000008 METHOD /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/types.rb:587 [FINISH]
  c:0001 p:0000 s:0003 E:000680 (none) [FINISH]

  -- Ruby level backtrace information ----------------------------------------
  /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/types.rb:587:in `call'
  /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/types.rb:587:in `call'
  /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/types.rb:587:in `bt_graph_put_ref'
  /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/graph/component-class-dev.rb:55:in `block in _wrap_component_class_finalize_method'
  /home/runner/work/THAPI/THAPI/utils/bt_plugins/comparator.rb:28:in `finalize_method'
  /home/runner/work/THAPI/THAPI/utils/bt_plugins/comparator.rb:28:in `each'
  /home/runner/work/THAPI/THAPI/utils/bt_plugins/comparator.rb:29:in `block in finalize_method'
  /home/runner/work/THAPI/THAPI/utils/bt_plugins/comparator.rb:29:in `each'
  /home/runner/work/THAPI/THAPI/utils/bt_plugins/comparator.rb:32:in `block (2 levels) in finalize_method'
  /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/trace-ir/event.rb:72:in `get_payload_field'
  /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/trace-ir/field.rb:18:in `from_handle'
  /var/lib/gems/3.0.0/gems/babeltrace2-0.1.4/lib/babeltrace2/trace-ir/field.rb:18:in `bt_field_get_class_type'

so most probably a lifetime issue somewhere in our dust plugin...

TApplencourt commented 9 months ago

Yep, I did that last time and sent it to you on Slack. At least now we know it's the same error :)

TApplencourt commented 8 months ago

Nice, the tests are now passing! :D

TApplencourt commented 8 months ago

Ah fuck, why is the PR so big now >< I screwed up the rebase with master... Will fix.

TApplencourt commented 8 months ago

Thanks! Bryce is adding support for a new MPI launcher, then we can merge!

Kerilk commented 8 months ago

Is it really a good idea to make this one bigger than it already is? Or is this one broken as is or not replacing the original application in any way?

TApplencourt commented 8 months ago

It's just 3 new ENV variables to grab, to allow Bryce to use MPI + CUDA on his box. But yeah, we will stop here. Will add the other fancy MPI launchers later.

Fixed a bug with traced-ranks, and verified that CUDA works. We can merge now.
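For illustration, picking up the MPI rank from launcher-provided environment variables and applying a traced-ranks filter could look like the sketch below. The variable list (`PMI_RANK`, `OMPI_COMM_WORLD_RANK`, `PMIX_RANK`) is an assumption based on what common launchers export, not the three variables added in this PR, and the `mpi_rank` and `traced?` helpers and their semantics are hypothetical:

```ruby
# Assumption: rank variables exported by common MPI launchers. This is not
# the list used in the PR.
RANK_ENV_VARS = %w[PMI_RANK OMPI_COMM_WORLD_RANK PMIX_RANK].freeze

# Hypothetical helper: return the rank as an Integer, or nil when no known
# launcher variable is set (e.g. a non-MPI run).
def mpi_rank(env = ENV)
  name = RANK_ENV_VARS.find { |var| env.key?(var) }
  name ? Integer(env[name]) : nil
end

# Hypothetical helper: trace everything for non-MPI runs or when no filter
# is given, otherwise only the listed ranks.
def traced?(rank, traced_ranks)
  rank.nil? || traced_ranks.empty? || traced_ranks.include?(rank)
end

rank = mpi_rank('OMPI_COMM_WORLD_RANK' => '3')
```

Supporting another launcher then only means appending its rank variable to the list.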