argonne-lcf / THAPI

A tracing infrastructure for heterogeneous computing applications.

[draft] port cuda filter to metababel #164

Closed bd4 closed 8 months ago

bd4 commented 9 months ago

This tries to build the metababel btx_*.c files with the g++ compiler, and the build fails.

bd4 commented 9 months ago

Note that I committed btx_cuda_model.yaml even though it's generated, so I can track it during development. I will amend the commit to remove it before the final PR.

bd4 commented 9 months ago

Note that this has been rebased on btx_ze PR, which should be merged shortly.

TApplencourt commented 9 months ago

@Kerilk merged btx_ze 🎉 . You can rebase :)

bd4 commented 9 months ago

Rebased. Luckily I kept the pre-rebase branch around and just had to cherry-pick my newest commits. One of the downsides of squashing is that it breaks rebasing for a branch that was already rebased before the squash. Need someone to approve the workflow now.

TApplencourt commented 9 months ago

Before I forget: please also modify the configure script to add the new requirement on the bumped version of metababel.

bd4 commented 9 months ago

With what you are suggesting, it will produce a compile-time error if the cast/type does not match. As I have it now, I think it would just match the one type and not filter the other, if another existed. Right?


@TApplencourt commented on this pull request.


In cuda/btx_cudamatching_model.yaml:

  • :field_class:
  • :cast_type: size_t
  • :type: integer_unsigned

No, it will be a compile-time error. ByteCount would have two types, hence two function signatures for the callbacks, which is obviously not possible.


TApplencourt commented 9 months ago

With what you are suggesting, it will produce a compile-time error if the cast/type does not match. As I have it now, I think it would just match the one type and not filter the other, if another existed. Right?

Exactly! So the other approach will ensure that we handle them all, and none fall through the cracks.
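
To make the trade-off concrete, here is a minimal Python sketch of the two matching strategies. It is not metababel's code: the event names, the second `unsigned int` type, and the helpers `match` and `callback_signature` are all hypothetical. It only illustrates why pinning a cast type can silently skip events, while matching by name alone surfaces the conflict as an error.

```python
# Hypothetical event descriptions; names and types are illustrative only
# and are not taken from THAPI or metababel.
EVENTS = [
    {"name": "memcpy_a_entry", "fields": {"ByteCount": "size_t"}},
    {"name": "memcpy_b_entry", "fields": {"ByteCount": "unsigned int"}},
    {"name": "launch_entry", "fields": {"gridDim": "unsigned int"}},
]


def match(events, field_name, cast_type=None):
    """Return the events that carry `field_name`.

    If `cast_type` is given, only fields with exactly that type match,
    so events using another type are skipped without any warning.
    """
    matched = []
    for event in events:
        field_type = event["fields"].get(field_name)
        if field_type is None:
            continue
        if cast_type is not None and field_type != cast_type:
            continue
        matched.append(event)
    return matched


def callback_signature(events, field_name):
    """Return the single argument type for a callback, or raise if ambiguous.

    This stands in for the compile-time error: one callback cannot have
    two different signatures.
    """
    types = {event["fields"][field_name] for event in events}
    if len(types) != 1:
        raise TypeError(f"{field_name} has conflicting types: {sorted(types)}")
    return types.pop()


# Pinned type: one event matches, the other ByteCount event is silently missed.
pinned = match(EVENTS, "ByteCount", cast_type="size_t")
print([event["name"] for event in pinned])

# Name only: both events match, and the type conflict is reported loudly.
by_name = match(EVENTS, "ByteCount")
try:
    callback_signature(by_name, "ByteCount")
except TypeError as err:
    print("error:", err)
```

In this sketch the error plays the role of the compile-time failure: it forces every variant of the field to be handled explicitly instead of being dropped by the filter.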

bd4 commented 8 months ago

I can't reproduce the CI failure, not sure what is going on here. That file definitely was checked in and exists.

Edit: found it, needed to update utils makefile.

bd4 commented 8 months ago

The new test cases are very CUDA-specific, related to the context API. We should probably add cuda to the name and remove the ifs. The kernel_name and multithread cases could be made generic with some more testing.

bd4 commented 8 months ago

For my 11MB fft bench test case, the master branch is ~2x faster than this feature branch. Not sure how much of this comes from misc bug fixes vs an actual regression.

bd4 commented 8 months ago

With my manual test cases, this produces the same results as the master branch except for the expected differences:

  1. Reverse order of host event and traffic event
  2. Fix backend=3 for device events
  3. Add ts to traffic event

It also runs successfully on MPI/CUDA apps, specifically a test app using gtensor and MPI.

I think the main potential blocker is that on at least one test case, it is 2x slower than the master branch. We can merge and then try to address the performance, or hold off and try to improve it first.