ROCm / roctracer

ROCm Tracer Callback/Activity Library for Performance tracing AMD GPUs
https://rocm.docs.amd.com/projects/roctracer/en/latest/
Other
64 stars 30 forks source link

used roctracer api but compiler generated bunch of errors #94

Open jdgh000 opened 8 months ago

jdgh000 commented 8 months ago

Following simple vector c++ without tracer code support will compile and run ok: hipcc ; ./a.out

if I add roctracer_ext.h include and compile with roctracer library, bunch of errors spit out: hipcc -I /opt/rocm-5.7.0/roctracer/include/ /opt/rocm-5.7.0/lib/libroctracer64.so error log with compile with roctracer support:

 file included from p61.cpp:3:
/opt/rocm-5.7.0/roctracer/include/roctracer_ext.h:33:2: warning: "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with roctracer" [-W#warnings]
#warning "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with roctracer"
 ^
p61.cpp:16:9: warning: 'N' macro redefined [-Wmacro-redefined]
#define N 4095
        ^
p61.cpp:15:9: note: previous definition is here
#define N 536870912
        ^
2 warnings generated when compiling for gfx90a.
In file included from p61.cpp:3:
/opt/rocm-5.7.0/roctracer/include/roctracer_ext.h:33:2: warning: "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with roctracer" [-W#warnings]
#warning "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with roctracer"
 ^
p61.cpp:16:9: warning: 'N' macro redefined [-Wmacro-redefined]
#define N 4095
        ^
p61.cpp:15:9: note: previous definition is here
#define N 536870912
        ^
2 warnings generated when compiling for host.
/opt/rocm-5.7.0/lib/libroctracer64.so:1:1: error: expected unqualified-id
<U+007F>ELF<U+0002><U+0001><U+0001><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0003><U+0000>><U+0000><U+0001><U+0000><U+0000><U+0000>0j<U+0000><U+0000><U+0000><U+0000><U+0000><U+0000>@<U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000>X6<U+0005><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000>@<U+0000>8
^
/opt/rocm-5.7.0/lib/libroctracer64.so:1:8: warning: null character ignored [-Wnull-character]
<U+007F>ELF<U+0002><U+0001><U+0001><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0003><U+0000>><U+0000><U+0001><U+0000><U+0000><U+0000>0j<U+0000><U+0000><U+0000><U+0000><U+0000><U+0000>@<U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000>X6<U+0005><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000>@<U+0000>8

...

><U+0008><U+0005><U+0000><U+0000><U+0000><U+0000><U+0000><80><U+0008><U+0005><U+0000><U+0000><U+0000><U+0000><U+0000><80><U+0007><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><80><U+0007><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0001><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0000><U+0004><U+0000><U+0000><U+0000><U+0014><U+0000><U+0000><U+0000><U+0003><U+0000><U+0000><U+0000>GNU<U+0000><AE><98>wv<E3><U+0016><R
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
316 warnings and 20 errors generated when compiling for gfx90a.

example vector add code:

#include <stdio.h>
#include "hip/hip_runtime.h"
//#include <roctracer_ext.h>

// 1. if N is set to up to 1024, then sum is OK.
// 2. Set N past the 1024 which is past No. of threads per blocks, and then all iterations of sum results in 
// even the ones within the block.

// 3. To circumvent the problem described in 2. above, since if N goes past No. of threads per block, we need multiple block launch.
// The trick is describe in p65 to use formula (N+127) / 128 for blocknumbers so that when block number starts from 1, it is 
// (1+127) / 128.

#define N 536870912 
#define N 4095
#define MAX_THREAD_PER_BLOCK 1024

__global__ void add( int * a, int * b, int * c ) {
    int tid = hipThreadIdx_x + hipBlockIdx_x * hipBlockDim_x ;
    if (tid < N) 
        c[tid] = a[tid] + b[tid];
}    

int main (void) {
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;
    int stepSize;

    int count = 0;

    hipGetDeviceCount(&count);

    printf("\nDevice count: %d.", count);

    // allocate dev memory for N size for pointers declared earlier.

    printf("\nAllocating memory...(size %u array size of INT).\n", N );

    a = (int*)malloc(N * sizeof(int));
    b = (int*)malloc(N * sizeof(int));
    c = (int*)malloc(N * sizeof(int));
    hipMalloc( (void**)&dev_a, N * sizeof(int));
    hipMalloc( (void**)&dev_b, N * sizeof(int));
    hipMalloc( (void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i/2;
        c[i] = 555;
    }

    // copy the initialized local memory values to device memory. 

    printf("\nCopy host to device...");
    hipMemcpy(dev_a, a, N * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(dev_b, b, N * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(dev_c, c, N * sizeof(int), hipMemcpyHostToDevice);

    const unsigned threadsPerBlock = 256;
    //const unsigned blocks = (N+threadsPerBlock + 1) / threadsPerBlock;
    const unsigned blocks = N/threadsPerBlock;

    // invoke the kernel: 
    // block count: (N+127)/128
    // thread count: 128

    hipLaunchKernelGGL(add, blocks, threadsPerBlock, 0, 0, dev_a, dev_b, dev_c);
    //roctracer_start();
    hipLaunchKernelGGL(add, blocks, threadsPerBlock, 0, 0, dev_a, dev_b, dev_c);
    //roctracer_stop();
    hipLaunchKernelGGL(add, blocks, threadsPerBlock, 0, 0, dev_a, dev_b, dev_c);
    //add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);
    hipMemcpy(a, dev_a, N * sizeof(int), hipMemcpyDeviceToHost);
    hipMemcpy(b, dev_b, N * sizeof(int), hipMemcpyDeviceToHost);
    hipMemcpy(c, dev_c, N * sizeof(int), hipMemcpyDeviceToHost);

    stepSize = N / 20;
    stepSize &=  ~(stepSize & 0x0f);
    printf("stepSize: %u\n", stepSize);
    for (int i = 0; i < N; i+=stepSize) {
        printf("%d: %d + %d = %d\n", i, a[i], b[i], c[i]);
    }

    hipFree(dev_a);
    hipFree(dev_b);
    hipFree(dev_c);
    free(a);
    free(b);
    free(c);
}
ammarwa commented 8 months ago

Hello,

Can you please try to replace #include <roctracer_ext.h> with #include <roctracer/roctracer_ext.h> Also, use the following to compile: hipcc -I/opt/rocm-5.7.0/include -L/opt/rocm-5.7.0/lib -lroctracer64 <filename.cpp>?

Please let me know if this works for you.

jdgh000 commented 8 months ago

Ok, that worked, thanks for assistance on this one. However, how would one use it to tracing to be captured on start/stop? I used "rocprof --hip-trace ./a.out" and resulting json has two entries of kernel(add() and also hipmemcpy. Was expecting only 2nd call to add will be captured...

egrep -irn add [root@localhost ex-1]# egrep -irn "trace|add" .cpp

3:#include <roctracer/roctracer_ext.h>
18:__global__ void add( int * a, int * b, int * c ) {
67:    hipLaunchKernelGGL(add, blocks, threadsPerBlock, 0, 0, dev_a, dev_b, dev_c);
71:    roctracer_start();
72:    hipLaunchKernelGGL(add, blocks, threadsPerBlock, 0, 0, dev_a, dev_b, dev_c);
74:    roctracer_stop();
75:    hipLaunchKernelGGL(add, blocks, threadsPerBlock, 0, 0, dev_a, dev_b, dev_c);
76:    //add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);
[root@localhost ex-1]# egrep -irn "kernel.*add" results.json
161:    "args":"( kernel(add(int*, int*, int*)) function_address(0x562bb6ce5d90) numBlocks({z(1) y(1) x(15}) dimBlocks({z(1) y(1) x(256}) args(0x7ffc3b35f040) sharedMemBytes(0) stream(1))",
197:    "args":"( kernel(add(int*, int*, int*)) function_address(0x562bb6ce5d90) numBlocks({z(1) y(1) x(15}) dimBlocks({z(1) y(1) x(256}) args(0x7ffc3b35f040) sharedMemBytes(0) stream(1))",
[root@localhost ex-1]# nano -w results.json
[root@localhost ex-1]# egrep -irn "hipmemcpy" results.
grep: results.: No such file or directory
[root@localhost ex-1]# egrep -irn "hipmemcpy" results.json
94:,{"ph":"X","name":"hipMemcpy","pid":"2","tid":"795","ts":"31035794492","dur":"184178",
100:    "Name":"hipMemcpy",
106:,{"ph":"X","name":"hipMemcpy","pid":"2","tid":"795","ts":"31035978679","dur":"17",
112:    "Name":"hipMemcpy",
118:,{"ph":"X","name":"hipMemcpy","pid":"2","tid":"795","ts":"31035978696","dur":"14",
124:    "Name":"hipMemcpy",
ammarwa commented 8 months ago

You will need to use '--trace-start off' option in rocprof along with the roctracer_start() and roctracer_stop() in your application as you already have.

jdgh000 commented 8 months ago

OK, that appears to work, as well as roctx. Thanks.

jdgh000 commented 8 months ago

One more question, can python use those apis? So far, i see mostly C++ codes, assuming no support for python.

ammarwa commented 8 months ago

For roctx, it should be working using python C libraries that can be used to call C/C++ functions inside python code. But I can't 100% confirm that for roctracer API calls.