ValeevGroup / tiledarray

A massively-parallel, block-sparse tensor framework written in C++
GNU General Public License v3.0
247 stars 51 forks source link

rocBLAS calls do not produce correct results #419

Open evaleev opened 10 months ago

evaleev commented 10 months ago

HIP/ROCm support introduced in https://github.com/ValeevGroup/tiledarray/pull/418 is only minimally functional at the moment (but already sufficient to provide HIP support in https://github.com/devreal/ttg/tree/ttg-device-support-master-coro-with-stream-tasks) but when trying to use rocBLAS (via ICL's blaspp C++ API) it seems that nothing happens. Here's a simplified version of examples/device/device_task:

// copy data from arg.data() to result.data()
blas::copy(result.size(), arg.data(), 1, device_data(result.storage()), 1,
             queue);
hipStreamSynchronize(queue.stream());
TA_ASSERT(result.data()[0] == arg.data()[0]);

It fails in the assertion. Meanwhile

hipMemcpyAsync(result.data(),arg.data(),result.size()*sizeof(double),device::MemcpyDefault, stream);
hipStreamSynchronize(stream);
TA_ASSERT(result.data()[0] == arg.data()[0]);

succeeds.

Note that result.data() and arg.data() point to the unified memory (allocated via hipMallocManaged). So the only working hypothesis is that rocBLAS does not support operations on data in UM ...

evaleev commented 10 months ago

@dmcdougall : the issue seems to be resolved by setting env var HIP_VISIBLE_DEVICES=0. TA does not read this env var.

dmcdougall commented 10 months ago

I can't reproduce the failure locally:

$ cat repro.cpp 
#include <cassert>
#include <rocblas/rocblas.h>

#define hipCheck(s) \
do {\
  hipError_t err = s;\
  if (err != hipSuccess) {\
    printf( "Failed to run error %d ", __LINE__);\
    return -1;\
  }\
} while(0)\

#define rocblasCheck(s) \
do {\
  rocblas_status err = s;\
  if (err != rocblas_status_success) {\
    printf( "Failed to run error %d ", __LINE__);\
    return -1;\
  }\
} while(0)\

int main(int argc, char ** argv)
{
  int N = 1;
  size_t size = N * sizeof(double);

  double * arg, * result;

  hipCheck(hipMallocManaged(&arg, size));
  hipCheck(hipMallocManaged(&result, size));

  hipCheck(hipMemset(arg, 1, size));
  hipCheck(hipMemset(result, 0, size));

  hipStream_t stream;
  hipCheck(hipStreamCreate(&stream));

  rocblas_handle handle;
  rocblasCheck(rocblas_create_handle(&handle));
  rocblasCheck(rocblas_set_stream(handle, stream));

  rocblasCheck(rocblas_dcopy(handle, N, arg, 1, result, 1));  // copy arg into result
  hipCheck(hipStreamSynchronize(stream));
  assert(result[0] == arg[0]);  //fails?

  rocblas_destroy_handle(handle);
  hipCheck(hipStreamDestroy(stream));
  hipCheck(hipFree(arg));
  hipCheck(hipFree(result));
  return 0;
}
$ hipcc repro.cpp -L/opt/rocm-5.7.0/lib -o repro -lrocblas
$ env | grep HIP
$ env | grep ROCR
$ ./repro 
$ echo $?
0
$ rocminfo | grep gfx9
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-

Your observation about setting the device visibility in the environment is interesting. Are you launching your job with slurm with the cgroups plugin enabled?

dmcdougall commented 9 months ago

Could you also re-run your example with AMD_LOG_LEVEL=3 set in the environment? I want to see if there are any hip runtime calls in your example that aren't present in my example.

There will be quite a lot of output to the screen (stderr, I think), so I recommend piping to a file.

evaleev commented 9 months ago

@dmcdougall thanks for investigating ... I invoke the executable directly, so no slurm involved, and HIP_VISIBLE_DEVICES is not set by default.

Unfortunately my attempts to make the example more representative of the "real" example did not succeed to trigger the problem. In the real app all calls happen in thread pool so I thought maybe some thread-local state was not being initialized properly ... to no avail.

For the record, here's the most recent form of the example:

[ICL:dopamine ~]$ cat repro-thread.cc
#include <cassert>
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <thread>
#include <iostream>
#include <atomic>

#define hipCheck(s) \
do {\
  hipError_t err = s;\
  if (err != hipSuccess) {\
    printf( "Failed to run error %d ", __LINE__);\
    return -1;\
  }\
} while(0)\

#define rocblasCheck(s) \
do {\
  rocblas_status err = s;\
  if (err != rocblas_status_success) {\
    printf( "Failed to run error %d ", __LINE__);\
    return -1;\
  }\
} while(0)\

using task_ptr = void (*)();

std::atomic<bool> done{false};
std::atomic<task_ptr> current_task{nullptr};

hipStream_t stream;
rocblas_handle handle;
const int N = 1000000;
double * arg, * result;

void do_work(){
    rocblas_dcopy(handle, N, arg, 1, result, 1);  // copy arg into result
    auto err = hipStreamSynchronize(stream);
    assert(result[0] == arg[0]);
  };

int main(int argc, char ** argv)
{
  size_t size = N * sizeof(double);

  // start worker thread
  std::thread worker([&]() {
    while (!done) {
      if (current_task != nullptr) {
        (*current_task)();
    done = true;
    current_task = nullptr;
      }
    }
  });

  hipCheck(hipMallocManaged(&arg, size));
  hipCheck(hipMallocManaged(&result, size));

  hipCheck(hipMemset(arg, 1, size));
  hipCheck(hipMemset(result, 0, size));

  hipCheck(hipStreamCreate(&stream));

  rocblasCheck(rocblas_create_handle(&handle));
  rocblasCheck(rocblas_set_stream(handle, stream));

  current_task = do_work;
  worker.join();

  rocblas_destroy_handle(handle);
  hipCheck(hipStreamDestroy(stream));
  hipCheck(hipFree(arg));
  hipCheck(hipFree(result));
  return 0;
}
[ICL:dopamine ~]$ hipcc -g -O0 repro-thread.cc -L/opt/rocm-5.7.0/lib -o repro-thread -lrocblas
[ICL:dopamine ~]$ ./repro-thread
dmcdougall commented 9 months ago

Ok, thanks.

Can you either:

  1. re-run your example with AMD_LOG_LEVEL=3 set in the environment? There are probably some hip runtime pieces I am missing; or
  2. show me how you're building TA and that example? That way I can experiment with it.