
Memory Overflow using std::vector of buffers #840

Open Rhynden opened 1 year ago

Rhynden commented 1 year ago

Hi,

I'm curious what is going on with memory management when storing sycl::buffer objects in a std::vector.

I have a very small example program:

#include <sycl/sycl.hpp>
#include <vector>

int main()
{
    sycl::queue myQueue;
    auto dims = sycl::range<3>{64, 64, 64};
    std::vector<sycl::buffer<sycl::float2, 3>> vectorOfBuffers;

    for (int index = 0; index < 10000; index++)
    {
        sycl::buffer<sycl::float2, 3> complexIn{dims};
        myQueue.submit([&](sycl::handler &cgh) {
            sycl::accessor Matrix{complexIn, cgh, sycl::write_only, sycl::no_init};

            // Enqueue a parallel kernel iterating over the 64x64x64 3D iteration space
            cgh.parallel_for(dims, [=](sycl::id<3> index) {
                Matrix[index] = sycl::float2{
                    static_cast<float>(index[2] + index[1] * dims[2] +
                                       index[0] * dims[1] * dims[2]),
                    1.0f};
            });
        });

        // Copying the buffer into the vector keeps it (and its device allocation) alive
        vectorOfBuffers.push_back(complexIn);
    }
}

Compiling for an NVIDIA GPU and running this program gives me the following errors:

/hipSYCL/src/runtime/dag_direct_scheduler.cpp:109 @ ensure_allocation_exists(): dag_direct_scheduler: Lazy memory allocation has failed.
/hipSYCL/src/runtime/cuda/cuda_allocator.cpp:48 @ allocate(): cuda_allocator: cudaMalloc() failed (error code = CUDA:2)

I'm curious what is going on. From monitoring I can see that the GPU memory slowly fills up until cudaMalloc() fails (CUDA error code 2 is cudaErrorMemoryAllocation, i.e. out of device memory). That adds up: each buffer holds 64³ float2 elements, i.e. 2 MiB, so keeping all 10,000 buffers alive needs roughly 20 GiB of device memory. Why does all the data from the buffers stay allocated on the GPU the whole time? Shouldn't SYCL be able to copy data to the GPU only when necessary, and keep the rest on the host or copy it back when it is no longer needed?

Kind regards

illuhad commented 1 year ago

> Why does all the data from the buffers stay allocated on the GPU the whole time? Shouldn't SYCL be able to copy data to the GPU only when necessary, and keep the rest on the host or copy it back when it is no longer needed?

In theory, yes. In practice, no SYCL implementation does this to my knowledge. It's just too hard to know if/when data will be needed again on the GPU, and you really don't want to pay the price of unnecessary data transfers when memory eviction heuristics fail. For best performance, the most reliable strategy is therefore usually to just use persistent allocations. This is also what enables extensions such as hipSYCL's buffer-USM interoperability: https://github.com/illuhad/hipSYCL/blob/develop/doc/buffer-usm-interop.md

The behavior of the hipSYCL runtime with respect to buffers is described in more detail here: https://github.com/illuhad/hipSYCL/blob/develop/doc/runtime-spec.md (note the section "persistent allocations")
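In practice this means a buffer's device allocation is only released when the buffer is destroyed. As a minimal sketch of the lifetime rule (not a drop-in fix, since you lose the results unless you copy them out first, e.g. via a host accessor): if each buffer simply goes out of scope at the end of its loop iteration instead of being copied into the vector, its destructor waits for the kernel and frees the allocation, so at most one ~2 MiB buffer is live at a time:

for (int index = 0; index < 10000; index++)
{
    sycl::buffer<sycl::float2, 3> complexIn{dims};
    myQueue.submit([&](sycl::handler &cgh) {
        sycl::accessor Matrix{complexIn, cgh, sycl::write_only, sycl::no_init};
        cgh.parallel_for(dims, [=](sycl::id<3> index) {
            Matrix[index] = sycl::float2{1.0f, 1.0f};
        });
    });
    // No push_back: ~buffer() waits for the kernel to complete and then
    // releases the device allocation.
}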

If you want things like oversubscription of GPU memory (it will probably always cost you performance, so I'd not recommend it), you can use shared USM allocations. If the backend supports it, they will be mapped to memory that automatically migrates between host and device using pagefaulting mechanisms. In this case, oversubscription is managed at the driver level, which is probably a better approach if you really need it.
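For illustration, here is a minimal sketch of your loop rewritten with shared USM (the variable names are just placeholders; this assumes a backend with pagefault-based migration, such as CUDA managed memory):

#include <sycl/sycl.hpp>
#include <vector>

int main()
{
    sycl::queue q;
    constexpr size_t count = 64 * 64 * 64;
    std::vector<sycl::float2 *> allocations;

    for (int i = 0; i < 10000; i++)
    {
        // Shared USM: pages migrate between host and device on demand,
        // so oversubscription of device memory is handled by the driver.
        sycl::float2 *data = sycl::malloc_shared<sycl::float2>(count, q);
        q.parallel_for(sycl::range<1>{count}, [=](sycl::id<1> idx) {
            data[idx[0]] = sycl::float2{static_cast<float>(idx[0]), 1.0f};
        });
        allocations.push_back(data);
    }

    q.wait(); // make sure all kernels are done before touching/freeing the memory
    for (auto *ptr : allocations)
        sycl::free(ptr, q);
}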