OpenCL dGPU CL_OUT_OF_RESOURCES

pvelesko commented 10 months ago

Seems like there is an issue regarding MemMap / unmap operations

pjaaskel commented 10 months ago

Can you provide a bit more info?

pvelesko commented 10 months ago

Running the tests multiple times on the dGPU seems to induce either CL_OUT_OF_RESOURCES or a timeout.

../scripts/check.py ./ dgpu opencl --num-threads=24 --num-tries=30

Valgrind reports memory leaks, most of which have clEnqueueSVM(Un)map in common:

==480751== 12,885,800 (8,043,440 direct, 4,842,360 indirect) bytes in 17,335 blocks are definitely lost in loss record 2,860 of 2,860
==480751==    at 0x4849013: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==480751==    by 0x5A23D39: NEO::CommandQueueHw<NEO::XeHpgCoreFamily>::setupEvent(NEO::EventBuilder&, _cl_event**, unsigned int) (in /home/pvelesko/install/intel/neo/2023.10.02/lib/intel-opencl/libigdrcl.so)
==480751==    by 0x5A2D566: int NEO::CommandQueueHw<NEO::XeHpgCoreFamily>::enqueueBlit<4595u>(NEO::MultiDispatchInfo const&, unsigned int, _cl_event* const*, _cl_event**, bool, NEO::CommandStreamReceiver&) (in /home/pvelesko/install/intel/neo/2023.10.02/lib/intel-opencl/libigdrcl.so)
==480751==    by 0x5A6DD90: NEO::CommandQueueHw<NEO::XeHpgCoreFamily>::enqueueSVMUnmap(void*, unsigned int, _cl_event* const*, _cl_event**, bool) (in /home/pvelesko/install/intel/neo/2023.10.02/lib/intel-opencl/libigdrcl.so)
==480751==    by 0x57EC6A7: clEnqueueSVMUnmap (in /home/pvelesko/install/intel/neo/2023.10.02/lib/intel-opencl/libigdrcl.so)
==480751==    by 0x4C308CF: CHIPQueueOpenCL::MemUnmap(chipstar::AllocationInfo const*) (src/backend/OpenCL/CHIPBackendOpenCL.cc:1014)
==480751==    by 0x4B72C08: chipstar::Queue::RegisteredVarCopy(chipstar::ExecItem*, chipstar::Queue::MANAGED_MEM_STATE)::$_0::operator()(chipstar::AllocationInfo const&) const (src/CHIPBackend.cc:1732)
==480751==    by 0x4B701B8: void chipstar::AllocationTracker::visitAllocations<chipstar::Queue::RegisteredVarCopy(chipstar::ExecItem*, chipstar::Queue::MANAGED_MEM_STATE)::$_0>(chipstar::Queue::RegisteredVarCopy(chipstar::ExecItem*, chipstar::Queue::MANAGED_MEM_STATE)::$_0) const (src/CHIPBackend.hh:599)
==480751==    by 0x4B6FFC7: chipstar::Queue::RegisteredVarCopy(chipstar::ExecItem*, chipstar::Queue::MANAGED_MEM_STATE) (src/CHIPBackend.cc:1744)
==480751==    by 0x4B71194: chipstar::Queue::launch(chipstar::ExecItem*) (src/CHIPBackend.cc:1810)
==480751==    by 0x4B719DB: chipstar::Queue::launchKernel(chipstar::Kernel*, dim3, dim3, void**, unsigned long) (src/CHIPBackend.cc:1852)
==480751==    by 0x4BED852: hipLaunchKernelInternal(void const*, dim3, dim3, void**, unsigned long, ihipStream_t*) (src/CHIPBindings.cc:4017)
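To check how much of the "definitely lost" total actually goes through clEnqueueSVMMap/clEnqueueSVMUnmap, a rough sketch like the following could tally a saved Valgrind log. This is not part of chipStar; the function name `summarize_leaks` is made up for illustration, and it assumes memcheck record headers in the same format as the one above.

```python
import re

# Header of a memcheck leak record, e.g.
# "==480751== 12,885,800 (... direct, ... indirect) bytes in 17,335 blocks are definitely lost ..."
RECORD_RE = re.compile(
    r"^==\d+== ([\d,]+) .*bytes in [\d,]+ blocks are "
    r"(definitely lost|indirectly lost|possibly lost|still reachable)"
)
# Stack frames that mention the OpenCL SVM map/unmap entry points.
SVM_RE = re.compile(r"\bclEnqueueSVM(?:Map|Unmap)\b")


def summarize_leaks(log_text):
    """Return (svm_bytes, other_bytes) summed over 'definitely lost' records.

    A record counts towards svm_bytes if any frame in its stack trace
    mentions clEnqueueSVMMap or clEnqueueSVMUnmap.
    """
    records = []  # each entry: [bytes_lost, touches_svm_map_or_unmap]
    in_definite = False
    for line in log_text.splitlines():
        header = RECORD_RE.match(line)
        if header:
            # Every header starts a new record; only 'definitely lost' is tallied.
            in_definite = header.group(2) == "definitely lost"
            if in_definite:
                records.append([int(header.group(1).replace(",", "")), False])
        elif in_definite and SVM_RE.search(line):
            records[-1][1] = True
    svm_bytes = sum(size for size, hit in records if hit)
    other_bytes = sum(size for size, hit in records if not hit)
    return svm_bytes, other_bytes
```

Calling `summarize_leaks(open("valgrind.log").read())` (the file name is a placeholder) returns the two byte totals, which makes it easier to see whether the SVM map/unmap path dominates the report.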

pjaaskel commented 10 months ago

It happens inside the driver, perhaps some sort of temp shadow buffer allocation. SVMUnmap is not supposed to allocate anything (for the client) except the command itself. And since the test works, the command should get executed and finished (and freed). Are we missing a cl_context release?

pvelesko commented 9 months ago

The OpenCL memory leaks were resolved, but the remaining problem seems like a driver issue. We've seen similar behavior with the Level-Zero backend where, after running certain unit tests, we would be left with a defunct kernel process tied to the i915 kernel module.

We now see similar behavior in OpenCL: the total time taken for the unit tests goes from ~12 minutes all the way up to ~30 minutes, at which point we start seeing failures with CL_OUT_OF_RESOURCES. This would only make sense if something is not getting cleaned up between CI runs.

I updated the runtime. Let's see if this behavior persists; if it does, I'll need to make a reproducer and report the issue.
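As a starting point for such a reproducer, a minimal sketch could run the same command back to back and record how long each run takes and whether the error string shows up in its output. The helper name `run_repeatedly` and its parameters are hypothetical, and it assumes the failure is visible as the text CL_OUT_OF_RESOURCES on stdout or stderr.

```python
import subprocess
import time


def run_repeatedly(cmd, repeats, marker="CL_OUT_OF_RESOURCES"):
    """Run cmd `repeats` times in a row.

    Returns one (elapsed_seconds, marker_seen) tuple per run, so that a
    slowdown across runs or the first appearance of the error is easy to spot.
    """
    results = []
    for _ in range(repeats):
        start = time.monotonic()
        # Capture output as text so it can be searched for the marker.
        proc = subprocess.run(cmd, capture_output=True, text=True)
        elapsed = time.monotonic() - start
        seen = marker in proc.stdout or marker in proc.stderr
        results.append((elapsed, seen))
    return results
```

With the command from earlier in the thread this would be `run_repeatedly(["../scripts/check.py", "./", "dgpu", "opencl", "--num-threads=24", "--num-tries=30"], 5)`, where the repeat count of 5 is arbitrary. Steadily growing times followed by `marker_seen` turning true would match the behavior described above.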

CHIP-SPV / chipStar

OpenCL dGPU CL_OUT_OF_RESOURCES #690