ConfettiFX / The-Forge

The Forge Cross-Platform Rendering Framework PC Windows, Steamdeck (native), Ray Tracing, macOS / iOS, Android, XBOX, PS4, PS5, Switch, Quest 2
Apache License 2.0

[Need Help] Integrated GPU - Shared memory with CPU #314

Closed Hideman85 closed 2 months ago

Hideman85 commented 2 months ago

I am in the process of learning The Forge and I'm looking for some help on this topic, because I'm getting really confused right now.


I would like to run a compute shader on my iGPU and take advantage of the memory it shares with the CPU (read/write without transfers, same memory space).

So right now I'm trying a simple example: a compute shader that doubles each float of my array/buffer.

My shader `double.comp.fsl`:

```hlsl
RES(RWBuffer(float), myData, UPDATE_FREQ_NONE, b0, binding=0);

// Main compute shader function
NUM_THREADS(8, 8, 1)
void CS_MAIN(SV_GroupThreadID(uint3) inGroupId, SV_GroupID(uint3) groupId)
{
    INIT_MAIN;
    myData[inGroupId.x] *= 2.0; // Simple operation: double each float
    RETURN();
}
```
I'm able to find my integrated GPU:

```c++
Renderer* pRenderer = nullptr;
Renderer* pCompute = nullptr;

bool MyApp::Init()
{
    RendererContextDesc contextSettings = {};
    RendererContext* pContext = NULL;
    initRendererContext(GetName(), &contextSettings, &pContext);

    RendererDesc settings = {};

    // Need one GPU for rendering and one for compute to simplify
    if (pContext && pContext->mGpuCount >= 2)
    {
        uint32_t queueFamilyCount = 0;
        VkPhysicalDeviceMemoryProperties memProperties;
        auto SHARED_FLAG = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;
        int bestGpuIndex = -1;
        int bestProfile = -1;
        struct IntegratedComputeGPU { int idx; uint32_t mem; };
        std::vector<IntegratedComputeGPU> gComputeGPUs = {};

        for (int i = 0; i < pContext->mGpuCount; i++)
        {
            auto profile = pContext->mGpus[i].mSettings.mGpuVendorPreset.mPresetLevel;
            if (profile > bestProfile)
            {
                std::string str = "======> GPU " + std::to_string(i) + " profile " + std::to_string(profile);
                LOGF(LogLevel::eINFO, str.c_str());
                bestProfile = profile;
                bestGpuIndex = i;
            }

            auto device = pContext->mGpus[i].mVk.pGpu;
            auto& props = pContext->mGpus[i].mVk.mGpuProperties.properties;
            if (props.deviceType == VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU)
            {
                vkGetPhysicalDeviceQueueFamilyProperties(device, &queueFamilyCount, NULL);
                std::vector<VkQueueFamilyProperties> queueFamilies(queueFamilyCount);
                vkGetPhysicalDeviceQueueFamilyProperties(device, &queueFamilyCount, queueFamilies.data());
                for (VkQueueFamilyProperties& queueFamily : queueFamilies)
                {
                    if (queueFamily.queueFlags & VK_QUEUE_COMPUTE_BIT)
                    {
                        vkGetPhysicalDeviceMemoryProperties(device, &memProperties);
                        for (uint32_t j = 0; j < memProperties.memoryTypeCount; j++)
                        {
                            if (memProperties.memoryTypes[j].propertyFlags & SHARED_FLAG)
                            {
                                gComputeGPUs.push_back({i, props.limits.maxComputeSharedMemorySize});
                                break;
                            }
                        }
                        break;
                    }
                }
            }
        }

        if (gComputeGPUs.size() > 0)
        {
            int bestComputeIndex = -1;
            uint32_t bestComputeMem = 0;
            for (auto& gpu : gComputeGPUs)
            {
                if (gpu.idx != bestGpuIndex && gpu.mem > bestComputeMem)
                {
                    bestComputeIndex = gpu.idx;
                    bestComputeMem = gpu.mem;
                }
            }

            // We have all our needs
            if (bestComputeIndex != -1)
            {
                std::string str = "======> Compute GPU ";
                str.append(pContext->mGpus[bestComputeIndex].mVk.mGpuProperties.properties.deviceName);
                LOGF(LogLevel::eINFO, str.c_str());
                str = "======> Graphic GPU ";
                str.append(pContext->mGpus[bestGpuIndex].mVk.mGpuProperties.properties.deviceName);
                LOGF(LogLevel::eINFO, str.c_str());

                settings.pContext = pContext;

                // First the render GPU
                settings.mGpuMode = GPU_MODE_SINGLE;
                settings.mGpuIndex = bestGpuIndex;
                initRenderer(GetName(), &settings, &pRenderer);
                if (!pRenderer) return false;

                // Second the compute one
                settings.mGpuMode = GPU_MODE_UNLINKED;
                settings.mGpuIndex = bestComputeIndex;
                initRenderer(GetName(), &settings, &pCompute);
                if (!pCompute) return false;
            }
        }
    }

    // Default init
    if (!pRenderer)
    {
        LOGF(LogLevel::eINFO, "======> Fallback to single GPU");
        initRenderer(GetName(), &settings, &pRenderer);
        if (!pRenderer) return false;
    }

    if (pCompute) addBuffer();
    return true;
}
```
Shader, RootSignature, Pipeline, all good:

```c++
void Compute::AddShaders()
{
    ShaderLoadDesc desc = {};
    desc.mStages[0].pFileName = "double.comp";
    addShader(pCompute, &desc, &pComputeShader);
}

void Compute::RemoveShaders() { removeShader(pCompute, pComputeShader); }

void Compute::AddRootSignatures()
{
    RootSignatureDesc desc = { &pComputeShader, 1 };
    addRootSignature(pCompute, &desc, &pRootSignature);
}

void Compute::RemoveRootSignatures() { removeRootSignature(pCompute, pRootSignature); }

void Compute::AddPipelines()
{
    PipelineDesc pipelineDesc = {};
    pipelineDesc.pName = "ComputePipeline";
    pipelineDesc.mType = PIPELINE_TYPE_COMPUTE;
    ComputePipelineDesc& computePipelineSettings = pipelineDesc.mComputeDesc;
    computePipelineSettings.pShaderProgram = pComputeShader;
    computePipelineSettings.pRootSignature = pRootSignature;
    addPipeline(pCompute, &pipelineDesc, &pPipeline);
}

void Compute::RemovePipelines() { removePipeline(pCompute, pPipeline); }
```
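
One step not shown here is binding the buffer to `myData` in the shader. A minimal sketch in the usual The Forge descriptor-set style, assuming the buffer ends up in a `pComputeBuffer` handle (as in the working version further down) and a `pDescriptorSet` member that I haven't shown:

```c++
DescriptorSet* pDescriptorSet = nullptr;

void Compute::AddDescriptorSets()
{
    // One set, filled once, matching UPDATE_FREQ_NONE in the shader
    DescriptorSetDesc setDesc = { pRootSignature, DESCRIPTOR_UPDATE_FREQ_NONE, 1 };
    addDescriptorSet(pCompute, &setDesc, &pDescriptorSet);

    // Point the "myData" RWBuffer at the shared buffer
    DescriptorData params[1] = {};
    params[0].pName = "myData";
    params[0].ppBuffers = &pComputeBuffer;
    updateDescriptorSet(pCompute, 0, pDescriptorSet, 1, params);
}

void Compute::RemoveDescriptorSets() { removeDescriptorSet(pCompute, pDescriptorSet); }
```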


Now the part that I think I'm getting wrong: I try to create a GPU buffer from the existing CPU memory 🤔

addBuffer():

```c++
// Taken from The Forge renderer
DECLARE_RENDERER_FUNCTION(void, addBuffer, Renderer* pCompute, const BufferDesc* pDesc, Buffer** pp_buffer)
DECLARE_RENDERER_FUNCTION(void, removeBuffer, Renderer* pCompute, Buffer* pBuffer)

Buffer* pBuffer = nullptr;
std::vector<float> buff(100, 1.f);
uint64_t totalSize = 100 * sizeof(float);

void addBuffer()
{
    BufferLoadDesc bDesc = {};
    bDesc.mDesc.mDescriptors = DESCRIPTOR_TYPE_VERTEX_BUFFER | (DESCRIPTOR_TYPE_BUFFER_RAW | DESCRIPTOR_TYPE_RW_BUFFER_RAW);
    bDesc.mDesc.mSize = totalSize;
    bDesc.pData = buff.data();

    ResourceSizeAlign rsa = {};
    getResourceSizeAlign(&bDesc, &rsa);

    ResourceHeap* pHeap;
    ResourceHeapDesc desc = {};
    desc.mDescriptors = DESCRIPTOR_TYPE_BUFFER | (DESCRIPTOR_TYPE_BUFFER_RAW | DESCRIPTOR_TYPE_RW_BUFFER_RAW);
    desc.mFlags = RESOURCE_HEAP_FLAG_SHARED;
    desc.mAlignment = rsa.mAlignment;
    desc.mSize = totalSize;
    addResourceHeap(pCompute, &desc, &pHeap);
    ResourcePlacement placement{pHeap};

    BufferDesc buffDesc = {};
    buffDesc.pName = "SharedBuffer";
    buffDesc.mFlags = BUFFER_CREATION_FLAG_HOST_VISIBLE | BUFFER_CREATION_FLAG_HOST_COHERENT;
    buffDesc.mSize = totalSize;
    buffDesc.pPlacement = &placement;
    buffDesc.mFormat = TinyImageFormat_R32_SFLOAT;
    buffDesc.mDescriptors = DESCRIPTOR_TYPE_RW_BUFFER_RAW;
    buffDesc.mNodeIndex = pCompute->mUnlinkedRendererIndex;
    addBuffer(pCompute, &buffDesc, &pBuffer);
}
```


I would kindly appreciate help to get a simple example working 🙏 Thanks in advance 🙏

Hideman85 commented 2 months ago

In the end I found the right way to do it, as follows:

```c++
SyncToken token = {};
BufferLoadDesc desc = {};
desc.mDesc.mDescriptors = DESCRIPTOR_TYPE_RW_BUFFER_RAW;
desc.mDesc.mFlags = BUFFER_CREATION_FLAG_PERSISTENT_MAP_BIT | BUFFER_CREATION_FLAG_HOST_VISIBLE | BUFFER_CREATION_FLAG_HOST_COHERENT;
desc.mDesc.mMemoryUsage = RESOURCE_MEMORY_USAGE_GPU_TO_CPU;
desc.mDesc.mStartState = RESOURCE_STATE_SHADER_RESOURCE;
desc.mDesc.mFormat = TinyImageFormat_R32_SFLOAT;
desc.mDesc.mSize = NB_ELEMENTS * sizeof(float);
desc.mDesc.mElementCount = NB_ELEMENTS;
desc.mDesc.mStructStride = sizeof(float);
desc.mDesc.mNodeIndex = pCompute->mUnlinkedRendererIndex;
desc.ppBuffer = &pComputeBuffer;
addResource(&desc, &token);
waitForToken(&token);
float* data = (float*)pComputeBuffer->pCpuMappedAddress;
```

The rest is already above.
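
For anyone landing here later, here is a rough sketch of how the dispatch and readback could look on the unlinked compute renderer. The `pComputeQueue`, `pComputeCmd` and `pComputeFence` objects (plus the `pDescriptorSet` from the binding sketch above) are assumptions on my side and are not shown elsewhere in this issue:

```c++
// Record the compute work (pComputeCmd is assumed to come from a compute queue on pCompute)
beginCmd(pComputeCmd);
cmdBindPipeline(pComputeCmd, pPipeline);
cmdBindDescriptorSet(pComputeCmd, 0, pDescriptorSet); // set that points "myData" at pComputeBuffer

// NUM_THREADS(8, 8, 1) -> 64 threads per group; size the dispatch to have at least one thread per float
const uint32_t threadsPerGroup = 8 * 8;
cmdDispatch(pComputeCmd, (NB_ELEMENTS + threadsPerGroup - 1) / threadsPerGroup, 1, 1);
endCmd(pComputeCmd);

// Submit and wait so the CPU only reads once the GPU is done
QueueSubmitDesc submitDesc = {};
submitDesc.mCmdCount = 1;
submitDesc.ppCmds = &pComputeCmd;
submitDesc.pSignalFence = pComputeFence;
queueSubmit(pComputeQueue, &submitDesc);
waitForFences(pCompute, 1, &pComputeFence);

// The buffer is persistently mapped and host coherent, so the results are directly visible on the CPU
float* data = (float*)pComputeBuffer->pCpuMappedAddress;
```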