KomputeProject / kompute

General purpose GPU compute framework built on Vulkan to support 1000s of cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing usecases. Backed by the Linux Foundation.
http://kompute.cc/
Apache License 2.0

Warning: ``std430`` basically required for some reason #401

Closed: Mr-Thack closed this issue 3 hours ago

Mr-Thack commented 3 hours ago

This isn't really an issue, but a warning to anyone who is new to Vulkan and is attempting to use this.

If it seems as if your GPU is not reading the data you sent to it from the CPU, it might be because std430 seems to be required.

Here is an example of what I mean:

When you type this:

layout(binding = 0) buffer Input {
    uint slices[3];
};

The GLSL Compiler assumes you meant this:

layout(std140, binding = 0) buffer Input {
    uint slices[3];
};

But for some applications, what you actually need might be:

layout(std430, binding = 0) buffer Input {
    uint slices[3];
};

The stdXXX qualifier signifies the standard used to lay out the data that is transmitted to the GPU. I don't really know why this happens, or how we would tell Kompute to use a different standard, but std430 is more efficient (it adds less padding), so you might as well use that anyway.
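
To make the difference concrete, here is a minimal C++ sketch of the array-stride rule for arrays of scalars such as `uint slices[3]`. The names `Layout`, `scalarArrayStride` and `scalarArrayOffset` are hypothetical helpers for illustration, not part of Kompute or Vulkan. The rule encoded is the one from the layout descriptions: std140 rounds the array stride up to 16 bytes, while std430 keeps the scalar's own size.

```cpp
#include <cstddef>

// Hypothetical helper type for illustration; not part of Kompute or Vulkan.
enum class Layout { Std140, Std430 };

// Array stride in bytes for an array of scalars of size `scalarSize`.
// std140 rounds the stride up to a multiple of 16 bytes (the size of a vec4);
// std430 keeps the scalar's own size, like a C++ array.
constexpr std::size_t scalarArrayStride(Layout layout, std::size_t scalarSize) {
    return layout == Layout::Std140 ? ((scalarSize + 15) / 16) * 16 : scalarSize;
}

// Byte offset of element `index` inside such an array.
constexpr std::size_t scalarArrayOffset(Layout layout, std::size_t scalarSize,
                                        std::size_t index) {
    return index * scalarArrayStride(layout, scalarSize);
}

// For `uint slices[3]`, element 1 sits at byte 16 under std140 but at byte 4 under std430.
static_assert(scalarArrayOffset(Layout::Std140, 4, 1) == 16, "std140 stride is 16");
static_assert(scalarArrayOffset(Layout::Std430, 4, 1) == 4, "std430 stride is 4");
```

Under these assumptions, a tightly packed `std::vector<uint32_t>` on the host only matches the std430 offsets.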

Mr-Thack commented 3 hours ago

Minimum Reproducible Example

Directory Structure:

├── CMakeLists.txt
├── shader
│   └── adding.comp
└── src
    └── main.cpp

CMakeLists.txt

cmake_minimum_required(VERSION 3.20)
project(issue)

set(CMAKE_CXX_STANDARD 14)

# Options
option(KOMPUTE_OPT_GIT_TAG "The tag of the repo to use for the example" master)
option(KOMPUTE_OPT_FROM_SOURCE "Whether to build example from source or from git fetch repo" OFF)

if(KOMPUTE_OPT_FROM_SOURCE)
    add_subdirectory(../../ ${CMAKE_CURRENT_BINARY_DIR}/kompute_build)
else()
    include(FetchContent)
    FetchContent_Declare(kompute GIT_REPOSITORY https://github.com/KomputeProject/kompute.git
        GIT_TAG ${KOMPUTE_OPT_GIT_TAG})
    FetchContent_MakeAvailable(kompute)
    include_directories(${kompute_SOURCE_DIR}/src/include)
endif()

# Compiling shader
# To add more shaders simply copy the vulkan_compile_shader command and replace it with your new shader
vulkan_compile_shader(
  INFILE shader/adding.comp
  OUTFILE shader/adding.hpp
  NAMESPACE "shader")

# Then add it to the library, so you can access it later in your code
add_library(shader INTERFACE "shader/adding.hpp")
target_include_directories(shader INTERFACE $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>)

# Setting up main example code
add_executable(issue src/main.cpp)
target_link_libraries(issue PRIVATE shader kompute::kompute)

adding.comp

#version 450

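layout(std430, binding = 0) buffer Input {
    vec3 in_data[];   // renamed: `input` is a reserved word in GLSL
};

layout(std430, binding = 1) buffer Output {
    vec3 out_data[];  // renamed: `output` is a reserved word in GLSL
};

void main() {
    uint index = gl_GlobalInvocationID.x;

    // Just copy stuff from the input buffer to the output buffer
    out_data[index] = in_data[index];
}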

main.cpp


#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <kompute/Kompute.hpp>
#include <kompute/Tensor.hpp>
#include <shader/adding.hpp>

typedef std::shared_ptr<kp::TensorT<uint32_t>> UTens;
typedef std::vector<uint32_t> uvec;

void printTensor(std::string name, UTens tensor) {
  std::cout << name;
  std::cout << ": {  ";
  for (const uint32_t &elem : tensor->vector()) {
    std::cout << elem << "  ";
  }
  std::cout << "}" << std::endl;
}

int main() {
  kp::Manager mgr;

  UTens input = mgr.tensorT(uvec{2, 4, 6});
  UTens output = mgr.tensorT(uvec{0, 0, 0});

  const std::vector<std::shared_ptr<kp::Memory>> params = {input, output};

  const std::vector<uint32_t> shader = std::vector<uint32_t>(
      shader::ADDING_COMP_SPV.begin(), shader::ADDING_COMP_SPV.end());
  std::shared_ptr<kp::Algorithm> algo = mgr.algorithm(params, shader);

  mgr.sequence()
      ->record<kp::OpSyncDevice>(params)
      ->record<kp::OpAlgoDispatch>(algo)
      ->record<kp::OpSyncLocal>(params)
      ->eval();

  printTensor("Input", input);
  printTensor("Output", output);
}

And then run with: mkdir build && cd build && cmake .. && cmake --build . && ./issue

If there is any typo, please forgive me. Thank you!

Mr-Thack commented 3 hours ago

Almost forgot to mention:

For some reason, it would output {2, 0, 0} instead of {2, 4, 6} when using std140.
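
One possible explanation for this result, sketched in C++ for the `uint slices[3]` block from the first comment: the host uploads three tightly packed 4-byte values (12 bytes in total), while a std140 array expects 16 bytes between consecutive elements, so only element 0 lines up. The helpers `packTight` and `readWithStride` are hypothetical, and returning 0 for reads past the end of the buffer is an assumption; what a real GPU returns for out-of-bounds reads depends on the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy host values into raw bytes, tightly packed as in a std::vector<uint32_t>.
std::vector<unsigned char> packTight(const std::vector<uint32_t>& values) {
    std::vector<unsigned char> bytes(values.size() * sizeof(uint32_t));
    if (!values.empty()) {
        std::memcpy(bytes.data(), values.data(), bytes.size());
    }
    return bytes;
}

// Hypothetical model of a shader reading `count` uints with a given array stride.
// ASSUMPTION: elements that fall outside the buffer read as 0.
std::vector<uint32_t> readWithStride(const std::vector<unsigned char>& bytes,
                                     std::size_t stride, std::size_t count) {
    assert(stride >= sizeof(uint32_t));  // elements must not overlap
    std::vector<uint32_t> out(count, 0u);
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t offset = i * stride;
        if (offset + sizeof(uint32_t) <= bytes.size()) {
            std::memcpy(&out[i], bytes.data() + offset, sizeof(uint32_t));
        }
    }
    return out;
}
```

Calling `readWithStride(packTight({2, 4, 6}), 16, 3)` models the std140 stride, and passing a stride of 4 models std430.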

axsaucedo commented 36 minutes ago

Looking at the standards description:

std140: This layout alleviates the need to query the offsets for definitions. The rules of std140 layout explicitly state the layout arrangement of any interface block declared with this layout. This also means that such an interface block can be shared across programs, much like shared. The only downside to this layout type is that the rules for packing elements into arrays/structs can introduce a lot of unnecessary padding.

The rules for std140 layout are covered quite well in the OpenGL specification (OpenGL 4.5, Section 7.6.2.2, page 137). Among the most important is the fact that arrays of types are not necessarily tightly packed. An array of floats in such a block will not be the equivalent to an array of floats in C/C++. The array stride (the bytes between array elements) is always rounded up to the size of a vec4 (i.e. 16 bytes). So arrays will only match their C/C++ definitions if the type is a multiple of 16 bytes.

Warning: Implementations sometimes get the std140 layout wrong for vec3 components. You are advised to manually pad your structures/arrays out and avoid using vec3 at all.

std430: This layout works like std140, except with a few optimizations in the alignment and strides for arrays and structs of scalars and vector elements (except for vec3 elements, which remain unchanged from std140). Specifically, they are no longer rounded up to a multiple of 16 bytes. So an array of floats will match with a C++ array of floats.

Note that this layout can only be used with shader storage blocks, not uniform blocks.
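
The advice to pad manually can be sketched on the host side in C++. This is an illustration only: `PaddedVec3` and `padVec3` are hypothetical names, not Kompute types. It assumes the shader declares an array of `vec3`, whose array stride is 16 bytes in both std140 and std430, so each element on the host carries one extra float of padding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of one element of a GLSL `vec3` array.
// The explicit `pad` member brings the size to 16 bytes, the vec3 array stride.
struct PaddedVec3 {
    float x;
    float y;
    float z;
    float pad;
};
static_assert(sizeof(PaddedVec3) == 16, "PaddedVec3 must match the 16-byte stride");

// Expand tightly packed xyz triples into padded 16-byte elements.
std::vector<PaddedVec3> padVec3(const std::vector<float>& xyz) {
    assert(xyz.size() % 3 == 0);  // input must hold whole triples
    std::vector<PaddedVec3> out;
    out.reserve(xyz.size() / 3);
    for (std::size_t i = 0; i + 2 < xyz.size(); i += 3) {
        out.push_back(PaddedVec3{xyz[i], xyz[i + 1], xyz[i + 2], 0.0f});
    }
    return out;
}
```

The padded vector can then be uploaded as raw bytes. Declaring `vec4` in the shader instead of `vec3` gives the same 16-byte elements and avoids the ambiguity entirely.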

The warning in particular suggests that the memory layout may differ, so I assume the input values are being read differently. This is certainly an important catch: one of Kompute's principles is explicit over implicit, so making this clear would be best.