Open DifferentialityDevelopment opened 1 month ago
I've managed to upgrade it to do a dot product between two matrices of x * x
As a test I did a dot product between two matrices of 2048 * 2048 and it's using floats for each element.
Next I want to upgrade it to handle matrices that are not a multiple of 32 and that are not square, then try to build a compute shader that can do attention with softmax on qkv matrices.
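One common way to handle sizes that are not a multiple of 32 is to round the workgroup count up and have the shader skip invocations that fall outside the matrix. Here is a minimal host-side sketch of that sizing logic. The tile size of 32 and the names `TILE`, `ceilDiv`, `dispatchFor` and `inBounds` are assumptions for illustration, not taken from the attached code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tile size: one workgroup covers a TILE x TILE patch of the output.
constexpr uint32_t TILE = 32;

struct DispatchSize {
    uint32_t groupsX; // workgroups along the output columns
    uint32_t groupsY; // workgroups along the output rows
};

// Round up so a partial tile at the edge still gets a workgroup.
constexpr uint32_t ceilDiv(uint32_t value, uint32_t divisor) {
    return (value + divisor - 1) / divisor;
}

// rows x cols is the shape of the output matrix; it need not be square
// or a multiple of TILE.
inline DispatchSize dispatchFor(uint32_t rows, uint32_t cols) {
    assert(rows > 0 && cols > 0);
    return DispatchSize{ceilDiv(cols, TILE), ceilDiv(rows, TILE)};
}

// Mirrors the guard the shader would need: invocations that land outside
// the matrix in an edge tile must do nothing.
constexpr bool inBounds(uint32_t row, uint32_t col, uint32_t rows, uint32_t cols) {
    return row < rows && col < cols;
}
```

With this, a 2048 * 2048 output still gets 64 * 64 workgroups, while a 100 * 33 output gets 2 groups across and 4 down, with the extra invocations masked out by the bounds check.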
The only really tricky part is building the compute shader correctly. Integrating it wouldn't be that hard: the workers/root node could selectively offload certain calculations to the compute shader on the GPU, and you can account for GPU memory restrictions by splitting the load into sequential calls to the GPU.
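The splitting idea can be sketched roughly as below, assuming the weights are laid out row by row and that each output row only depends on its own weight row, so chunks can be processed one after another and their outputs concatenated. `RowChunk`, `planChunks` and the byte budget are hypothetical names for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous range of weight rows handled by one GPU call.
struct RowChunk {
    std::size_t firstRow;
    std::size_t rowCount;
};

// Splits `totalRows` weight rows into chunks whose weights fit in
// `budgetBytes`. `bytesPerRow` is the size of one row of weights.
inline std::vector<RowChunk> planChunks(std::size_t totalRows,
                                        std::size_t bytesPerRow,
                                        std::size_t budgetBytes) {
    assert(bytesPerRow > 0 && budgetBytes >= bytesPerRow);
    const std::size_t rowsPerCall = budgetBytes / bytesPerRow;
    std::vector<RowChunk> chunks;
    for (std::size_t row = 0; row < totalRows; row += rowsPerCall) {
        // The last chunk may be shorter than the others.
        chunks.push_back({row, std::min(rowsPerCall, totalRows - row)});
    }
    return chunks;
}
```

Each chunk would then be uploaded, dispatched and read back in turn. The budget would presumably come from the device's reported limits or memory heaps, which this sketch leaves out.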
As before here is a copy of the code as it is now: compute-matmul.zip
Great! I was wondering about CUDA as the first accelerator, but for Raspberry Pi, Vulkan may be a better choice.
Please check the llama.cpp repository, they have implemented the matrix multiplication already.
Yeah, Vulkan is nice because it has a wide range of support, not just for SBCs but also for computers with AMD or Nvidia cards.
I've basically got dot product multiplication done but there are some nitty gritty difficult issues that I still need to figure out.
Also yeah not a bad idea, busy looking at these shaders right now. https://github.com/ggerganov/llama.cpp/blob/master/kompute-shaders/op_mul_mat_f16.comp
I've at least gained most of the knowledge now that I need to actually utilize their shaders.
https://ai.google.dev/edge/lite/microcontrollers/python ? for raspberry pi https://github.com/tensorflow/tflite-micro
https://www.tensorflow.org/guide/distributed_training distributed tensorflow?
I've been working on how I'd integrate it into distributed-llama and I think I have a decent idea of how to go about it. With the way I have it now, I can offload certain things to Vulkan compute shaders. For instance, I can start with just the llamaQkv task: where it does the 3 matmul calls, it instead calls matmulVulkan. Everything is added seamlessly with ifdefs. All you would need to do to enable the Vulkan features is run make main VULKAN=1.
I am working towards getting at least the 6 matmul functions offloaded to Vulkan, then I'll submit a pull request for it. I'll have to see in practice how well it performs.
Ideally I'd have 1 compute shader for all 6, but for simplicity's sake I'm going to use 6 different compute shaders, one for each: matmulF32, matmulF16, matmulQ40, matmulQ80, matmulQ40vQ80 & matmulQ80vQ80.
This is my in-progress branch: https://github.com/DifferentialityDevelopment/distributed-llama/tree/vulkan-acceleration It's not quite working just yet, but it's mostly integrated; I just need to sort out the kinks now.
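The ifdef-based switch could look something like the sketch below. The macro name `VULKAN`, the wrapper `matmul`, the CPU helper `matmulCpu` and the signature given to `matmulVulkan` are all assumptions for illustration. Only the name matmulVulkan and the `VULKAN=1` make flag come from the description above.

```cpp
#include <cassert>
#include <cstddef>

// CPU reference: output[r] = dot(row r of weights, input).
// Weights are row-major with shape rows x cols.
inline void matmulCpu(float* output, const float* input, const float* weights,
                      std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            sum += weights[r * cols + c] * input[c];
        }
        output[r] = sum;
    }
}

#ifdef VULKAN
// Hypothetical signature; would be defined in the Vulkan backend (not shown).
void matmulVulkan(float* output, const float* input, const float* weights,
                  std::size_t rows, std::size_t cols);
#endif

// Call sites stay the same; the build flag picks the backend.
inline void matmul(float* output, const float* input, const float* weights,
                   std::size_t rows, std::size_t cols) {
    assert(output != nullptr && input != nullptr && weights != nullptr);
#ifdef VULKAN
    matmulVulkan(output, input, weights, rows, cols);
#else
    matmulCpu(output, input, weights, rows, cols);
#endif
}
```

Keeping the switch inside one wrapper means a build without the flag compiles exactly the CPU path and needs no Vulkan headers.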
@b4rtaz Getting there...
Well, I actually managed to get Vulkan acceleration working!
./vulkan-test
WARNING: dzn is not a conformant Vulkan implementation, testing use only.
Created Vulkan Instance!
Device Name: Microsoft Direct3D12 (NVIDIA GeForce RTX 3060)
API Version: 1.2.274
Create the buffers
Get memory requirements for the buffers
Allocate and map memory for the buffers
Bind the memory to the buffers
Copy the weights to GPU memory
Copy the input to GPU memory
Copy the matmul info to GPU memory
Bind the buffers to the descriptor sets
Write and update the descriptor sets
Create a pointer to the commandBuffer member
Bind pipeline and descriptor sets to the command buffer
Wait for the compute shader to finish
Got the output from the compute shader
✅ matmulQ80
✅ matmulQ80vQ80
Only matmulF32 at the moment. I want to do these next: matmulF16, matmulQ40, matmulQ80, matmulQ40vQ80 and matmulQ80vQ80.
Once I've got them done as well then I'll do some speed tests to see what kind of an uplift this has.
Need to figure out what Vulkan extensions I need to enable to support the Q40 and Q80 data types in the compute shader.
The actual shader implementation isn't that complicated luckily, but I need to use specific data types.
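For context on what such a data type involves, here is a rough CPU-side sketch of 8-bit block quantization in the style of ggml's Q8_0, which the Q80 name suggests but which is an assumption here. The block length of 32, the `BlockQ80` layout (with a plain float scale instead of a 16-bit one) and the function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int BLOCK_SIZE = 32; // assumed block length

struct BlockQ80 {
    float scale;               // real layouts typically store this as a 16-bit float
    int8_t values[BLOCK_SIZE]; // quantized elements in [-127, 127]
};

// Quantizes BLOCK_SIZE floats to int8 with one shared scale.
inline BlockQ80 quantizeQ80(const float* x) {
    assert(x != nullptr);
    float maxAbs = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        maxAbs = std::fmax(maxAbs, std::fabs(x[i]));
    }
    BlockQ80 block;
    block.scale = maxAbs / 127.0f;
    const float inverse = block.scale != 0.0f ? 1.0f / block.scale : 0.0f;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        block.values[i] = static_cast<int8_t>(std::lround(x[i] * inverse));
    }
    return block;
}

// Dot product of two blocks: integer accumulate, then apply both scales.
inline float dotQ80(const BlockQ80& a, const BlockQ80& b) {
    int32_t sum = 0;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        sum += static_cast<int32_t>(a.values[i]) * static_cast<int32_t>(b.values[i]);
    }
    return a.scale * b.scale * static_cast<float>(sum);
}
```

The int8 loads and the 16-bit scale are the parts that need 8-bit and 16-bit storage support on the shader side, which is why the device's extension list matters.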
I'm very close now, just need to get the compute shader code to work correctly. I wrote tests to be able to compare the matmul results between CPU and GPU to ensure correctness.
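A comparison test along those lines might look like this sketch. The helper names and the choice of a relative-error metric are assumptions, not the actual tests in the branch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest relative difference between the CPU and GPU result vectors.
inline float maxRelativeError(const std::vector<float>& cpu,
                              const std::vector<float>& gpu) {
    assert(cpu.size() == gpu.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        // Guard against dividing by zero when the reference value is 0.
        const float scale = std::max(std::fabs(cpu[i]), 1e-12f);
        worst = std::max(worst, std::fabs(cpu[i] - gpu[i]) / scale);
    }
    return worst;
}

// True when every element agrees within `tolerance` (relative).
inline bool resultsMatch(const std::vector<float>& cpu,
                         const std::vector<float>& gpu,
                         float tolerance) {
    return maxRelativeError(cpu, gpu) <= tolerance;
}
```

A relative tolerance is used because quantized kernels rarely reproduce the CPU result bit for bit. The right threshold depends on the data type and has to be picked per kernel.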
I've successfully gotten matmulQ40Q80 to run via a compute shader on Vulkan 🔥 and get results back that are nearly 1:1 with the CPU-calculated results. You won't be able to run Vulkan mode under WSL, not until int8 support comes via the drivers. I'm going to do some speed tests now; I have to install Linux on an SSD and boot directly into Linux, which is the only way to get native support for int8 on Vulkan. However, this whole thing is only a problem as long as this project is Linux-only. If it could run natively on Windows it would be a different story altogether, but that would require a threading implementation that works cross-platform, since Windows doesn't work with pthread.h.
One last thing I need to figure out: if I run it in inference mode with more than 1 thread running at a time, it bugs out. It's not yet multithread-capable, but I will sort that out soon.
In the meantime, I can check how fast it is compared to CPU on a single matmul pass.
I just printed the first 32 floats coming from each:
CPU Results: 0.00731812 0.00678142 0.00680221 0.00671218 0.00693316 0.00689348 0.00708914 0.00695994 0.00664946 0.00704365 0.00665891 0.00735391 0.00661354 0.00689124 0.00729823 0.0068318 0.00696582 0.00684787 0.00673844 0.0071383 0.00692065 0.00697429 0.00682781 0.00695222 0.0068927 0.00702631 0.00696984 0.00717608 0.00726813 0.00741034 0.00734587 0.00691799
Vulkan Results: 0.00737184 0.0069693 0.00689848 0.00662084 0.00692127 0.00690478 0.00713998 0.00661354 0.00675865 0.00720126 0.00705041 0.00735674 0.00685534 0.00662042 0.00709984 0.00699035 0.00676422 0.00702015 0.00673056 0.00712065 0.00695963 0.00703757 0.00696171 0.00701772 0.00682509 0.00709918 0.00704923 0.00718651 0.00713319 0.00719681 0.00736942 0.00710319
✅ matmulQ40Q80
I set up Linux on another partition so that I could get native int8 GPU functionality without WSL ruining the party.
Then I ran some speed tests of a single pass. The shader runs fairly well, but at very low matrix dimensions the CPU is actually faster. I haven't yet been able to test large enough dimensions because of a weird bug that causes a segfault, and I can't figure out why yet.
n: number of rows, d: number of columns

n = 512, d = 256
CPU matmulQ40Q80 - Duration: 0.015148 ms
Shader execution time: 0.117442 ms
Vulkan matmulQ40Q80 - Duration: 0.965542 ms

n = 512, d = 1024
CPU matmulQ40Q80 - Duration: 0.07343 ms
Shader execution time: 0.207296 ms
Vulkan matmulQ40Q80 - Duration: 1.44746 ms

n = 512, d = 3072
CPU matmulQ40Q80 - Duration: 0.184752 ms
Shader execution time: 0.292468 ms
Vulkan matmulQ40Q80 - Duration: 2.18394 ms
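To read these numbers, it helps to separate the shader time from everything around it. The sketch below assumes that the Vulkan "Duration" figure includes the shader execution time plus buffer copies and submission overhead. That reading is a guess, and the `Sample` struct and helper names are hypothetical.

```cpp
#include <cassert>

// One row of the timings above, in milliseconds.
struct Sample {
    double cpuMs;         // CPU matmulQ40Q80 duration
    double shaderMs;      // shader execution time
    double vulkanTotalMs; // Vulkan matmulQ40Q80 duration
};

// Time spent outside the shader, assuming the total includes the shader time.
inline double overheadMs(const Sample& s) {
    assert(s.vulkanTotalMs >= s.shaderMs);
    return s.vulkanTotalMs - s.shaderMs;
}

// How many times slower the shader alone is compared to the CPU.
inline double shaderSlowdown(const Sample& s) {
    assert(s.cpuMs > 0.0);
    return s.shaderMs / s.cpuMs;
}
```

On the figures above, the shader alone goes from roughly 7.8x slower than the CPU at d = 256 to roughly 1.6x slower at d = 3072, so the gap narrows as the matrix grows. Under the same assumption, about 1.9 ms of the 2.18 ms total at d = 3072 is spent outside the shader.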
Raspberry Pi 3B+, Ubuntu Server 22.04, Vulkan information:
ubuntu@ubuntu:~$ vulkaninfo
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 4. Skipping ICD.
'DISPLAY' environment variable not set... skipping surface info
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.204
Instance Extensions: count = 20
===============================
VK_EXT_acquire_drm_display : extension revision 1
VK_EXT_acquire_xlib_display : extension revision 1
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_direct_mode_display : extension revision 1
VK_EXT_display_surface_counter : extension revision 1
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_display : extension revision 23
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2 : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_wayland_surface : extension revision 6
VK_KHR_xcb_surface : extension revision 6
VK_KHR_xlib_surface : extension revision 6
Layers: count = 2
=================
VK_LAYER_MESA_device_select (Linux device selection layer) Vulkan version 1.3.211, layer version 1:
Layer Extensions: count = 0
Devices: count = 1
GPU id = 0 (llvmpipe (LLVM 15.0.7, 128 bits))
Layer-Device Extensions: count = 0
VK_LAYER_MESA_overlay (Mesa Overlay layer) Vulkan version 1.3.211, layer version 1:
Layer Extensions: count = 0
Devices: count = 1
GPU id = 0 (llvmpipe (LLVM 15.0.7, 128 bits))
Layer-Device Extensions: count = 0
Device Groups:
==============
Group 0:
Properties:
physicalDevices: count = 1
llvmpipe (LLVM 15.0.7, 128 bits) (ID: 0)
subsetAllocation = 0
Present Capabilities:
llvmpipe (LLVM 15.0.7, 128 bits) (ID: 0):
Can present images from the following devices: count = 1
llvmpipe (LLVM 15.0.7, 128 bits) (ID: 0)
Present modes: count = 1
DEVICE_GROUP_PRESENT_MODE_LOCAL_BIT_KHR
Device Properties and Extensions:
=================================
GPU0:
VkPhysicalDeviceProperties:
---------------------------
apiVersion = 4206847 (1.3.255)
driverVersion = 1 (0x0001)
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 15.0.7, 128 bits)
pipelineCacheUUID = 32332e32-2e31-2d31-7562-756e7475332e
VkPhysicalDeviceLimits:
-----------------------
maxImageDimension1D = 16384
maxImageDimension2D = 16384
maxImageDimension3D = 4096
maxImageDimensionCube = 32768
maxImageArrayLayers = 2048
maxTexelBufferElements = 134217728
maxUniformBufferRange = 65536
maxStorageBufferRange = 134217728
maxPushConstantsSize = 256
maxMemoryAllocationCount = 4294967295
maxSamplerAllocationCount = 32768
bufferImageGranularity = 0x00000040
sparseAddressSpaceSize = 0x00000000
maxBoundDescriptorSets = 8
maxPerStageDescriptorSamplers = 1000000
maxPerStageDescriptorUniformBuffers = 1000000
maxPerStageDescriptorStorageBuffers = 1000000
maxPerStageDescriptorSampledImages = 1000000
maxPerStageDescriptorStorageImages = 1000000
maxPerStageDescriptorInputAttachments = 1000000
maxPerStageResources = 1000000
maxDescriptorSetSamplers = 1000000
maxDescriptorSetUniformBuffers = 1000000
maxDescriptorSetUniformBuffersDynamic = 1000000
maxDescriptorSetStorageBuffers = 1000000
maxDescriptorSetStorageBuffersDynamic = 1000000
maxDescriptorSetSampledImages = 1000000
maxDescriptorSetStorageImages = 1000000
maxDescriptorSetInputAttachments = 1000000
maxVertexInputAttributes = 32
maxVertexInputBindings = 32
maxVertexInputAttributeOffset = 2047
maxVertexInputBindingStride = 2048
maxVertexOutputComponents = 128
maxTessellationGenerationLevel = 64
maxTessellationPatchSize = 32
maxTessellationControlPerVertexInputComponents = 128
maxTessellationControlPerVertexOutputComponents = 128
maxTessellationControlPerPatchOutputComponents = 128
maxTessellationControlTotalOutputComponents = 4096
maxTessellationEvaluationInputComponents = 128
maxTessellationEvaluationOutputComponents = 128
maxGeometryShaderInvocations = 32
maxGeometryInputComponents = 64
maxGeometryOutputComponents = 128
maxGeometryOutputVertices = 1024
maxGeometryTotalOutputComponents = 1024
maxFragmentInputComponents = 128
maxFragmentOutputAttachments = 8
maxFragmentDualSrcAttachments = 2
maxFragmentCombinedOutputResources = 104
maxComputeSharedMemorySize = 32768
maxComputeWorkGroupCount: count = 3
65535
65535
65535
maxComputeWorkGroupInvocations = 1024
maxComputeWorkGroupSize: count = 3
1024
1024
1024
subPixelPrecisionBits = 8
subTexelPrecisionBits = 8
mipmapPrecisionBits = 4
maxDrawIndexedIndexValue = 4294967295
maxDrawIndirectCount = 4294967295
maxSamplerLodBias = 16
maxSamplerAnisotropy = 16
maxViewports = 16
maxViewportDimensions: count = 2
16384
16384
viewportBoundsRange: count = 2
-32768
32768
viewportSubPixelBits = 0
minMemoryMapAlignment = 64
minTexelBufferOffsetAlignment = 0x00000010
minUniformBufferOffsetAlignment = 0x00000010
minStorageBufferOffsetAlignment = 0x00000010
minTexelOffset = -32
maxTexelOffset = 31
minTexelGatherOffset = -32
maxTexelGatherOffset = 31
minInterpolationOffset = -2
maxInterpolationOffset = 2
subPixelInterpolationOffsetBits = 8
maxFramebufferWidth = 16384
maxFramebufferHeight = 16384
maxFramebufferLayers = 2048
framebufferColorSampleCounts: count = 2
SAMPLE_COUNT_1_BIT
SAMPLE_COUNT_4_BIT
framebufferDepthSampleCounts: count = 2
SAMPLE_COUNT_1_BIT
SAMPLE_COUNT_4_BIT
framebufferStencilSampleCounts: count = 2
SAMPLE_COUNT_1_BIT
SAMPLE_COUNT_4_BIT
framebufferNoAttachmentsSampleCounts: count = 2
SAMPLE_COUNT_1_BIT
SAMPLE_COUNT_4_BIT
maxColorAttachments = 8
sampledImageColorSampleCounts: count = 2
SAMPLE_COUNT_1_BIT
SAMPLE_COUNT_4_BIT
sampledImageIntegerSampleCounts: count = 2
SAMPLE_COUNT_1_BIT
SAMPLE_COUNT_4_BIT
sampledImageDepthSampleCounts: count = 2
SAMPLE_COUNT_1_BIT
SAMPLE_COUNT_4_BIT
sampledImageStencilSampleCounts: count = 2
SAMPLE_COUNT_1_BIT
SAMPLE_COUNT_4_BIT
storageImageSampleCounts: count = 2
SAMPLE_COUNT_1_BIT
SAMPLE_COUNT_4_BIT
maxSampleMaskWords = 1
timestampComputeAndGraphics = true
timestampPeriod = 1
maxClipDistances = 8
maxCullDistances = 8
maxCombinedClipAndCullDistances = 8
discreteQueuePriorities = 2
pointSizeRange: count = 2
0
255
lineWidthRange: count = 2
1
255
pointSizeGranularity = 0.125
lineWidthGranularity = 0.0078125
strictLines = true
standardSampleLocations = true
optimalBufferCopyOffsetAlignment = 0x00000080
optimalBufferCopyRowPitchAlignment = 0x00000080
nonCoherentAtomSize = 0x00000040
VkPhysicalDeviceSparseProperties:
---------------------------------
residencyStandard2DBlockShape = false
residencyStandard2DMultisampleBlockShape = false
residencyStandard3DBlockShape = false
residencyAlignedMipSize = false
residencyNonResidentStrict = false
VkPhysicalDeviceCustomBorderColorPropertiesEXT:
-----------------------------------------------
maxCustomBorderColorSamplers = 32768
VkPhysicalDeviceDepthStencilResolveProperties:
----------------------------------------------
supportedDepthResolveModes: count = 2
RESOLVE_MODE_SAMPLE_ZERO_BIT
RESOLVE_MODE_AVERAGE_BIT
supportedStencilResolveModes: count = 1
RESOLVE_MODE_SAMPLE_ZERO_BIT
independentResolveNone = false
independentResolve = false
VkPhysicalDeviceDescriptorIndexingProperties:
---------------------------------------------
maxUpdateAfterBindDescriptorsInAllPools = 4294967295
shaderUniformBufferArrayNonUniformIndexingNative = true
shaderSampledImageArrayNonUniformIndexingNative = true
shaderStorageBufferArrayNonUniformIndexingNative = true
shaderStorageImageArrayNonUniformIndexingNative = true
shaderInputAttachmentArrayNonUniformIndexingNative = true
robustBufferAccessUpdateAfterBind = true
quadDivergentImplicitLod = true
maxPerStageDescriptorUpdateAfterBindSamplers = 1000000
maxPerStageDescriptorUpdateAfterBindUniformBuffers = 1000000
maxPerStageDescriptorUpdateAfterBindStorageBuffers = 1000000
maxPerStageDescriptorUpdateAfterBindSampledImages = 1000000
maxPerStageDescriptorUpdateAfterBindStorageImages = 1000000
maxPerStageDescriptorUpdateAfterBindInputAttachments = 1000000
maxPerStageUpdateAfterBindResources = 1000000
maxDescriptorSetUpdateAfterBindSamplers = 1000000
maxDescriptorSetUpdateAfterBindUniformBuffers = 1000000
maxDescriptorSetUpdateAfterBindUniformBuffersDynamic = 1000000
maxDescriptorSetUpdateAfterBindStorageBuffers = 1000000
maxDescriptorSetUpdateAfterBindStorageBuffersDynamic = 1000000
maxDescriptorSetUpdateAfterBindSampledImages = 1000000
maxDescriptorSetUpdateAfterBindStorageImages = 1000000
maxDescriptorSetUpdateAfterBindInputAttachments = 1000000
VkPhysicalDeviceDriverProperties:
---------------------------------
driverID = DRIVER_ID_MESA_LLVMPIPE
driverName = llvmpipe
driverInfo = Mesa 23.2.1-1ubuntu3.1~22.04.2 (LLVM 15.0.7)
conformanceVersion = 1.3.1.1
VkPhysicalDeviceExternalMemoryHostPropertiesEXT:
------------------------------------------------
minImportedHostPointerAlignment = 0x00001000
VkPhysicalDeviceFloatControlsProperties:
----------------------------------------
denormBehaviorIndependence = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
roundingModeIndependence = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
shaderSignedZeroInfNanPreserveFloat16 = true
shaderSignedZeroInfNanPreserveFloat32 = true
shaderSignedZeroInfNanPreserveFloat64 = true
shaderDenormPreserveFloat16 = false
shaderDenormPreserveFloat32 = false
shaderDenormPreserveFloat64 = false
shaderDenormFlushToZeroFloat16 = false
shaderDenormFlushToZeroFloat32 = false
shaderDenormFlushToZeroFloat64 = false
shaderRoundingModeRTEFloat16 = true
shaderRoundingModeRTEFloat32 = true
shaderRoundingModeRTEFloat64 = true
shaderRoundingModeRTZFloat16 = false
shaderRoundingModeRTZFloat32 = false
shaderRoundingModeRTZFloat64 = false
VkPhysicalDeviceIDProperties:
-----------------------------
deviceUUID = 6d657361-3233-2e32-2e31-2d3175627500
driverUUID = 6c6c766d-7069-7065-5555-494400000000
deviceNodeMask = 0
deviceLUIDValid = false
VkPhysicalDeviceInlineUniformBlockProperties:
---------------------------------------------
maxInlineUniformBlockSize = 4096
maxPerStageDescriptorInlineUniformBlocks = 8
maxPerStageDescriptorUpdateAfterBindInlineUniformBlocks = 8
maxDescriptorSetInlineUniformBlocks = 8
maxDescriptorSetUpdateAfterBindInlineUniformBlocks = 8
VkPhysicalDeviceLineRasterizationPropertiesEXT:
-----------------------------------------------
lineSubPixelPrecisionBits = 8
VkPhysicalDeviceMaintenance3Properties:
---------------------------------------
maxPerSetDescriptors = 1000000
maxMemoryAllocationSize = 0x80000000
VkPhysicalDeviceMaintenance4Properties:
---------------------------------------
maxBufferSize = 0xffffffff
VkPhysicalDeviceMultiDrawPropertiesEXT:
---------------------------------------
maxMultiDrawCount = 2048
VkPhysicalDeviceMultiviewProperties:
------------------------------------
maxMultiviewViewCount = 6
maxMultiviewInstanceIndex = 2147483647
VkPhysicalDevicePointClippingProperties:
----------------------------------------
pointClippingBehavior = POINT_CLIPPING_BEHAVIOR_ALL_CLIP_PLANES
VkPhysicalDeviceProtectedMemoryProperties:
------------------------------------------
protectedNoFault = false
VkPhysicalDeviceProvokingVertexPropertiesEXT:
---------------------------------------------
provokingVertexModePerPipeline = true
transformFeedbackPreservesTriangleFanProvokingVertex = true
VkPhysicalDevicePushDescriptorPropertiesKHR:
--------------------------------------------
maxPushDescriptors = 32
VkPhysicalDeviceRobustness2PropertiesEXT:
-----------------------------------------
robustStorageBufferAccessSizeAlignment = 0x00000001
robustUniformBufferAccessSizeAlignment = 0x00000001
VkPhysicalDeviceSamplerFilterMinmaxProperties:
----------------------------------------------
filterMinmaxSingleComponentFormats = true
filterMinmaxImageComponentMapping = true
VkPhysicalDeviceShaderIntegerDotProductProperties:
--------------------------------------------------
integerDotProduct8BitUnsignedAccelerated = false
integerDotProduct8BitSignedAccelerated = false
integerDotProduct8BitMixedSignednessAccelerated = false
integerDotProduct4x8BitPackedUnsignedAccelerated = false
integerDotProduct4x8BitPackedSignedAccelerated = false
integerDotProduct4x8BitPackedMixedSignednessAccelerated = false
integerDotProduct16BitUnsignedAccelerated = false
integerDotProduct16BitSignedAccelerated = false
integerDotProduct16BitMixedSignednessAccelerated = false
integerDotProduct32BitUnsignedAccelerated = false
integerDotProduct32BitSignedAccelerated = false
integerDotProduct32BitMixedSignednessAccelerated = false
integerDotProduct64BitUnsignedAccelerated = false
integerDotProduct64BitSignedAccelerated = false
integerDotProduct64BitMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating8BitUnsignedAccelerated = false
integerDotProductAccumulatingSaturating8BitSignedAccelerated = false
integerDotProductAccumulatingSaturating8BitMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating4x8BitPackedUnsignedAccelerated = false
integerDotProductAccumulatingSaturating4x8BitPackedSignedAccelerated = false
integerDotProductAccumulatingSaturating4x8BitPackedMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating16BitUnsignedAccelerated = false
integerDotProductAccumulatingSaturating16BitSignedAccelerated = false
integerDotProductAccumulatingSaturating16BitMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating32BitUnsignedAccelerated = false
integerDotProductAccumulatingSaturating32BitSignedAccelerated = false
integerDotProductAccumulatingSaturating32BitMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating64BitUnsignedAccelerated = false
integerDotProductAccumulatingSaturating64BitSignedAccelerated = false
integerDotProductAccumulatingSaturating64BitMixedSignednessAccelerated = false
VkPhysicalDeviceSubgroupProperties:
-----------------------------------
subgroupSize = 4
supportedStages: count = 6
SHADER_STAGE_FRAGMENT_BIT
SHADER_STAGE_COMPUTE_BIT
SHADER_STAGE_ALL_GRAPHICS
SHADER_STAGE_ALL
SHADER_STAGE_TASK_BIT_NV
SHADER_STAGE_MESH_BIT_NV
supportedOperations: count = 7
SUBGROUP_FEATURE_BASIC_BIT
SUBGROUP_FEATURE_VOTE_BIT
SUBGROUP_FEATURE_ARITHMETIC_BIT
SUBGROUP_FEATURE_BALLOT_BIT
SUBGROUP_FEATURE_SHUFFLE_BIT
SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT
SUBGROUP_FEATURE_QUAD_BIT
quadOperationsInAllStages = false
VkPhysicalDeviceSubgroupSizeControlProperties:
----------------------------------------------
minSubgroupSize = 4
maxSubgroupSize = 4
maxComputeWorkgroupSubgroups = 32
requiredSubgroupSizeStages: count = 4
SHADER_STAGE_FRAGMENT_BIT
SHADER_STAGE_COMPUTE_BIT
SHADER_STAGE_ALL_GRAPHICS
SHADER_STAGE_ALL
VkPhysicalDeviceTexelBufferAlignmentProperties:
-----------------------------------------------
storageTexelBufferOffsetAlignmentBytes = 0x00000010
storageTexelBufferOffsetSingleTexelAlignment = true
uniformTexelBufferOffsetAlignmentBytes = 0x00000010
uniformTexelBufferOffsetSingleTexelAlignment = true
VkPhysicalDeviceTimelineSemaphoreProperties:
--------------------------------------------
maxTimelineSemaphoreValueDifference = 18446744073709551615
VkPhysicalDeviceTransformFeedbackPropertiesEXT:
-----------------------------------------------
maxTransformFeedbackStreams = 4
maxTransformFeedbackBuffers = 4
maxTransformFeedbackBufferSize = 0xffffffff
maxTransformFeedbackStreamDataSize = 512
maxTransformFeedbackBufferDataSize = 512
maxTransformFeedbackBufferDataStride = 512
transformFeedbackQueries = true
transformFeedbackStreamsLinesTriangles = false
transformFeedbackRasterizationStreamSelect = false
transformFeedbackDraw = true
VkPhysicalDeviceVertexAttributeDivisorPropertiesEXT:
----------------------------------------------------
maxVertexAttribDivisor = 4294967295
VkPhysicalDeviceVulkan11Properties:
-----------------------------------
deviceUUID = 6d657361-3233-2e32-2e31-2d3175627500
driverUUID = 6c6c766d-7069-7065-5555-494400000000
deviceNodeMask = 0
deviceLUIDValid = false
subgroupSize = 4
subgroupSupportedStages: count = 6
SHADER_STAGE_FRAGMENT_BIT
SHADER_STAGE_COMPUTE_BIT
SHADER_STAGE_ALL_GRAPHICS
SHADER_STAGE_ALL
SHADER_STAGE_TASK_BIT_NV
SHADER_STAGE_MESH_BIT_NV
subgroupSupportedOperations: count = 7
SUBGROUP_FEATURE_BASIC_BIT
SUBGROUP_FEATURE_VOTE_BIT
SUBGROUP_FEATURE_ARITHMETIC_BIT
SUBGROUP_FEATURE_BALLOT_BIT
SUBGROUP_FEATURE_SHUFFLE_BIT
SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT
SUBGROUP_FEATURE_QUAD_BIT
subgroupQuadOperationsInAllStages = false
pointClippingBehavior = POINT_CLIPPING_BEHAVIOR_ALL_CLIP_PLANES
maxMultiviewViewCount = 6
maxMultiviewInstanceIndex = 2147483647
protectedNoFault = false
maxPerSetDescriptors = 1000000
maxMemoryAllocationSize = 0x80000000
VkPhysicalDeviceVulkan12Properties:
-----------------------------------
driverID = DRIVER_ID_MESA_LLVMPIPE
driverName = llvmpipe
driverInfo = Mesa 23.2.1-1ubuntu3.1~22.04.2 (LLVM 15.0.7)
conformanceVersion = 1.3.1.1
denormBehaviorIndependence = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
roundingModeIndependence = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
shaderSignedZeroInfNanPreserveFloat16 = true
shaderSignedZeroInfNanPreserveFloat32 = true
shaderSignedZeroInfNanPreserveFloat64 = true
shaderDenormPreserveFloat16 = false
shaderDenormPreserveFloat32 = false
shaderDenormPreserveFloat64 = false
shaderDenormFlushToZeroFloat16 = false
shaderDenormFlushToZeroFloat32 = false
shaderDenormFlushToZeroFloat64 = false
shaderRoundingModeRTEFloat16 = true
shaderRoundingModeRTEFloat32 = true
shaderRoundingModeRTEFloat64 = true
shaderRoundingModeRTZFloat16 = false
shaderRoundingModeRTZFloat32 = false
shaderRoundingModeRTZFloat64 = false
maxUpdateAfterBindDescriptorsInAllPools = 4294967295
shaderUniformBufferArrayNonUniformIndexingNative = true
shaderSampledImageArrayNonUniformIndexingNative = true
shaderStorageBufferArrayNonUniformIndexingNative = true
shaderStorageImageArrayNonUniformIndexingNative = true
shaderInputAttachmentArrayNonUniformIndexingNative = true
robustBufferAccessUpdateAfterBind = true
quadDivergentImplicitLod = true
maxPerStageDescriptorUpdateAfterBindSamplers = 1000000
maxPerStageDescriptorUpdateAfterBindUniformBuffers = 1000000
maxPerStageDescriptorUpdateAfterBindStorageBuffers = 1000000
maxPerStageDescriptorUpdateAfterBindSampledImages = 1000000
maxPerStageDescriptorUpdateAfterBindStorageImages = 1000000
maxPerStageDescriptorUpdateAfterBindInputAttachments = 1000000
maxPerStageUpdateAfterBindResources = 1000000
maxDescriptorSetUpdateAfterBindSamplers = 1000000
maxDescriptorSetUpdateAfterBindUniformBuffers = 1000000
maxDescriptorSetUpdateAfterBindUniformBuffersDynamic = 1000000
maxDescriptorSetUpdateAfterBindStorageBuffers = 1000000
maxDescriptorSetUpdateAfterBindStorageBuffersDynamic = 1000000
maxDescriptorSetUpdateAfterBindSampledImages = 1000000
maxDescriptorSetUpdateAfterBindStorageImages = 1000000
maxDescriptorSetUpdateAfterBindInputAttachments = 1000000
supportedDepthResolveModes: count = 2
RESOLVE_MODE_SAMPLE_ZERO_BIT
RESOLVE_MODE_AVERAGE_BIT
supportedStencilResolveModes: count = 1
RESOLVE_MODE_SAMPLE_ZERO_BIT
independentResolveNone = false
independentResolve = false
filterMinmaxSingleComponentFormats = true
filterMinmaxImageComponentMapping = true
maxTimelineSemaphoreValueDifference = 18446744073709551615
framebufferIntegerColorSampleCounts: count = 1
SAMPLE_COUNT_1_BIT
VkPhysicalDeviceVulkan13Properties:
-----------------------------------
minSubgroupSize = 4
maxSubgroupSize = 4
maxComputeWorkgroupSubgroups = 32
requiredSubgroupSizeStages: count = 4
SHADER_STAGE_FRAGMENT_BIT
SHADER_STAGE_COMPUTE_BIT
SHADER_STAGE_ALL_GRAPHICS
SHADER_STAGE_ALL
maxInlineUniformBlockSize = 4096
maxPerStageDescriptorInlineUniformBlocks = 8
maxPerStageDescriptorUpdateAfterBindInlineUniformBlocks = 8
maxDescriptorSetInlineUniformBlocks = 8
maxDescriptorSetUpdateAfterBindInlineUniformBlocks = 8
maxInlineUniformTotalSize = 262144
integerDotProduct8BitUnsignedAccelerated = false
integerDotProduct8BitSignedAccelerated = false
integerDotProduct8BitMixedSignednessAccelerated = false
integerDotProduct4x8BitPackedUnsignedAccelerated = false
integerDotProduct4x8BitPackedSignedAccelerated = false
integerDotProduct4x8BitPackedMixedSignednessAccelerated = false
integerDotProduct16BitUnsignedAccelerated = false
integerDotProduct16BitSignedAccelerated = false
integerDotProduct16BitMixedSignednessAccelerated = false
integerDotProduct32BitUnsignedAccelerated = false
integerDotProduct32BitSignedAccelerated = false
integerDotProduct32BitMixedSignednessAccelerated = false
integerDotProduct64BitUnsignedAccelerated = false
integerDotProduct64BitSignedAccelerated = false
integerDotProduct64BitMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating8BitUnsignedAccelerated = false
integerDotProductAccumulatingSaturating8BitSignedAccelerated = false
integerDotProductAccumulatingSaturating8BitMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating4x8BitPackedUnsignedAccelerated = false
integerDotProductAccumulatingSaturating4x8BitPackedSignedAccelerated = false
integerDotProductAccumulatingSaturating4x8BitPackedMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating16BitUnsignedAccelerated = false
integerDotProductAccumulatingSaturating16BitSignedAccelerated = false
integerDotProductAccumulatingSaturating16BitMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating32BitUnsignedAccelerated = false
integerDotProductAccumulatingSaturating32BitSignedAccelerated = false
integerDotProductAccumulatingSaturating32BitMixedSignednessAccelerated = false
integerDotProductAccumulatingSaturating64BitUnsignedAccelerated = false
integerDotProductAccumulatingSaturating64BitSignedAccelerated = false
integerDotProductAccumulatingSaturating64BitMixedSignednessAccelerated = false
storageTexelBufferOffsetAlignmentBytes = 0x00000010
storageTexelBufferOffsetSingleTexelAlignment = true
uniformTexelBufferOffsetAlignmentBytes = 0x00000010
uniformTexelBufferOffsetSingleTexelAlignment = true
maxBufferSize = 0xffffffff
Device Extensions: count = 114
VK_ARM_rasterization_order_attachment_access : extension revision 1
VK_EXT_4444_formats : extension revision 1
VK_EXT_attachment_feedback_loop_dynamic_state : extension revision 1
VK_EXT_attachment_feedback_loop_layout : extension revision 2
VK_EXT_border_color_swizzle : extension revision 1
VK_EXT_calibrated_timestamps : extension revision 2
VK_EXT_color_write_enable : extension revision 1
VK_EXT_conditional_rendering : extension revision 2
VK_EXT_custom_border_color : extension revision 12
VK_EXT_depth_clip_control : extension revision 1
VK_EXT_depth_clip_enable : extension revision 1
VK_EXT_depth_range_unrestricted : extension revision 1
VK_EXT_descriptor_buffer : extension revision 1
VK_EXT_descriptor_indexing : extension revision 2
VK_EXT_dynamic_rendering_unused_attachments : extension revision 1
VK_EXT_extended_dynamic_state : extension revision 1
VK_EXT_extended_dynamic_state2 : extension revision 1
VK_EXT_extended_dynamic_state3 : extension revision 2
VK_EXT_external_memory_host : extension revision 1
VK_EXT_graphics_pipeline_library : extension revision 1
VK_EXT_host_query_reset : extension revision 1
VK_EXT_image_2d_view_of_3d : extension revision 1
VK_EXT_image_robustness : extension revision 1
VK_EXT_image_sliced_view_of_3d : extension revision 1
VK_EXT_index_type_uint8 : extension revision 1
VK_EXT_inline_uniform_block : extension revision 1
VK_EXT_line_rasterization : extension revision 1
VK_EXT_memory_budget : extension revision 1
VK_EXT_memory_priority : extension revision 1
VK_EXT_mesh_shader : extension revision 1
VK_EXT_multi_draw : extension revision 1
VK_EXT_multisampled_render_to_single_sampled : extension revision 1
VK_EXT_mutable_descriptor_type : extension revision 1
VK_EXT_non_seamless_cube_map : extension revision 1
VK_EXT_pageable_device_local_memory : extension revision 1
VK_EXT_pipeline_creation_cache_control : extension revision 3
VK_EXT_pipeline_creation_feedback : extension revision 1
VK_EXT_post_depth_coverage : extension revision 1
VK_EXT_primitive_topology_list_restart : extension revision 1
VK_EXT_primitives_generated_query : extension revision 1
VK_EXT_private_data : extension revision 1
VK_EXT_provoking_vertex : extension revision 1
VK_EXT_rasterization_order_attachment_access : extension revision 1
VK_EXT_robustness2 : extension revision 1
VK_EXT_sampler_filter_minmax : extension revision 2
VK_EXT_scalar_block_layout : extension revision 1
VK_EXT_separate_stencil_usage : extension revision 1
VK_EXT_shader_atomic_float : extension revision 1
VK_EXT_shader_atomic_float2 : extension revision 1
VK_EXT_shader_demote_to_helper_invocation : extension revision 1
VK_EXT_shader_object : extension revision 1
VK_EXT_shader_stencil_export : extension revision 1
VK_EXT_shader_subgroup_ballot : extension revision 1
VK_EXT_shader_subgroup_vote : extension revision 1
VK_EXT_shader_viewport_index_layer : extension revision 1
VK_EXT_subgroup_size_control : extension revision 2
VK_EXT_texel_buffer_alignment : extension revision 1
VK_EXT_transform_feedback : extension revision 1
VK_EXT_vertex_attribute_divisor : extension revision 3
VK_EXT_vertex_input_dynamic_state : extension revision 2
VK_GOOGLE_decorate_string : extension revision 1
VK_GOOGLE_hlsl_functionality1 : extension revision 1
VK_KHR_16bit_storage : extension revision 1
VK_KHR_8bit_storage : extension revision 1
VK_KHR_bind_memory2 : extension revision 1
VK_KHR_buffer_device_address : extension revision 1
VK_KHR_copy_commands2 : extension revision 1
VK_KHR_create_renderpass2 : extension revision 1
VK_KHR_dedicated_allocation : extension revision 3
VK_KHR_depth_stencil_resolve : extension revision 1
VK_KHR_descriptor_update_template : extension revision 1
VK_KHR_device_group : extension revision 4
VK_KHR_draw_indirect_count : extension revision 1
VK_KHR_driver_properties : extension revision 1
VK_KHR_dynamic_rendering : extension revision 1
VK_KHR_external_fence : extension revision 1
VK_KHR_external_memory : extension revision 1
VK_KHR_external_memory_fd : extension revision 1
VK_KHR_external_semaphore : extension revision 1
VK_KHR_format_feature_flags2 : extension revision 2
VK_KHR_get_memory_requirements2 : extension revision 1
VK_KHR_image_format_list : extension revision 1
VK_KHR_imageless_framebuffer : extension revision 1
VK_KHR_incremental_present : extension revision 2
VK_KHR_maintenance1 : extension revision 2
VK_KHR_maintenance2 : extension revision 1
VK_KHR_maintenance3 : extension revision 1
VK_KHR_maintenance4 : extension revision 2
VK_KHR_multiview : extension revision 1
VK_KHR_pipeline_library : extension revision 1
VK_KHR_push_descriptor : extension revision 2
VK_KHR_relaxed_block_layout : extension revision 1
VK_KHR_sampler_mirror_clamp_to_edge : extension revision 3
VK_KHR_separate_depth_stencil_layouts : extension revision 1
VK_KHR_shader_atomic_int64 : extension revision 1
VK_KHR_shader_clock : extension revision 1
VK_KHR_shader_draw_parameters : extension revision 1
VK_KHR_shader_float16_int8 : extension revision 1
VK_KHR_shader_float_controls : extension revision 4
VK_KHR_shader_integer_dot_product : extension revision 1
VK_KHR_shader_non_semantic_info : extension revision 1
VK_KHR_shader_subgroup_extended_types : extension revision 1
VK_KHR_shader_terminate_invocation : extension revision 1
VK_KHR_spirv_1_4 : extension revision 1
VK_KHR_storage_buffer_storage_class : extension revision 1
VK_KHR_swapchain : extension revision 70
VK_KHR_swapchain_mutable_format : extension revision 1
VK_KHR_synchronization2 : extension revision 1
VK_KHR_timeline_semaphore : extension revision 2
VK_KHR_uniform_buffer_standard_layout : extension revision 1
VK_KHR_variable_pointers : extension revision 1
VK_KHR_vulkan_memory_model : extension revision 3
VK_KHR_zero_initialize_workgroup_memory : extension revision 1
VK_NV_device_generated_commands : extension revision 3
VkQueueFamilyProperties:
========================
queueProperties[0]:
-------------------
minImageTransferGranularity = (1,1,1)
queueCount = 1
queueFlags = QUEUE_GRAPHICS | QUEUE_COMPUTE | QUEUE_TRANSFER
timestampValidBits = 64
present support = false
VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 1
memoryHeaps[0]:
size = 949276672 (0x3894d000) (905.30 MiB)
budget = 949276672 (0x3894d000) (905.30 MiB)
usage = 264392704 (0x0fc25000) (252.14 MiB)
flags: count = 1
MEMORY_HEAP_DEVICE_LOCAL_BIT
memoryTypes: count = 1
memoryTypes[0]:
heapIndex = 0
propertyFlags = 0x000f: count = 4
MEMORY_PROPERTY_DEVICE_LOCAL_BIT
MEMORY_PROPERTY_HOST_VISIBLE_BIT
MEMORY_PROPERTY_HOST_COHERENT_BIT
MEMORY_PROPERTY_HOST_CACHED_BIT
usable for:
IMAGE_TILING_OPTIMAL:
color images
FORMAT_D16_UNORM
FORMAT_X8_D24_UNORM_PACK32
FORMAT_D32_SFLOAT
FORMAT_S8_UINT
FORMAT_D24_UNORM_S8_UINT
FORMAT_D32_SFLOAT_S8_UINT
(non-sparse)
IMAGE_TILING_LINEAR:
color images
(non-sparse)
VkPhysicalDeviceFeatures:
=========================
robustBufferAccess = true
fullDrawIndexUint32 = true
imageCubeArray = true
independentBlend = true
geometryShader = true
tessellationShader = true
sampleRateShading = true
dualSrcBlend = true
logicOp = true
multiDrawIndirect = true
drawIndirectFirstInstance = true
depthClamp = true
depthBiasClamp = true
fillModeNonSolid = true
depthBounds = false
wideLines = true
largePoints = true
alphaToOne = true
multiViewport = true
samplerAnisotropy = true
textureCompressionETC2 = false
textureCompressionASTC_LDR = false
textureCompressionBC = true
occlusionQueryPrecise = true
pipelineStatisticsQuery = true
vertexPipelineStoresAndAtomics = true
fragmentStoresAndAtomics = true
shaderTessellationAndGeometryPointSize = true
shaderImageGatherExtended = true
shaderStorageImageExtendedFormats = true
shaderStorageImageMultisample = true
shaderStorageImageReadWithoutFormat = true
shaderStorageImageWriteWithoutFormat = true
shaderUniformBufferArrayDynamicIndexing = true
shaderSampledImageArrayDynamicIndexing = true
shaderStorageBufferArrayDynamicIndexing = true
shaderStorageImageArrayDynamicIndexing = true
shaderClipDistance = true
shaderCullDistance = true
shaderFloat64 = true
shaderInt64 = true
shaderInt16 = true
shaderResourceResidency = false
shaderResourceMinLod = false
sparseBinding = false
sparseResidencyBuffer = false
sparseResidencyImage2D = false
sparseResidencyImage3D = false
sparseResidency2Samples = false
sparseResidency4Samples = false
sparseResidency8Samples = false
sparseResidency16Samples = false
sparseResidencyAliased = false
variableMultisampleRate = false
inheritedQueries = false
VkPhysicalDevice16BitStorageFeatures:
-------------------------------------
storageBuffer16BitAccess = true
uniformAndStorageBuffer16BitAccess = true
storagePushConstant16 = true
storageInputOutput16 = false
VkPhysicalDevice4444FormatsFeaturesEXT:
---------------------------------------
formatA4R4G4B4 = true
formatA4B4G4R4 = true
VkPhysicalDevice8BitStorageFeatures:
------------------------------------
storageBuffer8BitAccess = true
uniformAndStorageBuffer8BitAccess = true
storagePushConstant8 = true
VkPhysicalDeviceBorderColorSwizzleFeaturesEXT:
----------------------------------------------
borderColorSwizzle = true
borderColorSwizzleFromImage = true
VkPhysicalDeviceBufferDeviceAddressFeatures:
--------------------------------------------
bufferDeviceAddress = true
bufferDeviceAddressCaptureReplay = false
bufferDeviceAddressMultiDevice = false
VkPhysicalDeviceColorWriteEnableFeaturesEXT:
--------------------------------------------
colorWriteEnable = true
VkPhysicalDeviceConditionalRenderingFeaturesEXT:
------------------------------------------------
conditionalRendering = true
inheritedConditionalRendering = false
VkPhysicalDeviceCustomBorderColorFeaturesEXT:
---------------------------------------------
customBorderColors = true
customBorderColorWithoutFormat = true
VkPhysicalDeviceDepthClipControlFeaturesEXT:
--------------------------------------------
depthClipControl = true
VkPhysicalDeviceDepthClipEnableFeaturesEXT:
-------------------------------------------
depthClipEnable = true
VkPhysicalDeviceDescriptorIndexingFeatures:
-------------------------------------------
shaderInputAttachmentArrayDynamicIndexing = true
shaderUniformTexelBufferArrayDynamicIndexing = true
shaderStorageTexelBufferArrayDynamicIndexing = true
shaderUniformBufferArrayNonUniformIndexing = true
shaderSampledImageArrayNonUniformIndexing = true
shaderStorageBufferArrayNonUniformIndexing = true
shaderStorageImageArrayNonUniformIndexing = true
shaderInputAttachmentArrayNonUniformIndexing = true
shaderUniformTexelBufferArrayNonUniformIndexing = true
shaderStorageTexelBufferArrayNonUniformIndexing = true
descriptorBindingUniformBufferUpdateAfterBind = true
descriptorBindingSampledImageUpdateAfterBind = true
descriptorBindingStorageImageUpdateAfterBind = true
descriptorBindingStorageBufferUpdateAfterBind = true
descriptorBindingUniformTexelBufferUpdateAfterBind = true
descriptorBindingStorageTexelBufferUpdateAfterBind = true
descriptorBindingUpdateUnusedWhilePending = true
descriptorBindingPartiallyBound = true
descriptorBindingVariableDescriptorCount = true
runtimeDescriptorArray = true
VkPhysicalDeviceDynamicRenderingFeatures:
-----------------------------------------
dynamicRendering = true
VkPhysicalDeviceExtendedDynamicState2FeaturesEXT:
-------------------------------------------------
extendedDynamicState2 = true
extendedDynamicState2LogicOp = true
extendedDynamicState2PatchControlPoints = true
VkPhysicalDeviceExtendedDynamicStateFeaturesEXT:
------------------------------------------------
extendedDynamicState = true
VkPhysicalDeviceHostQueryResetFeatures:
---------------------------------------
hostQueryReset = true
VkPhysicalDeviceImageRobustnessFeatures:
----------------------------------------
robustImageAccess = true
VkPhysicalDeviceImagelessFramebufferFeatures:
---------------------------------------------
imagelessFramebuffer = true
VkPhysicalDeviceIndexTypeUint8FeaturesEXT:
------------------------------------------
indexTypeUint8 = true
VkPhysicalDeviceInlineUniformBlockFeatures:
-------------------------------------------
inlineUniformBlock = true
descriptorBindingInlineUniformBlockUpdateAfterBind = true
VkPhysicalDeviceLineRasterizationFeaturesEXT:
---------------------------------------------
rectangularLines = true
bresenhamLines = true
smoothLines = true
stippledRectangularLines = true
stippledBresenhamLines = true
stippledSmoothLines = true
VkPhysicalDeviceMaintenance4Features:
-------------------------------------
maintenance4 = true
VkPhysicalDeviceMemoryPriorityFeaturesEXT:
------------------------------------------
memoryPriority = true
VkPhysicalDeviceMultiDrawFeaturesEXT:
-------------------------------------
multiDraw = true
VkPhysicalDeviceMultiviewFeatures:
----------------------------------
multiview = true
multiviewGeometryShader = true
multiviewTessellationShader = true
VkPhysicalDevicePageableDeviceLocalMemoryFeaturesEXT:
-----------------------------------------------------
pageableDeviceLocalMemory = true
VkPhysicalDevicePipelineCreationCacheControlFeatures:
-----------------------------------------------------
pipelineCreationCacheControl = true
VkPhysicalDevicePrimitiveTopologyListRestartFeaturesEXT:
--------------------------------------------------------
primitiveTopologyListRestart = true
primitiveTopologyPatchListRestart = true
VkPhysicalDevicePrivateDataFeatures:
------------------------------------
privateData = true
VkPhysicalDeviceProtectedMemoryFeatures:
----------------------------------------
protectedMemory = false
VkPhysicalDeviceProvokingVertexFeaturesEXT:
-------------------------------------------
provokingVertexLast = true
transformFeedbackPreservesProvokingVertex = true
VkPhysicalDeviceRobustness2FeaturesEXT:
---------------------------------------
robustBufferAccess2 = true
robustImageAccess2 = true
nullDescriptor = true
VkPhysicalDeviceSamplerYcbcrConversionFeatures:
-----------------------------------------------
samplerYcbcrConversion = false
VkPhysicalDeviceScalarBlockLayoutFeatures:
------------------------------------------
scalarBlockLayout = true
VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures:
----------------------------------------------------
separateDepthStencilLayouts = true
VkPhysicalDeviceShaderAtomicFloat2FeaturesEXT:
----------------------------------------------
shaderBufferFloat16Atomics = false
shaderBufferFloat16AtomicAdd = false
shaderBufferFloat16AtomicMinMax = false
shaderBufferFloat32AtomicMinMax = true
shaderBufferFloat64AtomicMinMax = false
shaderSharedFloat16Atomics = false
shaderSharedFloat16AtomicAdd = false
shaderSharedFloat16AtomicMinMax = false
shaderSharedFloat32AtomicMinMax = true
shaderSharedFloat64AtomicMinMax = false
shaderImageFloat32AtomicMinMax = true
sparseImageFloat32AtomicMinMax = false
VkPhysicalDeviceShaderAtomicFloatFeaturesEXT:
---------------------------------------------
shaderBufferFloat32Atomics = true
shaderBufferFloat32AtomicAdd = true
shaderBufferFloat64Atomics = false
shaderBufferFloat64AtomicAdd = false
shaderSharedFloat32Atomics = true
shaderSharedFloat32AtomicAdd = true
shaderSharedFloat64Atomics = false
shaderSharedFloat64AtomicAdd = false
shaderImageFloat32Atomics = true
shaderImageFloat32AtomicAdd = true
sparseImageFloat32Atomics = false
sparseImageFloat32AtomicAdd = false
VkPhysicalDeviceShaderAtomicInt64Features:
------------------------------------------
shaderBufferInt64Atomics = true
shaderSharedInt64Atomics = true
VkPhysicalDeviceShaderClockFeaturesKHR:
---------------------------------------
shaderSubgroupClock = true
shaderDeviceClock = true
VkPhysicalDeviceShaderDemoteToHelperInvocationFeatures:
-------------------------------------------------------
shaderDemoteToHelperInvocation = true
VkPhysicalDeviceShaderDrawParametersFeatures:
---------------------------------------------
shaderDrawParameters = true
VkPhysicalDeviceShaderFloat16Int8Features:
------------------------------------------
shaderFloat16 = true
shaderInt8 = true
VkPhysicalDeviceShaderIntegerDotProductFeatures:
------------------------------------------------
shaderIntegerDotProduct = true
VkPhysicalDeviceShaderSubgroupExtendedTypesFeatures:
----------------------------------------------------
shaderSubgroupExtendedTypes = true
VkPhysicalDeviceShaderTerminateInvocationFeatures:
--------------------------------------------------
shaderTerminateInvocation = true
VkPhysicalDeviceSubgroupSizeControlFeatures:
--------------------------------------------
subgroupSizeControl = true
computeFullSubgroups = true
VkPhysicalDeviceSynchronization2Features:
-----------------------------------------
synchronization2 = true
VkPhysicalDeviceTexelBufferAlignmentFeaturesEXT:
------------------------------------------------
texelBufferAlignment = true
VkPhysicalDeviceTextureCompressionASTCHDRFeatures:
--------------------------------------------------
textureCompressionASTC_HDR = false
VkPhysicalDeviceTimelineSemaphoreFeatures:
------------------------------------------
timelineSemaphore = true
VkPhysicalDeviceTransformFeedbackFeaturesEXT:
---------------------------------------------
transformFeedback = true
geometryStreams = true
VkPhysicalDeviceUniformBufferStandardLayoutFeatures:
----------------------------------------------------
uniformBufferStandardLayout = true
VkPhysicalDeviceVariablePointersFeatures:
-----------------------------------------
variablePointersStorageBuffer = true
variablePointers = true
VkPhysicalDeviceVertexAttributeDivisorFeaturesEXT:
--------------------------------------------------
vertexAttributeInstanceRateDivisor = true
vertexAttributeInstanceRateZeroDivisor = true
VkPhysicalDeviceVertexInputDynamicStateFeaturesEXT:
---------------------------------------------------
vertexInputDynamicState = true
VkPhysicalDeviceVulkan11Features:
---------------------------------
storageBuffer16BitAccess = true
uniformAndStorageBuffer16BitAccess = true
storagePushConstant16 = true
storageInputOutput16 = false
multiview = true
multiviewGeometryShader = true
multiviewTessellationShader = true
variablePointersStorageBuffer = true
variablePointers = true
protectedMemory = false
samplerYcbcrConversion = false
shaderDrawParameters = true
VkPhysicalDeviceVulkan12Features:
---------------------------------
samplerMirrorClampToEdge = true
drawIndirectCount = true
storageBuffer8BitAccess = true
uniformAndStorageBuffer8BitAccess = true
storagePushConstant8 = true
shaderBufferInt64Atomics = true
shaderSharedInt64Atomics = true
shaderFloat16 = true
shaderInt8 = true
descriptorIndexing = true
shaderInputAttachmentArrayDynamicIndexing = true
shaderUniformTexelBufferArrayDynamicIndexing = true
shaderStorageTexelBufferArrayDynamicIndexing = true
shaderUniformBufferArrayNonUniformIndexing = true
shaderSampledImageArrayNonUniformIndexing = true
shaderStorageBufferArrayNonUniformIndexing = true
shaderStorageImageArrayNonUniformIndexing = true
shaderInputAttachmentArrayNonUniformIndexing = true
shaderUniformTexelBufferArrayNonUniformIndexing = true
shaderStorageTexelBufferArrayNonUniformIndexing = true
descriptorBindingUniformBufferUpdateAfterBind = true
descriptorBindingSampledImageUpdateAfterBind = true
descriptorBindingStorageImageUpdateAfterBind = true
descriptorBindingStorageBufferUpdateAfterBind = true
descriptorBindingUniformTexelBufferUpdateAfterBind = true
descriptorBindingStorageTexelBufferUpdateAfterBind = true
descriptorBindingUpdateUnusedWhilePending = true
descriptorBindingPartiallyBound = true
descriptorBindingVariableDescriptorCount = true
runtimeDescriptorArray = true
samplerFilterMinmax = true
scalarBlockLayout = true
imagelessFramebuffer = true
uniformBufferStandardLayout = true
shaderSubgroupExtendedTypes = true
separateDepthStencilLayouts = true
hostQueryReset = true
timelineSemaphore = true
bufferDeviceAddress = true
bufferDeviceAddressCaptureReplay = false
bufferDeviceAddressMultiDevice = false
vulkanMemoryModel = true
vulkanMemoryModelDeviceScope = true
vulkanMemoryModelAvailabilityVisibilityChains = true
shaderOutputViewportIndex = true
shaderOutputLayer = true
subgroupBroadcastDynamicId = true
VkPhysicalDeviceVulkan13Features:
---------------------------------
robustImageAccess = true
inlineUniformBlock = true
descriptorBindingInlineUniformBlockUpdateAfterBind = true
pipelineCreationCacheControl = true
privateData = true
shaderDemoteToHelperInvocation = true
shaderTerminateInvocation = true
subgroupSizeControl = true
computeFullSubgroups = true
synchronization2 = true
textureCompressionASTC_HDR = false
shaderZeroInitializeWorkgroupMemory = true
dynamicRendering = true
shaderIntegerDotProduct = true
maintenance4 = true
VkPhysicalDeviceVulkanMemoryModelFeatures:
------------------------------------------
vulkanMemoryModel = true
vulkanMemoryModelDeviceScope = true
vulkanMemoryModelAvailabilityVisibilityChains = true
VkPhysicalDeviceZeroInitializeWorkgroupMemoryFeatures:
------------------------------------------------------
shaderZeroInitializeWorkgroupMemory = true
Nice, so the Raspberry Pi should be able to support it as well. I think I will need to adjust the workgroup size, to a maximum of 3 from what I see.
sudo nice -n -20 ./main inference --model ..dllama_original_q40.bin --tokenizer ..dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 1
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 32768 kB
⏩ Loaded 6175568 kB
Created Vulkan Instance!
Device Name: NVIDIA GeForce RTX 3060
API Version: 1.3.242
Memory Heaps: 2
Heap 0: 12288 MB
Heap 1: 24004 MB
Memory Types: 5
Type 0: 1
Type 1: 0 Device Local
Type 2: 1 Host Visible Host Coherent
Type 3: 1 Host Visible Host Coherent
Type 4: 0 Device Local Host Visible Host Coherent
Created pipeline F32_F32
Created pipeline Q40_Q80
🔶 G 1307 ms I 1307 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 1315 ms I 1315 ms T 0 ms S 0 kB R 0 kB world
🔶 G 1384 ms I 1382 ms T 0 ms S 0 kB R 0 kB OO
🔶 G 1520 ms I 1518 ms T 0 ms S 0 kB R 0 kB AAAAAAAA
🔶 G 1595 ms I 1591 ms T 0 ms S 0 kB R 0 kB gambar
🔶 G 1583 ms I 1580 ms T 0 ms S 0 kB R 0 kB and
🔶 G 1605 ms I 1600 ms T 0 ms S 0 kB R 0 kB HUD
🔶 G 1623 ms I 1618 ms T 0 ms S 0 kB R 0 kB Sm
🔶 G 1616 ms I 1612 ms T 0 ms S 0 kB R 0 kB AG
🔶 G 1561 ms I 1556 ms T 0 ms S 0 kB R 0 kB asz
🔶 G 1578 ms I 1573 ms T 0 ms S 0 kB R 0 kB asse
🔶 G 1598 ms I 1593 ms T 0 ms S 0 kB R 0 kB imagination
Not quite there yet it seems
It looks like the best approach would be to determine how many layers of weights can be loaded onto GPU memory and if a layer is on GPU memory then process it in Vulkan, else via CPU. Reason why I say this is that I'm pretty sure now that the process of loading the weights/input to GPU memory, doing the calculation and then getting the results back will in most cases be slower than just doing it on the CPU. Maybe for Llama 70B it might be faster though.
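The per-layer split described above could be sketched roughly like this. It is only an illustration: `countGpuLayers`, `runsOnGpu`, the reserve parameter and every size used with them are hypothetical names and numbers, not anything taken from the distributed-llama code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many whole layers fit into a GPU memory budget.
// `reserveBytes` keeps room for activations, staging buffers and the driver.
std::size_t countGpuLayers(std::uint64_t budgetBytes, std::uint64_t reserveBytes,
                           std::uint64_t bytesPerLayer, std::size_t nLayers) {
    assert(bytesPerLayer > 0);
    if (budgetBytes <= reserveBytes) return 0;
    const std::uint64_t fit = (budgetBytes - reserveBytes) / bytesPerLayer;
    return fit < nLayers ? static_cast<std::size_t>(fit) : nLayers;
}

// Layers [0, gpuLayers) would be processed in Vulkan, the rest on the CPU.
bool runsOnGpu(std::size_t layerIndex, std::size_t gpuLayers) {
    return layerIndex < gpuLayers;
}
```

With a made-up 4096 MB budget, a 1024 MB reserve and 200 MB per layer, `countGpuLayers` gives 15, so layers 0 to 14 would go to the GPU and the remaining ones would stay on the CPU path.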
This is what my compute shader for Q40Q80 looks like right now, best result so far for a 4096 x 4096 weight matrix and 1 x 4096 input matrix has been 2ms, which is about the same as the CPU matmul. That is without the overhead of setting up all the buffers, copying to/from GPU memory and dispatching the workload.
There is a lot I still have to figure out, making good headway though.
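For checking a Q40 x Q80 shader against the CPU, a small reference dot product can help. The sketch below is not the shader mentioned above and not the project's matmul. It assumes 32-element blocks, 4-bit weights stored with an offset of 8, and a nibble order where the low nibble is element i and the high nibble is element i + 16. The scales are plain floats to keep it short, whereas real Q40/Q80 blocks usually carry a 16-bit float scale. `BlockQ40`, `BlockQ80` and `dotQ40Q80` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 32;

// Simplified block layouts (assumption): float scales instead of 16-bit floats.
struct BlockQ40 { float d; std::uint8_t qs[kBlockSize / 2]; }; // two 4-bit values per byte
struct BlockQ80 { float d; std::int8_t qs[kBlockSize]; };      // signed 8-bit values

// Dot product of one quantized weight row with a quantized input vector.
float dotQ40Q80(const BlockQ40* w, const BlockQ80* x, std::size_t nBlocks) {
    assert(w != nullptr && x != nullptr);
    float sum = 0.0f;
    for (std::size_t b = 0; b < nBlocks; b++) {
        std::int32_t acc = 0;
        for (std::size_t i = 0; i < kBlockSize / 2; i++) {
            // Assumed nibble order: low nibble is element i, high nibble is element i + 16.
            const int lo = (w[b].qs[i] & 0x0F) - 8;
            const int hi = (w[b].qs[i] >> 4) - 8;
            acc += lo * x[b].qs[i] + hi * x[b].qs[i + kBlockSize / 2];
        }
        // Integer accumulation per block, scaled once by both block scales.
        sum += w[b].d * x[b].d * static_cast<float>(acc);
    }
    return sum;
}
```

Running one weight row through this and through the shader with identical inputs gives a single number to compare before looking at whole layers.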
Probably I'm doing something wrong, but I wanted to compare the performance of CPU with Vulkan on my Mac. Llama.cpp has already implemented it so:
CPU:
llama_print_timings: prompt eval time = 148.06 ms / 31 tokens ( 4.78 ms per token, 209.37 tokens per second)
llama_print_timings: eval time = 7150.30 ms / 254 runs ( 28.15 ms per token, 35.52 tokens per second)
Vulkan:
llama_print_timings: prompt eval time = 517.43 ms / 31 tokens ( 16.69 ms per token, 59.91 tokens per second)
llama_print_timings: eval time = 18112.97 ms / 255 runs ( 71.03 ms per token, 14.08 tokens per second)
Additionally I get some weird characters in the response, so maybe something is broken.
@DifferentialityDevelopment could you observe any speed up with Vulkan on llama.cpp?
Do you have one of those with the unified memory architecture?
For me Vulkan is much faster than CPU-only inference, as it makes use of my RTX 3060. Vulkan is almost as fast as the CUDA version of llama.cpp.
Also I know llama.cpp just had a patch that supposedly fixes some issues with Vulkan, so might be you just had an older version?
I've been having a rough time getting Vulkan to work properly and efficiently with distributed Llama. I'm not sure exactly what I'm doing wrong yet. The tests I've run indicate that the Vulkan inference functions are within the margin when compared to the CPU matmul functions. However, when I use the main inference loop, the results are significantly different when using the Vulkan compute shader that handles the QKV using a Vulkan implementation of matmulQ40_Q80. So, I'm not quite sure what's going on.
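One way to narrow down where the two paths diverge is to compare the CPU and GPU outputs of a single task element by element and report the first value outside a tolerance. This is a minimal sketch with hypothetical names (`Mismatch`, `firstMismatch`); the tolerances are placeholders and would need tuning for quantized data.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Mismatch { bool found; std::size_t index; float cpu; float gpu; };

// Returns the first element where |cpu - gpu| exceeds absTol + relTol * |cpu|.
Mismatch firstMismatch(const std::vector<float>& cpu, const std::vector<float>& gpu,
                       float absTol, float relTol) {
    assert(cpu.size() == gpu.size());
    for (std::size_t i = 0; i < cpu.size(); i++) {
        const float diff = std::fabs(cpu[i] - gpu[i]);
        // The negated comparison also reports NaN, which `diff > limit` would let through.
        if (!(diff <= absTol + relTol * std::fabs(cpu[i]))) {
            return {true, i, cpu[i], gpu[i]};
        }
    }
    return {false, 0, 0.0f, 0.0f};
}
```

Calling it on the output buffer after each offloaded step (for example after the QKV matmuls) would show whether the first bad value appears in the shader result itself or only later in the loop.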
My plan is to offload as many layers as possible to the GPU at startup (possibly configurable manually through a setting). During inference, it can then use the weights already in GPU memory. It will probably take me a month or more to get it working correctly. I will try to keep the Vulkan branch up to date with the main branch as much as possible so that it will be easier to merge when I eventually make a pull request.
It already offloads the layers to GPU memory on startup (sort of), but the computation results are still not correct.
Do you have one of those with the unified memory architecture?
Yeah, maybe CPU is too fast.
My plan is to offload as many layers as possible to the GPU at startup (possibly configurable manually through a setting).
This is how it works in llama.cpp, there is the -ngl argument.
can i help with testing this? i have some gpus i want to throw into the mix that are not ROCm/CUDA capable, plus i want to try my pi's.
After some time I made a little progress: I have a shader that is faster than a single M1 core:
// m1
CPU: 70 ms
GPU: 26 ms
Unfortunately the same shader on Raspberry Pi:
// raspberry pi 5
CPU: 58 ms
GPU: 1045 ms
🤯
The weird thing is that vulkaninfo --summary on Raspberry Pi 5 returns:
Devices:
========
GPU0:
apiVersion = 1.2.255
driverVersion = 23.2.1
vendorID = 0x14e4
deviceID = 0x55701c33
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = V3D 7.1.7
driverID = DRIVER_ID_MESA_V3DV
driverName = V3DV Mesa
driverInfo = Mesa 23.2.1-1~bpo12+rpt3
conformanceVersion = 1.3.6.1
deviceUUID = 5fd8106e-741a-cafa-e080-fdb16cf11a80
driverUUID = 1698c6ef-161f-3213-5159-557202953ee9
GPU1:
apiVersion = 1.3.255
driverVersion = 0.0.1
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 15.0.6, 128 bits)
driverID = DRIVER_ID_MESA_LLVMPIPE
driverName = llvmpipe
driverInfo = Mesa 23.2.1-1~bpo12+rpt3 (LLVM 15.0.6)
conformanceVersion = 1.3.1.1
deviceUUID = 6d657361-3233-2e32-2e31-2d317e627000
driverUUID = 6c6c766d-7069-7065-5555-494400000000
And I get the same speed on both devices.
Something I have learned is that just copying the data into Vulkan buffers isn't the whole picture: the data also has to be moved into GPU memory, which is where access is much faster. You first write the data to a staging buffer, and Vulkan then copies it from there to GPU memory; you can't copy it directly from host memory to GPU memory. Apparently the Vulkan Memory Allocator (VMA) can take over a lot of this and reduce the amount of boilerplate code needed. https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/choosing_memory_type.html#choosing_memory_type_usage
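Part of the staging step is picking the right memory type for each buffer: host-visible memory for the staging buffer and device-local memory for the destination. A sketch of the usual selection loop is below. To keep it buildable without the Vulkan SDK it uses local constants that mirror the VkMemoryPropertyFlagBits values instead of including vulkan.h, and `findMemoryType` is an illustrative name, not a Vulkan or VMA function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local copies of the VkMemoryPropertyFlagBits values used in this sketch.
constexpr std::uint32_t DEVICE_LOCAL  = 0x1;
constexpr std::uint32_t HOST_VISIBLE  = 0x2;
constexpr std::uint32_t HOST_COHERENT = 0x4;

// Returns the first memory type that is allowed by `typeBits` (the mask reported in
// VkMemoryRequirements::memoryTypeBits) and has all `required` flags, or -1 if none.
int findMemoryType(const std::vector<std::uint32_t>& typeFlags,
                   std::uint32_t typeBits, std::uint32_t required) {
    assert(typeFlags.size() <= 32); // Vulkan reports at most 32 memory types
    for (std::size_t i = 0; i < typeFlags.size(); i++) {
        const bool allowed = (typeBits & (1u << i)) != 0;
        if (allowed && (typeFlags[i] & required) == required) return static_cast<int>(i);
    }
    return -1;
}
```

Fed with the five memory types printed for the RTX 3060 earlier in the thread, this picks type 2 for a staging buffer (host visible and host coherent) and type 1 for a device-local buffer. The transfer between the two would then be a buffer copy recorded into a command buffer.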
The warp size, local/global group sizes, etc. are also very important for fully utilizing the GPU. I haven't had time to work on it further, well done on getting it to work!
Thinking about it, the Raspberry Pi has no dedicated GPU memory, it uses the system RAM. That kind of explains why your M1 was faster on the GPU: the GPU cores can process the data in parallel much more efficiently than the CPU can, and with the unified memory architecture both the GPU and the CPU have access to the memory at the same speed.
Still, I'm sure the Raspberry Pi's GPU should be able to do the computations faster than its CPU can, I just wonder how to make it happen. I don't have a Raspberry Pi to test with myself, but I'm soon going to be able to upgrade to a 4-node setup (PCs).
Some progress: 🫣
// raspberry pi 5
CPU: 51 ms
GPU: 303 ms
What size matrices are you testing it with?
n = 4096;
d = 14336;
This requires around 229448 kB in memory (total size of input, weights, output).
I'm trying to implement matrix x vector multiplication. The size is basically taken from Llama model.
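The 229448 kB figure works out exactly if all three buffers hold 32-bit floats. That element type is an assumption on my part (it is not stated above), and `matvecBytesF32` is just an illustrative name:

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a matrix x vector product when every buffer holds 32-bit floats
// (assumption): the input has n elements, the weights n * d, the output d.
std::uint64_t matvecBytesF32(std::uint64_t n, std::uint64_t d) {
    assert(n > 0 && d > 0);
    return (n + n * d + d) * sizeof(float);
}
```

For n = 4096 and d = 14336 this is (4096 + 58720256 + 14336) * 4 = 234954752 bytes, which is 229448 kB. The weights account for 229376 kB of that, the input for 16 kB and the output for 56 kB.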
https://github.com/LostRuins/koboldcpp/tree/318d5b87fc1602ef16d8271bfdd937ef416a8182/include/vulkan koboldcpp seems to work decently with windows. not sure how it differs from llama.cpp
https://github.com/Const-me/Cgml might offer some insight, although it is primarily for Direct3D
https://github.com/CNugteren/CLBlast seems to be the consensus on embedded and amd hardware
The reason why Vulkan is slow is here: https://github.com/Tencent/ncnn/issues/2435#issuecomment-1634521856
pi@raspberrypi:~/ncnn/benchmark $ ./benchncnn 10 4 0 0 -1 >> text.out
[0 V3D 7.1.7] queueC=0[1] queueG=0[1] queueT=0[1]
[0 V3D 7.1.7] bugsbn1=0 bugbilz=0 bugcopc=0 bugihfa=0
[0 V3D 7.1.7] fp16-p/s/u/a=1/1/1/0 int8-p/s/u/a=1/1/1/0
[0 V3D 7.1.7] subgroup=16 basic/vote/ballot/shuffle=1/0/0/0
[0 V3D 7.1.7] fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0
Vulkan drivers for the Raspberry Pi lack the arithmetic support for 16-bit floating point and 8-bit integers.
http://raspbian.raspberrypi.com/raspbian/pool/main/c/clblast/ https://forums.raspberrypi.com/viewtopic.php?t=11177 "What is hard-float?"
i asked about hard-float RPi earlier, not realizing that hf in this community means hugging face.
RPi in 32-bit/armhf might be more capable? https://cdimage.ubuntu.com/releases/22.04.4/release/ Preinstalled server image >> Raspberry Pi Generic (Hard-Float) preinstalled server image
https://launchpad.net/ubuntu/jammy/+source/clblast natively available in 64 and 32 bit distributions
https://en.wikipedia.org/wiki/ARM_architecture_family#Floating-point_(VFP):
"...armhf (ARM hard float) refers to the ARMv7 architecture including the additional VFP3-D16 floating-point hardware extension (and Thumb-2) above. Software packages and cross-compiler tools use the armhf vs. arm/armel suffixes to differentiate VFPv4-D16"
"Implemented on most Cortex-A8 and A9 ARMv7 processors. It is backward-compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3 has 32 64-bit FPU registers as standard, adds VCVT instructions to convert between scalar, float and double, adds immediate mode to VMOV such that constants can be loaded into FPU registers." "As above, but it has only 16 64-bit FPU registers. Implemented on Cortex-A5 and A7 processors in the case of an FPU without Neon."
https://en.wikipedia.org/wiki/IEEE_754 https://blog.tensorflow.org/2023/11/half-precision-inference-doubles-on-device-inference-performance.html https://github.com/petewarden/tensorflow_makefile/blob/master/tensorflow/core/framework/bfloat16.h
interesting comparison: https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
The first version is implemented, though it may require some adjustments. I cannot observe any acceleration on my Mac, but on strong GPUs it may be visible. I added a description of how to try to run it here.
Awesome! I have two PCs, one with an RTX 3060 and one with an RTX 2070 Super. Both have the same CPU (Ryzen 3600). They aren't as powerful as an RTX 3090/4090, but should still be much faster than the CPU, in theory at least. I will do a couple of tests for you on my setup and let you know about the results.
I had to adjust a couple of things to get it to compile on Windows.
Installing the Vulkan SDK on Windows sets the environment variable VK_SDK_PATH, but you still need to make use of it in the makefile so that #include <vulkan/vulkan.h> can correctly find the header files, and so that the linker can link to the Vulkan libraries. See my changes to the makefile below:
ifdef DLLAMA_VULKAN
    ifeq ($(OS),Windows_NT)
        LIBS += -L$(VK_SDK_PATH)\lib -lvulkan-1
        OBJS += accelerator-vulkan.o
        CXXFLAGS += -DDLLAMA_VULKAN -I$(VK_SDK_PATH)\include
    else
        LIBS += -lvulkan
        OBJS += accelerator-vulkan.o
        CXXFLAGS += -DDLLAMA_VULKAN
    endif

accelerator-vulkan.o: src/accelerator-vulkan.cpp
	$(CXX) $(CXXFLAGS) -c src/accelerator-vulkan.cpp -o accelerator-vulkan.o
accelerator-vulkan-test: src/accelerator-vulkan-test.cpp funcs utils quants accelerator-vulkan.o
	$(CXX) $(CXXFLAGS) src/accelerator-vulkan-test.cpp -o accelerator-vulkan-test funcs.o utils.o quants.o accelerator-vulkan.o $(LIBS)
endif
With that I'm able to compile it, though I haven't tested it yet. I'm going to do that in the morning.
Hi @b4rtaz
I was tinkering a bit over the weekend and figured it might be possible to create a version of worker/main that accelerates the inference by offloading some work to the GPU to handle.
I've never really worked with compute shaders, or Vulkan for that matter, but I put together a simple demo that successfully ran a compute shader using Vulkan. The compute shader currently just takes an input buffer and copies the data to an output buffer.
This is what I have so far: compute-shader-example.zip
My next step is to upgrade it to do a matmul on two matrices, do the same operation on the CPU, and compare the results. I'm hopeful that I could utilize the worker/root node's dedicated/integrated GPU to do the heavy lifting.
I'll do some experiments on my fork on integrating it once I have a matmul compute shader working and let you know how it goes.
Right now I just want to get something working where I give it two matrices and it computes the resulting matmul output.
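For the CPU side of that comparison, a plain reference matmul is enough. This is a minimal sketch that assumes row-major float matrices; `matmulCpu` is a hypothetical name, not a function from the attached zip or from distributed-llama.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C (m x n) = A (m x k) * B (k x n), all stored row-major.
std::vector<float> matmulCpu(const std::vector<float>& a, const std::vector<float>& b,
                             std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; i++) {
        for (std::size_t p = 0; p < k; p++) {
            const float aip = a[i * k + p];
            // The inner loop walks B and C row-wise, which is friendlier to the cache.
            for (std::size_t j = 0; j < n; j++) c[i * n + j] += aip * b[p * n + j];
        }
    }
    return c;
}
```

The result can be compared element by element against whatever the compute shader writes to its output buffer. A small tolerance is safer than exact equality, since the GPU may sum in a different order.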