Open magicse opened 1 year ago
Currently there is no Vulkan implementation of conv1d / deconv1d.
Thank you, @nihui.
Is there an example somewhere of a custom layer template that uses Vulkan?
Something like implement-custom-layer-step-by-step, but for Vulkan.
I want to try writing my own Conv1d layer with Vulkan support,
because without Vulkan my HiFi-GAN vocoder is quite slow: a 3-second vocal phrase takes 36 seconds to generate.
The same holds for the vocoders of both VITS and DiffSinger; in short, all TTS synthesis relies on this.
I created Convolution1D_vulkan.cpp:
#include "Convolution1D_vulkan.h"
#include "layer_shader_type.h"
#include "layer_type.h"
Convolution1D_vulkan::Convolution1D_vulkan()
{
one_blob_only = true;
support_vulkan = true;
support_image_storage = true;
pipeline_convolution1d = 0;
reshape_w = 0;
}
int Convolution1D_vulkan::create_pipeline(const Option& _opt)
{
...
}
int Convolution1D_vulkan::destroy_pipeline(const Option&)
{
//
}
int Convolution1D_vulkan::upload_model(VkTransfer& cmd, const Option& opt)
{
....
}
int Convolution1D_vulkan::forward(const VkMat& bottom_blob, VkMat& top_blob, VkCompute& cmd, const Option& opt) const
{
...
}
Convolution1D_vulkan.h
All needed implementations
Main.cpp
#include "Convolution1D_vulkan.h"
DEFINE_LAYER_CREATOR(Convolution1D_vulkan)
...
ncnn::Net HIFIVOICE;
HIFIVOICE.register_custom_layer("Convolution1D_vulkan", Convolution1D_vulkan_layer_creator);
Everything compiled fine. I also have convolution1d.comp, from which I had to generate convolution1d.text2hex.txt and convolution1d.hex.h. As far as I can see, the native ncnn Vulkan shaders are selected through indexes:
int shader_type_index = -1;
if (elempack == 1 && out_elempack == 1) shader_type_index = LayerShaderType::convolution;
if (elempack == 4 && out_elempack == 4) shader_type_index = LayerShaderType::convolution_pack4;
pipeline_convolution1d = new Pipeline(vkdev);
pipeline_convolution1d->set_optimal_local_size_xyz(local_size_xyz);
pipeline_convolution1d->create(shader_type_index, opt, specializations);
But I don't know how to implement this in my custom layer without layer_shader_type.h and layer_shader_type_enum.h.
I found a way to do it:
static std::vector<uint32_t> spirv;
static ncnn::Mutex lock;
{
ncnn::MutexLockGuard guard(lock);
if (spirv.empty())
{
compile_spirv_module(convolution1d_comp_data, sizeof(convolution1d_comp_data), opt, spirv);
}
}
std::vector<vk_specialization_type> specializations(7 + 10);
specializations[0].i = kernel_w;
specializations[1].i = dilation_w;
specializations[2].i = stride_w;
specializations[3].i = bias_term;
specializations[4].i = activation_type;
specializations[5].f = activation_params.w >= 1 ? activation_params[0] : 0.f;
specializations[6].f = activation_params.w == 2 ? activation_params[1] : 0.f;
specializations[7 + 0].i = shape_bordered_packed.dims;
specializations[7 + 1].i = shape_bordered_packed.w;
specializations[7 + 2].i = shape_bordered_packed.h;
specializations[7 + 3].i = shape_bordered_packed.c;
specializations[7 + 4].i = shape_bordered_packed.cstep;
specializations[7 + 5].i = out_shape_packed.dims;
specializations[7 + 6].i = out_shape_packed.w;
specializations[7 + 7].i = out_shape_packed.h;
specializations[7 + 8].i = out_shape_packed.c;
specializations[7 + 9].i = out_shape_packed.cstep;
Mat local_size_xyz(8, 8, std::min(4, (num_output / out_elempack + 1) / 2), (void*)0);
if (out_shape_packed.dims != 0)
{
local_size_xyz.w = std::min(8, out_shape_packed.w);
local_size_xyz.h = std::min(8, out_shape_packed.h);
local_size_xyz.c = std::min(4, (out_shape_packed.c + 1) / 2);
}
pipeline_convolution1d = new Pipeline(vkdev);
pipeline_convolution1d->set_optimal_local_size_xyz(local_size_xyz);
pipeline_convolution1d->create(spirv.data(), spirv.size() * 4, specializations);
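The snippet above compiles the shader once and reuses the result for every layer instance. A minimal, ncnn-free sketch of that compile-once pattern follows. `fake_compile` and `get_cached_spirv` are hypothetical names used only for illustration, and `std::mutex` stands in for `ncnn::Mutex`.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a real GLSL-to-SPIR-V compiler call. It only
// counts invocations and returns a dummy module so the sketch runs anywhere.
static int g_compile_calls = 0;

static std::vector<uint32_t> fake_compile()
{
    ++g_compile_calls;
    return std::vector<uint32_t>{0x07230203u, 0u, 0u}; // SPIR-V magic number plus padding words
}

// Returns the cached module, compiling it on the first call only.
// The mutex makes the "check empty, then fill" sequence safe across threads.
const std::vector<uint32_t>& get_cached_spirv()
{
    static std::vector<uint32_t> spirv;
    static std::mutex lock;

    std::lock_guard<std::mutex> guard(lock);
    if (spirv.empty())
    {
        spirv = fake_compile();
    }
    return spirv;
}
```

The module is a vector of 32-bit words, which is why the byte size passed to the pipeline is `spirv.size() * 4`.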
I have only one question. In the shaders I saw things like "afp v0", sfp, "psc(c)" and "afpvec4". I don't understand what these types are: afp, psc, sfp, afpvec4? I couldn't find any information about them.
| GLSL data type | C data type | Description |
| --- | --- | --- |
| bool | int | A conditional type, taking on values of true or false. |
| int | int | Signed integer. |
| float | float | Single floating-point scalar. |
| vec2 | float [2] | Two component floating-point vector. |
| vec3 | float [3] | Three component floating-point vector. |
| vec4 | float [4] | Four component floating-point vector. |
| bvec2 | int [2] | Two component Boolean vector. |
| bvec3 | int [3] | Three component Boolean vector. |
| bvec4 | int [4] | Four component Boolean vector. |
| ivec2 | int [2] | Two component signed integer vector. |
| ivec3 | int [3] | Three component signed integer vector. |
| ivec4 | int [4] | Four component signed integer vector. |
| mat2 | float [4] | 2×2 floating-point matrix. |
| mat3 | float [9] | 3×3 floating-point matrix. |
| mat4 | float [16] | 4×4 floating-point matrix. |
| sampler1D | int | Handle for accessing a 1D texture. |
| sampler2D | int | Handle for accessing a 2D texture. |
| sampler3D | int | Handle for accessing a 3D texture. |
| samplerCube | int | Handle for accessing a cubemap texture. |
| sampler1DShadow | int | A handle for accessing a 1D depth texture with comparison. |
| sampler2DShadow | int | A handle for accessing a 2D depth texture with comparison. |
under construction ... https://github.com/nihui/ncnn/blob/doc-glsl-ext/docs/developer-guide/glsl-extension.md
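Until that guide is finished, here is my understanding, which should be checked against the linked document: these names are not standard GLSL but macros that ncnn defines when it compiles a shader. `sfp` is the type used for values stored in buffers, `afp` is the type used for arithmetic, `afpvec4` is the 4-wide arithmetic vector, and `psc(x)` picks the specialization constant `x` when it is non-zero and falls back to the push constant `p.x` otherwise. A plain C++ analogue of that selection, with the hypothetical name `select_shape_value`:

```cpp
// Plain C++ analogue of the selection that psc(x) appears to perform in the shaders.
// A shape value baked in at pipeline creation (specialization constant) wins when it
// is non-zero; otherwise the value supplied at dispatch time (push constant) is used.
inline int select_shape_value(int specialization_value, int push_constant_value)
{
    return specialization_value == 0 ? push_constant_value : specialization_value;
}
```

For example, `select_shape_value(0, 677)` yields the dispatch-time width 677, while `select_shape_value(64, 677)` keeps the baked-in 64.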
Hi @nihui, thank you for the link and the help. I am now trying to write the conv1d shader. Maybe it will be ready soon )))
Hi, if you use QQ you can join the ncnn QQ group (see the ncnn README), through which I can provide more timely help.
Work in progress
convolution1d.comp for kernel_w > 1 and elempack 1
#version 450
#if NCNN_fp16_storage
#extension GL_EXT_shader_16bit_storage: require
#endif
#if NCNN_fp16_arithmetic
#extension GL_EXT_shader_explicit_arithmetic_types_float16: require
#endif
#extension GL_EXT_debug_printf : enable
#extension GL_GOOGLE_include_directive: enable
#include "vulkan_activation.comp"
layout (constant_id = 0) const int kernel_w = 1;
layout (constant_id = 1) const int dilation_w = 1;
layout (constant_id = 2) const int stride_w = 1;
layout (constant_id = 3) const int bias_term = 0;
layout (constant_id = 4) const int activation_type = 0;
layout (constant_id = 5) const float activation_param_0 = 0;
layout (constant_id = 6) const float activation_param_1 = 0;
#define shape_constant_id_offset 7
layout (constant_id = shape_constant_id_offset + 0) const int dims = 0;
layout (constant_id = shape_constant_id_offset + 1) const int w = 0;
layout (constant_id = shape_constant_id_offset + 2) const int h = 0;
layout (constant_id = shape_constant_id_offset + 3) const int c = 0;
layout (constant_id = shape_constant_id_offset + 4) const int cstep = 0;
layout (constant_id = shape_constant_id_offset + 5) const int outdims = 0;
layout (constant_id = shape_constant_id_offset + 6) const int outw = 0;
layout (constant_id = shape_constant_id_offset + 7) const int outh = 0;
layout (constant_id = shape_constant_id_offset + 8) const int outc = 0;
layout (constant_id = shape_constant_id_offset + 9) const int outcstep = 0;
#if NCNN_image_shader
layout (binding = 0) uniform unfp sampler2D bottom_blob;
layout (binding = 1, imfmtc1) writeonly uniform unfp image2D top_blob;
layout (binding = 2) uniform unfp sampler3D weight_blob;
layout (binding = 3) uniform unfp sampler3D bias_blob;
#else
layout (binding = 0) readonly buffer bottom_blob { sfp bottom_blob_data[]; };
layout (binding = 1) writeonly buffer top_blob { sfp top_blob_data[]; };
layout (binding = 2) readonly buffer weight_blob { sfp weight_data[]; };
layout (binding = 3) readonly buffer bias_blob { sfp bias_data[]; };
#endif
layout (push_constant) uniform parameter
{
int dims;
int w;
int h;
int c;
int cstep;
int outdims;
int outw;
int outh;
int outc;
int outcstep;
} p;
void print_bottom_blob()
{
int gx = int(gl_GlobalInvocationID.x);
int gy = int(gl_GlobalInvocationID.y);
int gz = int(gl_GlobalInvocationID.z);
if (gx >= 1 || gy >= 1)
return;
debugPrintfEXT("Hello %i, %i\n", gx, gy);
for (int i = 0; i < psc(w); ++i) {
for (int j = 0; j < psc(h); ++j) {
debugPrintfEXT("Elem %d %d: %f ", i, j, bottom_blob_data[i*psc(h)+j]);
}
debugPrintfEXT("\n");
}
}
void main()
{
int gx = int(gl_GlobalInvocationID.x) * 2;
int gy = int(gl_GlobalInvocationID.y) * 2;
int gz = int(gl_GlobalInvocationID.z) * 2;
//print_bottom_blob();
if (gx >= psc(outw) || gy >= psc(outh) || gz >= psc(outc))
return;
const ivec2 gx2 = gx + ivec2(0, 1);
const ivec2 gy2 = gy + ivec2(0, 1);
afp sum0 = afp(0.0f);
afp sum1 = afp(0.0f);
afp sum2 = afp(0.0f);
afp sum3 = afp(0.0f);
if (bias_term == 1)
{
#if NCNN_image_shader
//sum = image2d_ld1(bias_blob, ivec2(gx, 0));
#else
sum0 = buffer_ld1(bias_data, gy2.x);
sum2 = buffer_ld1(bias_data, gy2.y);
sum1 = sum0;
sum3 = sum2;
#endif
}
#if NCNN_image_shader
//
#else
ivec2 w_offsetv = kernel_w * psc(h) * gy2; // weight offset
for (int iny = 0; iny < psc(h); iny++)
{
ivec2 v_offsetv = iny * psc(w) + gx2 * stride_w; // value offset
for (int x = 0; x < kernel_w; x++)
{
afp v0 = buffer_ld1(bottom_blob_data, v_offsetv.x + x * dilation_w); // Load the value +0
afp v1 = buffer_ld1(bottom_blob_data, v_offsetv.y + x * dilation_w); // Load the value +1
afp k0 = buffer_ld1(weight_data, w_offsetv.x + x); // Load the weight value +0
afp k1 = buffer_ld1(weight_data, w_offsetv.y + x); // Load the weight value +1
sum0 += v0 * k0;
sum1 += v1 * k0;
sum2 += v0 * k1;
sum3 += v1 * k1;
}
w_offsetv += kernel_w; // Move to the next set of weights
}
#endif
sum0 = activation_afp(sum0, activation_type, activation_param_0, activation_param_1);
sum1 = activation_afp(sum1, activation_type, activation_param_0, activation_param_1);
sum2 = activation_afp(sum2, activation_type, activation_param_0, activation_param_1);
sum3 = activation_afp(sum3, activation_type, activation_param_0, activation_param_1);
#if NCNN_image_shader
//image2d_st1(top_blob, ivec3(gx2.x, gy2.x, gz2.x), sum0);
//image2d_st1(top_blob, ivec3(gx2.y, gy2.x, gz2.x), sum1);
//image2d_st1(top_blob, ivec3(gx2.x, gy2.y, gz2.x), sum2);
//image2d_st1(top_blob, ivec3(gx2.y, gy2.y, gz2.x), sum3);
#else
// gx < outw and gy < outh are guaranteed by the early return above, so only the +1 neighbours need a bounds check
buffer_st1(top_blob_data, gy2.x * psc(outw) + gx2.x, sum0);
if (gx + 1 < psc(outw)) buffer_st1(top_blob_data, gy2.x * psc(outw) + gx2.y, sum1);
if (gy + 1 < psc(outh)) buffer_st1(top_blob_data, gy2.y * psc(outw) + gx2.x, sum2);
if (gy + 1 < psc(outh) && gx + 1 < psc(outw)) buffer_st1(top_blob_data, gy2.y * psc(outw) + gx2.y, sum3);
#endif
}
My convolution1d.comp for kernel_w > 1 and elempack 1 (unpacked float32) works and produces correct results.
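One way to check a shader like this is to compare it against a plain CPU loop with the same memory layout. The sketch below is my own reference, not ncnn code. It assumes the input is already padded, that `w` is at least the dilated kernel extent, and that weights are ordered with the output channel outermost, matching the offsets used in the unpacked shader. The name `conv1d_reference` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for a 1D convolution over an already padded input.
// Layouts follow the indexing used in the unpacked shader:
//   input  : h rows (input channels) of w samples, row-major
//   weight : outh * h * kernel_w values, output channel outermost
//   output : outh rows of outw samples, row-major
std::vector<float> conv1d_reference(const std::vector<float>& input, int w, int h,
                                    const std::vector<float>& weight,
                                    const std::vector<float>& bias, // empty means no bias
                                    int outh, int kernel_w, int dilation_w, int stride_w)
{
    const int kernel_extent = dilation_w * (kernel_w - 1) + 1;
    const int outw = (w - kernel_extent) / stride_w + 1;

    std::vector<float> output(static_cast<std::size_t>(outw) * outh, 0.f);
    for (int oc = 0; oc < outh; oc++)
    {
        for (int ox = 0; ox < outw; ox++)
        {
            float sum = bias.empty() ? 0.f : bias[oc];
            for (int ic = 0; ic < h; ic++)
            {
                for (int k = 0; k < kernel_w; k++)
                {
                    const float v = input[ic * w + ox * stride_w + k * dilation_w];
                    const float wt = weight[(oc * h + ic) * kernel_w + k];
                    sum += v * wt;
                }
            }
            output[oc * outw + ox] = sum;
        }
    }
    return output;
}
```

Feeding the same input and weights to the GPU layer and to this loop, then comparing element by element, shows quickly whether an offset or a store guard is wrong.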
Convolution1D_vulkan.cpp is set up as a custom layer, as shown below,
and everything works correctly.
But I have one problem: Convolution1D_vulkan::create_pipeline(const Option& _opt)
receives all parameters correctly except bottom_shapes and top_shapes, which are always empty.
Main.cpp
#include "Convolution1D_vulkan.h"
DEFINE_LAYER_CREATOR(Convolution1D_vulkan)
....
ncnn::Net HIFIVOICE;
HIFIVOICE.opt.use_fp16_packed = false;
HIFIVOICE.opt.use_fp16_storage = false;
HIFIVOICE.opt.use_fp16_arithmetic = false;
HIFIVOICE.opt.use_int8_storage = false;
HIFIVOICE.opt.use_int8_arithmetic = false;
HIFIVOICE.opt.use_int8_packed = false;
HIFIVOICE.opt.use_vulkan_compute = true;
HIFIVOICE.register_custom_layer("Convolution1D_vulkan", Convolution1D_vulkan_layer_creator);
....
int Convolution1D_vulkan::create_pipeline(const Option& _opt)
{
std::cout << "=== Create Pipeline: ===" << std::endl;
if (dynamic_weight)
{
support_vulkan = false;
support_image_storage = false;
return 0;
}
// Create a convolution pipeline using Vulkan
Option opt = _opt;
const Mat& shape = bottom_shapes.empty() ? Mat() : bottom_shapes[0];
const Mat& out_shape = top_shapes.empty() ? Mat() : top_shapes[0];
std::cout << "=== Create Pipeline: ===" << std::endl;
std::cout << "=== Create Pipeline: New shape from bottom_shapes WxHxC x Dims ===" << shape.w << " " << shape.h << " " << shape.c << " " << shape.d <<std::endl;
std::cout << "=== Create Pipeline: New out_shape from top_shapes WxHxC x Dims ===" << out_shape.w << " " << out_shape.h << " " << out_shape.c << " " << out_shape.d <<std::endl;
Output
=== Create Pipeline: ===
=== Create Pipeline: New shape from bottom_shapes WxHxC x Dims ===0 0 0 0
=== Create Pipeline: New out_shape from top_shapes WxHxC x Dims ===0 0 0 0
=== Create Pipeline: padding pipeline pad_left pad_right : ===3 3
=== Create Pipeline: data_packed.create maxk : ===7
=== Create Pipeline: data_packed.create num_input : ===5120
=== Create Pipeline: data_packed.create elempack : ===4
=== Create Pipeline: data_packed.create num_input / elempack : ===1280
=== Create Pipeline: data_packed.create num_output : ===8
=== Create Pipeline: data_packed.create out_elempack : ===4
=== Create Pipeline: data_packed.create num_output / out_elempack : ===2
=== Create Pipeline: data_packed.create (size_t)4 * elempack * out_elempack : ===64
=== Create Pipeline: data_packed.create elempack * out_elempack : ===16
=== Create Pipeline: weight_data WxHxC : === 286720 x 1 x 1
=== Create Pipeline: weight_data WxHxC reshaped : === 7 x 5120 x 8
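The sizes printed in this log follow directly from the layer parameters. A small sketch of that arithmetic, using the hypothetical names `Conv1DSizes` and `conv1d_sizes`, and assuming 4 bytes per fp32 value:

```cpp
// Derived sizes for a Convolution1D layer, as printed in the log.
struct Conv1DSizes
{
    int weight_count;      // kernel_w * num_input * num_output
    int packed_inputs;     // num_input / elempack
    int packed_outputs;    // num_output / out_elempack
    int packed_elem_bytes; // 4 bytes per float * elempack * out_elempack
};

Conv1DSizes conv1d_sizes(int kernel_w, int num_input, int num_output, int elempack, int out_elempack)
{
    Conv1DSizes s;
    s.weight_count = kernel_w * num_input * num_output;
    s.packed_inputs = num_input / elempack;
    s.packed_outputs = num_output / out_elempack;
    s.packed_elem_bytes = 4 * elempack * out_elempack;
    return s;
}
```

With kernel_w = 7, num_input = 5120, num_output = 8 and both pack sizes equal to 4, this reproduces the logged values 286720, 1280, 2 and 64.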
I implemented Convolution1d for the GPU (float32, pack1 and pack4 blobs, unpacked weights). A 7-second voice phrase is now generated in 7 seconds, which is a good result.
7-second voice phrase without GPU:
------------------------
Inference duration: 177 seconds
Out matrix size W x H = 173312 x 1 number of channels 1
7-second voice phrase with GPU:
------------------------
Inference duration: 7 seconds
Out matrix size W x H = 173312 x 1 number of channels 1
This is real time.
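"Real time" here can be expressed as a real-time factor: seconds of compute per second of generated audio. A tiny sketch, with the hypothetical helper name `real_time_factor`:

```cpp
// Real-time factor: seconds of compute per second of generated audio.
// Values at or below 1.0 mean synthesis keeps up with playback.
double real_time_factor(double inference_seconds, double audio_seconds)
{
    return inference_seconds / audio_seconds;
}
```

Using the durations reported in this thread, the CPU run gives 177 / 7, roughly 25, the GPU run gives 7 / 7 = 1, and the earlier 3-second phrase that took 36 seconds gives 12.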
convolution1d_pack4.comp (float 32 pack4 blobs and input unpacked weights)
#version 450
#if NCNN_fp16_storage
#extension GL_EXT_shader_16bit_storage: require
#endif
#if NCNN_fp16_arithmetic
#extension GL_EXT_shader_explicit_arithmetic_types_float16: require
#endif
//#extension GL_EXT_debug_printf : enable
#extension GL_GOOGLE_include_directive: enable
#include "vulkan_activation.comp"
layout (constant_id = 0) const int kernel_w = 1;
layout (constant_id = 1) const int dilation_w = 1;
layout (constant_id = 2) const int stride_w = 1;
layout (constant_id = 3) const int bias_term = 0;
layout (constant_id = 4) const int activation_type = 0;
layout (constant_id = 5) const float activation_param_0 = 0;
layout (constant_id = 6) const float activation_param_1 = 0;
#define shape_constant_id_offset 7
layout (constant_id = shape_constant_id_offset + 0) const int dims = 0;
layout (constant_id = shape_constant_id_offset + 1) const int w = 0;
layout (constant_id = shape_constant_id_offset + 2) const int h = 0;
layout (constant_id = shape_constant_id_offset + 3) const int c = 0;
layout (constant_id = shape_constant_id_offset + 4) const int cstep = 0;
layout (constant_id = shape_constant_id_offset + 5) const int outdims = 0;
layout (constant_id = shape_constant_id_offset + 6) const int outw = 0;
layout (constant_id = shape_constant_id_offset + 7) const int outh = 0;
layout (constant_id = shape_constant_id_offset + 8) const int outc = 0;
layout (constant_id = shape_constant_id_offset + 9) const int outcstep = 0;
#if NCNN_image_shader
layout (binding = 0) uniform unfp sampler3D bottom_blob;
layout (binding = 1, imfmtc4) writeonly uniform unfp image3D top_blob;
layout (binding = 2) uniform unfp sampler3D weight_blob;
layout (binding = 3) uniform unfp sampler3D bias_blob;
#else
//layout (binding = 0) readonly buffer bottom_blob { sfp bottom_blob_data[]; };
layout (binding = 0) readonly buffer bottom_blob { sfpvec4 bottom_blob_data[]; };
//layout (binding = 1) writeonly buffer top_blob { sfp top_blob_data[]; };
layout (binding = 1) writeonly buffer top_blob { sfpvec4 top_blob_data[]; };
//layout (binding = 2) readonly buffer weight_blob { sfp weight_data[]; };
//layout (binding = 3) readonly buffer bias_blob { sfp bias_data[]; };
#if NCNN_fp16_packed || (NCNN_fp16_storage && !NCNN_fp16_arithmetic)
layout (binding = 2) readonly buffer weight_blob { sfpvec4 weight_data[]; };
#else
//layout (binding = 2) readonly buffer weight_blob { sfpmat4 weight_data[]; };
//layout (binding = 2) readonly buffer weight_blob { sfpvec4 weight_data[]; };
layout (binding = 2) readonly buffer weight_blob { sfp weight_data[]; };
#endif
layout (binding = 3) readonly buffer bias_blob { sfpvec4 bias_data[]; };
#endif
layout (push_constant) uniform parameter
{
int dims;
int w;
int h;
int c;
int cstep;
int outdims;
int outw;
int outh;
int outc;
int outcstep;
} p;
/*
void print_bottblob()
{
int gx = int(gl_GlobalInvocationID.x);
int gy = int(gl_GlobalInvocationID.y);
int gz = int(gl_GlobalInvocationID.z);
if (gx >= 1 || gy >= 1 || gz >= 1)
return;
//debugPrintfEXT("Hello %i, %i\n", gx, gy);
for (int i = 0; i < psc(h)/4; ++i) {
for (int j = 0; j < psc(w); ++j) {
//for (int j = 0; j < psc(h); ++j) {
//afp v = buffer_ld1(bottom_blob_data, 3);
//debugPrintfEXT("Elem %d %d: %f ", i, j, v);
//debugPrintfEXT("Bot_Blob %d %d: %f ", i, j, bottom_blob_data[i*psc(h)+j]);
afpvec4 test = buffer_ld4(bottom_blob_data, i*psc(w)+j);
debugPrintfEXT(" Top_Blob %d %d: %v4f ", i, j, test);
//afpvec4 value;
//value = buffer_ld4(bottom_blob_data, i*psc(h)+j );
//debugPrintfEXT("Bot_Blob %d %d: %f ", i, j, value);
}
debugPrintfEXT("\n");
}
}
void print_weight()
{
int gx = int(gl_GlobalInvocationID.x);
int gy = int(gl_GlobalInvocationID.y);
int gz = int(gl_GlobalInvocationID.z);
if (gx >= 1 || gy >= 1 || gz >= 1)
return;
debugPrintfEXT("Hello %i, %i\n", gx, gy);
for (int i = 0; i < psc(outh)*4; ++i) {
for (int j = 0; j < psc(outw)*kernel_w; ++j) {
//afp v = buffer_ld1(bottom_blob_data, 3);
//debugPrintfEXT("Elem %d %d: %f ", i, j, v);
debugPrintfEXT("Weight %d %d: %f ", i, j, weight_data[i*psc(outw)*kernel_w+j]);
//afpvec4 test = buffer_ld4(weight_data, i*psc(outw)+j);
//debugPrintfEXT(" Weight %d %d: %v4f ", i, j, test);
}
debugPrintfEXT("\n");
}
}
*/
void main()
{
int gx = int(gl_GlobalInvocationID.x) * 2;
int gy = int(gl_GlobalInvocationID.y) * 2;
int gz = int(gl_GlobalInvocationID.z) * 2;
//print_bottblob();
//print_weight();
if (gx >= psc(outw) || gy >= psc(outh) || gz >= psc(outc))
return;
const ivec2 gx2 = gx + ivec2(0, 1);
const ivec2 gy2 = gy + ivec2(0, 1);
const ivec2 gy4 = gy*4 + ivec2(0, 4);
const ivec2 gz2 = gz + ivec2(0, 1);
afpvec4 sum0 = afpvec4(0.0f);
afpvec4 sum1 = afpvec4(0.0f);
afpvec4 sum2 = afpvec4(0.0f);
afpvec4 sum3 = afpvec4(0.0f);
afpvec4 sum4 = afpvec4(0.0f);
afpvec4 sum5 = afpvec4(0.0f);
afpvec4 sum6 = afpvec4(0.0f);
afpvec4 sum7 = afpvec4(0.0f);
afpvec4 sum8 = afpvec4(0.0f);
afpvec4 sum9 = afpvec4(0.0f);
afpvec4 sum10 = afpvec4(0.0f);
afpvec4 sum11 = afpvec4(0.0f);
afpvec4 sum12 = afpvec4(0.0f);
afpvec4 sum13 = afpvec4(0.0f);
afpvec4 sum14 = afpvec4(0.0f);
afpvec4 sum15 = afpvec4(0.0f);
afpvec4 sum16 = afpvec4(0.0f);
afpvec4 sum17 = afpvec4(0.0f);
afpvec4 sum18 = afpvec4(0.0f);
afpvec4 sum19 = afpvec4(0.0f);
if (bias_term == 1)
{
#if NCNN_image_shader
sum = image2d_ld1(bias_blob, ivec2(gx, 0));
#else
sum4 = buffer_ld4(bias_data, gy2.x);
sum5 = sum4;
sum14 = buffer_ld4(bias_data, gy2.y);
sum15 = sum14;
#endif
}
#if NCNN_image_shader
//
#else
ivec4 gy4_0 = gy4.x + ivec4(0, 1, 2, 3);
ivec4 gy4_1 = gy4.y + ivec4(0, 1, 2, 3);
ivec4 w_offsetv4_0;
ivec4 w_offsetv4_1;
w_offsetv4_0 = kernel_w * psc(h) * 4 * gy4_0;
w_offsetv4_1 = kernel_w * psc(h) * 4 * gy4_1;
for (int iny = 0; iny < psc(h); iny++)
{
ivec2 v_offsetv = iny * psc(w) + gx2 * stride_w;
for (int x = 0; x < kernel_w; x++)
{
afpvec4 v0 = buffer_ld4(bottom_blob_data, v_offsetv.x + x * dilation_w);
afpvec4 v1 = buffer_ld4(bottom_blob_data, v_offsetv.y + x * dilation_w);
afp k0 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 0); // Load the weight value
afp k1 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 1); // Load the weight value
afp k2 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 2); // Load the weight value
afp k3 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 3); // Load the weight value
afp k4 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 0); // Load the weight value
afp k5 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 1); // Load the weight value
afp k6 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 2); // Load the weight value
afp k7 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 3); // Load the weight value
afp k8 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 0); // Load the weight value
afp k9 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 1); // Load the weight value
afp k10 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 2); // Load the weight value
afp k11 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 3); // Load the weight value
afp k12 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 0); // Load the weight value
afp k13 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 1); // Load the weight value
afp k14 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 2); // Load the weight value
afp k15 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 3); // Load the weight value
afp k16 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 0); // Load the weight value
afp k17 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 1); // Load the weight value
afp k18 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 2); // Load the weight value
afp k19 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 3); // Load the weight value
afp k20 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 0); // Load the weight value
afp k21 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 1); // Load the weight value
afp k22 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 2); // Load the weight value
afp k23 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 3); // Load the weight value
afp k24 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 0); // Load the weight value
afp k25 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 1); // Load the weight value
afp k26 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 2); // Load the weight value
afp k27 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 3); // Load the weight value
afp k28 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 0); // Load the weight value
afp k29 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 1); // Load the weight value
afp k30 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 2); // Load the weight value
afp k31 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 3); // Load the weight value
#if NCNN_fp16_packed || (NCNN_fp16_storage && !NCNN_fp16_arithmetic)
// GL_EXT_shader_16bit_storage does not define f16mat4 type :(
afpmat4 k0 = afpmat4(
buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 0),
buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 1),
buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 2),
buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 3)
);
afpmat4 k1 = afpmat4(
buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 0),
buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 1),
buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 2),
buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 3)
);
#else
#endif
//debugPrintfEXT(" k0, k1, k2, k3 %f, %f, %f, %f \n", k0, k1, k2, k3);
//debugPrintfEXT(" k4, k5, k6, k7 %f, %f, %f, %f \n", k4, k5, k6, k7);
sum0 += v0 * afpvec4(k0, k1, k2, k3); //* k0;
sum1 += v1 * afpvec4(k0, k1, k2, k3); //* k0;
sum2 += v0 * afpvec4(k4, k5, k6, k7); //* k1;
sum3 += v1 * afpvec4(k4, k5, k6, k7); //* k1;
sum6 += v0 * afpvec4(k8, k9, k10, k11); //* k0;
sum7 += v1 * afpvec4(k8, k9, k10, k11); //* k0;
sum8 += v0 * afpvec4(k12, k13, k14, k15); //* k1;
sum9 += v1 * afpvec4(k12, k13, k14, k15); //* k1;
sum10 += v0 * afpvec4(k16, k17, k18, k19); //* k0;
sum11 += v1 * afpvec4(k16, k17, k18, k19); //* k0;
sum12 += v0 * afpvec4(k20, k21, k22, k23); //* k1;
sum13 += v1 * afpvec4(k20, k21, k22, k23); //* k1;
sum16 += v0 * afpvec4(k24, k25, k26, k27); //* k0;
sum17 += v1 * afpvec4(k24, k25, k26, k27); //* k0;
sum18 += v0 * afpvec4(k28, k29, k30, k31); //* k1;
sum19 += v1 * afpvec4(k28, k29, k30, k31); //* k1;
}
w_offsetv4_0 += kernel_w*4;
w_offsetv4_1 += kernel_w*4;
}
sum4.x += sum0.x + sum0.y + sum0.z + sum0.w;
sum4.y += sum2.x + sum2.y + sum2.z + sum2.w;
sum4.z += sum6.x + sum6.y + sum6.z + sum6.w;
sum4.w += sum8.x + sum8.y + sum8.z + sum8.w;
sum5.x += sum1.x + sum1.y + sum1.z + sum1.w;
sum5.y += sum3.x + sum3.y + sum3.z + sum3.w;
sum5.z += sum7.x + sum7.y + sum7.z + sum7.w;
sum5.w += sum9.x + sum9.y + sum9.z + sum9.w;
sum14.x += sum10.x + sum10.y + sum10.z + sum10.w;
sum14.y += sum12.x + sum12.y + sum12.z + sum12.w;
sum14.z += sum16.x + sum16.y + sum16.z + sum16.w;
sum14.w += sum18.x + sum18.y + sum18.z + sum18.w;
sum15.x += sum11.x + sum11.y + sum11.z + sum11.w;
sum15.y += sum13.x + sum13.y + sum13.z + sum13.w;
sum15.z += sum17.x + sum17.y + sum17.z + sum17.w;
sum15.w += sum19.x + sum19.y + sum19.z + sum19.w;
#endif
sum4 = activation_afpvec4(sum4, activation_type, activation_param_0, activation_param_1);
sum5 = activation_afpvec4(sum5, activation_type, activation_param_0, activation_param_1);
sum14 = activation_afpvec4(sum14, activation_type, activation_param_0, activation_param_1);
sum15 = activation_afpvec4(sum15, activation_type, activation_param_0, activation_param_1);
#if NCNN_image_shader
image2d_st1(top_blob, ivec3(gx2.x, gy2.x, gz2.x), sum0);
image2d_st1(top_blob, ivec3(gx2.y, gy2.x, gz2.x), sum1);
image2d_st1(top_blob, ivec3(gx2.x, gy2.y, gz2.x), sum2);
image2d_st1(top_blob, ivec3(gx2.y, gy2.y, gz2.x), sum3);
#else
// gx < outw and gy < outh are guaranteed by the early return above, so only the +1 neighbours need a bounds check
buffer_st4(top_blob_data, gy2.x * psc(outw) + gx2.x, sum4);
if (gx + 1 < psc(outw)) buffer_st4(top_blob_data, gy2.x * psc(outw) + gx2.y, sum5);
if (gy + 1 < psc(outh)) buffer_st4(top_blob_data, gy2.y * psc(outw) + gx2.x, sum14);
if (gy + 1 < psc(outh) && gx + 1 < psc(outw)) buffer_st4(top_blob_data, gy2.y * psc(outw) + gx2.y, sum15);
#endif
}
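The pack4 shader reads and writes blobs as vec4 elements. As I understand ncnn's elempack = 4 layout for 2D blobs, four consecutive rows are interleaved so that one element carries the same column from four rows. The sketch below shows that repacking on the CPU. `pack_rows_by_4` is a hypothetical name, and the code assumes `h` is a multiple of 4.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Pack a row-major [h][w] float matrix into groups of four rows.
// Packed element (row / 4) * w + x holds rows row..row+3 at column x in its four lanes.
// Assumes h is a multiple of 4.
std::vector<std::array<float, 4> > pack_rows_by_4(const std::vector<float>& src, int w, int h)
{
    std::vector<std::array<float, 4> > dst(static_cast<std::size_t>(h / 4) * w);
    for (int row = 0; row < h; row++)
    {
        for (int x = 0; x < w; x++)
        {
            dst[(row / 4) * w + x][row % 4] = src[row * w + x];
        }
    }
    return dst;
}
```

This is why the packed shader indexes the blobs with `iny * psc(w) + x` and gets four channels per load, while the unpacked weights still have to be addressed one scalar at a time.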
I have finished a working convolution1d_vulkan for fp32, with these options:
opt.use_fp16_packed = false;
opt.use_fp16_storage = false;
opt.use_fp16_arithmetic = false;
opt.use_int8_storage = false;
opt.use_int8_arithmetic = false;
opt.use_int8_packed = false;
convolution1d.comp convolution1d_pack1to4.comp convolution1d_pack4.comp convolution1d_pack4to1.comp in progress
Inference duration for this mel spectrogram: 5 seconds
vulkan conv1d https://github.com/Tencent/ncnn/pull/5060
Hi @nihui, I tried the new conv1d shaders and layer from https://github.com/Tencent/ncnn/pull/5060/files (convolution1d_vulkan.cpp, convolution1d_vulkan.h) and they don't work correctly for me: I get no sound, the output is just noise. Try my model ncnn-hifi-GAN with opt.use_vulkan_compute = true (I get noise) and with opt.use_vulkan_compute = false (I get sound). With my own fp32 shaders it worked correctly and I got sound. My files: Convolution1D_vulkan.cpp Convolution1D_vulkan.h convolution1d.comp convolution1d_pack1to4.comp convolution1d_pack4.comp
try disabling fp16
The following test prints the same result on CPU and GPU:
int main()
{
ncnn::Net net;
net.opt.use_vulkan_compute = true;
// net.opt.use_vulkan_compute = false;
net.opt.use_fp16_packed = false;
net.opt.use_fp16_storage = false;
net.opt.use_fp16_arithmetic = false;
net.load_param("/home/nihui/osd/ncnn-nihui/mytools/hifivoice.ncnn.param");
net.load_model("/home/nihui/osd/ncnn-nihui/mytools/hifivoice.ncnn.bin");
{
ncnn::Extractor ex = net.create_extractor();
ex.set_vulkan_compute(false);
ncnn::Mat in0 = RandomMat(64, 80);
ex.input("in0", in0);
ncnn::Mat out0;
ex.extract("out0", out0);
fprintf(stderr, "out0 %d %d %d %d %d\n", out0.dims, out0.w, out0.h, out0.d, out0.c);
fprintf(stderr, "out0 %f %f %f %f %f %f\n", out0[0], out0[1], out0[10], out0[20], out0[200], out0[1020]);
}
{
ncnn::Extractor ex = net.create_extractor();
ex.set_vulkan_compute(true);
ncnn::Mat in0 = RandomMat(64, 80);
ex.input("in0", in0);
ncnn::Mat out0;
ex.extract("out0", out0);
fprintf(stderr, "out0 %d %d %d %d %d\n", out0.dims, out0.w, out0.h, out0.d, out0.c);
fprintf(stderr, "out0 %f %f %f %f %f %f\n", out0[0], out0[1], out0[10], out0[20], out0[200], out0[1020]);
}
return 0;
}
[nihui@nihuini-LC2 mytools]$ ./testnet
[0 AMD Radeon Graphics (RADV NAVI14)] queueC=1[4] queueG=0[1] queueT=0[1]
[0 AMD Radeon Graphics (RADV NAVI14)] bugsbn1=0 bugbilz=0 bugcopc=0 bugihfa=0
[0 AMD Radeon Graphics (RADV NAVI14)] fp16-p/s/a=1/1/1 int8-p/s/a=1/1/1
[0 AMD Radeon Graphics (RADV NAVI14)] subgroup=64 basic/vote/ballot/shuffle=1/1/1/1
[0 AMD Radeon Graphics (RADV NAVI14)] fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0
[1 llvmpipe (LLVM 16.0.6, 256 bits)] queueC=0[1] queueG=0[1] queueT=0[1]
[1 llvmpipe (LLVM 16.0.6, 256 bits)] bugsbn1=0 bugbilz=0 bugcopc=0 bugihfa=0
[1 llvmpipe (LLVM 16.0.6, 256 bits)] fp16-p/s/a=1/1/1 int8-p/s/a=1/1/1
[1 llvmpipe (LLVM 16.0.6, 256 bits)] subgroup=8 basic/vote/ballot/shuffle=1/1/1/1
[1 llvmpipe (LLVM 16.0.6, 256 bits)] fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0
out0 2 16384 1 1 1
out0 0.048641 0.074993 -0.038795 0.100711 -0.048371 0.117223
out0 2 16384 1 1 1
out0 0.048641 0.074993 -0.038795 0.100711 -0.048371 0.117224
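The two runs differ only in the last printed digit of one value. A sketch of how such a CPU/GPU comparison could be automated instead of read by eye follows. `max_abs_diff` is a hypothetical helper, and any tolerance chosen (for example 1e-5) is my assumption, not a threshold used by ncnn.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute element-wise difference between two result vectors.
// Only the common prefix is compared if the sizes differ.
double max_abs_diff(const std::vector<double>& a, const std::vector<double>& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < n; i++)
    {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

Applied to the six sampled values printed above, the largest difference is about one unit in the sixth decimal place.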
Hi @nihui, thank you for your work. ncnn is now open to new directions such as sound synthesis, voice conversion, music synthesis and TTS. I checked your code and of course I get correct results.
Z:\AI_SDK\VAE-GAN\HIFIVoice_cpp>hifivoice.exe -i melgram_flipped.jpg
Input option value=melgram_flipped.jpg
path = melgram_flipped.jpgimagepath0: melgram_flipped.jpg
argv[0]: mel
argv[1]: melgram_flipped.jpg
[0 NVIDIA GeForce RTX 3060] queueC=2[8] queueG=0[16] queueT=1[2]
[0 NVIDIA GeForce RTX 3060] bugsbn1=0 bugbilz=0 bugcopc=0 bugihfa=0
[0 NVIDIA GeForce RTX 3060] fp16-p/s/a=1/1/1 int8-p/s/a=1/1/1
[0 NVIDIA GeForce RTX 3060] subgroup=32 basic=1 vote=1 ballot=1 shuffle=1
in0 2 64 80 1 1
out0 2 16384 1 1 1
out0 0.048641 0.074994 -0.038795 0.100711 -0.048369 0.117221
out0 2 16384 1 1 1
out0 0.048641 0.074994 -0.038795 0.100711 -0.048372 0.117223
Max mel magnitude val: 1.89804
Min mel magnitude val: -11
[677 x 80]; ch: 1
MelIn 3 677 80 1 1
Inference duration: 6 seconds
Out matrix size W x H = 173312 x 1 number of channels 1
Final
Z:\AI_SDK\VAE-GAN\HIFIVoice_cpp>
I also found what the problem was: convolution1d with vulkan=true and convolution1d with vulkan=false handle an ncnn::Mat with the wrong number of dimensions differently.
For example, convolution1d expects a 2-dimensional input, and I passed an ncnn::Mat with 3 dimensions.
With vulkan=false, convolution1d treats the 3-dimensional ncnn::Mat correctly as 2-dimensional, but with vulkan=true it produces the wrong result. My code was like this:
ncnn::Mat MelIn(melscpectro.cols, melscpectro.rows, 1, (void*)melscpectro.data);
fprintf(stderr, "MelIn %d %d %d %d %d\n", MelIn.dims, MelIn.w, MelIn.h, MelIn.d, MelIn.c);
and I was getting an erroneous result with vulkan=true because dims=3:
MelIn 3 677 80 1 1
Now I have changed the code:
ncnn::Mat MelIn(melscpectro.cols, melscpectro.rows, (void*)melscpectro.data);
fprintf(stderr, "MelIn %d %d %d %d %d\n", MelIn.dims, MelIn.w, MelIn.h, MelIn.d, MelIn.c);
Output:
MelIn 2 677 80 1 1
and I get the correct result with convolution1d vulkan=true. Thank you again @nihui for your work!
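A cheap guard before inference would have caught this. The sketch below does not use ncnn. `BlobShape`, `make_2d`, `make_3d` and `is_valid_conv1d_input` are hypothetical stand-ins for the shape fields printed above. With a real `ncnn::Mat`, the equivalent check would simply test `MelIn.dims == 2`.

```cpp
// Hypothetical, ncnn-free stand-in for the shape fields printed in the thread.
struct BlobShape
{
    int dims;
    int w;
    int h;
    int c;
};

// Mirrors the two constructor forms used above: (w, h) gives a 2D blob,
// (w, h, c) gives a 3D blob even when c == 1.
BlobShape make_2d(int w, int h) { return BlobShape{2, w, h, 1}; }
BlobShape make_3d(int w, int h, int c) { return BlobShape{3, w, h, c}; }

// Guard to run before inference: a 1D convolution expects a 2D (w, h) input.
bool is_valid_conv1d_input(const BlobShape& s)
{
    return s.dims == 2 && s.w > 0 && s.h > 0;
}
```

Here `make_3d(677, 80, 1)` fails the check even though it holds the same data as `make_2d(677, 80)`, which matches the behaviour difference described above.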
Simple question: my model has many Convolution1D and Deconvolution1D layers, and the execution time on CPU and Vulkan is about the same. Does ncnn support Vulkan acceleration for Convolution1D and Deconvolution1D layers?