Tencent / ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

Convolution1D and Deconvolution1D layers #4811

Open magicse opened 1 year ago

magicse commented 1 year ago

Simple question. My model has many Convolution1D and Deconvolution1D layers, and the execution time on CPU and Vulkan is about the same. Does ncnn support Vulkan acceleration for Convolution1D and Deconvolution1D layers?

nihui commented 1 year ago

currently, no vulkan conv1d / deconv1d

magicse commented 1 year ago

Thank you, @nihui. Is there an example of a custom layer template that uses Vulkan somewhere? Something like implement-custom-layer-step-by-step, but for Vulkan.
I want to try making my own custom Conv1d layer with Vulkan support, because without Vulkan my HiFi-GAN vocoder is quite slow: a 3-second vocal phrase takes 36 seconds to generate.

Baiyuetribe commented 1 year ago

Additionally, this holds true for the vocoders of both VITS and DiffSinger. In short, all TTS synthesis relies on this.

magicse commented 1 year ago

I had to create Convolution1D_vulkan.cpp

#include "Convolution1D_vulkan.h"
#include "layer_shader_type.h"
#include "layer_type.h"

Convolution1D_vulkan::Convolution1D_vulkan()
{
    one_blob_only = true;
    support_vulkan = true;
    support_image_storage = true;
    pipeline_convolution1d = 0;
    reshape_w = 0;
}
int Convolution1D_vulkan::create_pipeline(const Option& _opt)
{
 ...
}
int Convolution1D_vulkan::destroy_pipeline(const Option&)
{
    //
    return 0; // non-void function must return a value
}
int Convolution1D_vulkan::upload_model(VkTransfer& cmd, const Option& opt)
{
....
}
int Convolution1D_vulkan::forward(const VkMat& bottom_blob, VkMat& top_blob, VkCompute& cmd, const Option& opt) const
{
...
}

Convolution1D_vulkan.h

All needed implementations

Main.cpp

#include "Convolution1D_vulkan.h"
DEFINE_LAYER_CREATOR(Convolution1D_vulkan)
...
ncnn::Net HIFIVOICE;
HIFIVOICE.register_custom_layer("Convolution1D_vulkan", Convolution1D_vulkan_layer_creator);

Everything compiled fine. I also have convolution1d.comp, from which I had to generate convolution1d.text2hex.txt and convolution1d.hex.h. As far as I can see, the native ncnn Vulkan shaders are selected through indexes:

        int shader_type_index = -1;
        if (elempack == 1 && out_elempack == 1) shader_type_index = LayerShaderType::convolution;
        if (elempack == 4 && out_elempack == 4) shader_type_index = LayerShaderType::convolution_pack4;

        pipeline_convolution1d = new Pipeline(vkdev);
        pipeline_convolution1d->set_optimal_local_size_xyz(local_size_xyz);
        pipeline_convolution1d->create(shader_type_index, opt, specializations);

But I don't know how to do this in my custom layer without layer_shader_type.h and layer_shader_type_enum.h.

magicse commented 1 year ago

I found out how to do this:

    static std::vector<uint32_t> spirv;

    static ncnn::Mutex lock;
    {
        ncnn::MutexLockGuard guard(lock);
        if (spirv.empty())
        {
            compile_spirv_module(convolution1d_comp_data, sizeof(convolution1d_comp_data), opt, spirv);
        }

    }

        std::vector<vk_specialization_type> specializations(7 + 10);
        specializations[0].i = kernel_w;
        specializations[1].i = dilation_w;
        specializations[2].i = stride_w;
        specializations[3].i = bias_term;
        specializations[4].i = activation_type;
        specializations[5].f = activation_params.w >= 1 ? activation_params[0] : 0.f;
        specializations[6].f = activation_params.w == 2 ? activation_params[1] : 0.f;
        specializations[7 + 0].i = shape_bordered_packed.dims;
        specializations[7 + 1].i = shape_bordered_packed.w;
        specializations[7 + 2].i = shape_bordered_packed.h;
        specializations[7 + 3].i = shape_bordered_packed.c;
        specializations[7 + 4].i = shape_bordered_packed.cstep;
        specializations[7 + 5].i = out_shape_packed.dims;
        specializations[7 + 6].i = out_shape_packed.w;
        specializations[7 + 7].i = out_shape_packed.h;
        specializations[7 + 8].i = out_shape_packed.c;
        specializations[7 + 9].i = out_shape_packed.cstep;

        Mat local_size_xyz(8, 8, std::min(4, (num_output / out_elempack + 1) / 2), (void*)0);
        if (out_shape_packed.dims != 0)
        {
            local_size_xyz.w = std::min(8, out_shape_packed.w);
            local_size_xyz.h = std::min(8, out_shape_packed.h);
            local_size_xyz.c = std::min(4, (out_shape_packed.c + 1) / 2);
        }

        pipeline_convolution1d = new Pipeline(vkdev);
        pipeline_convolution1d->set_optimal_local_size_xyz(local_size_xyz);
        pipeline_convolution1d->create(spirv.data(), spirv.size() * 4, specializations);
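
The compile-once step at the top of this snippet can be tried in isolation. Below is a minimal standalone sketch of the same pattern, assuming nothing about ncnn itself: `fake_compile` is a hypothetical stand-in for `compile_spirv_module`, and `std::mutex` replaces `ncnn::Mutex`. It only illustrates that the module is built on first use and reused afterwards, and why the size passed to `create` above is multiplied by 4 (words to bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a shader compiler call such as compile_spirv_module();
// it only counts calls and fills in dummy words (not a valid SPIR-V module).
static int g_compile_calls = 0;

static int fake_compile(std::vector<uint32_t>& spirv)
{
    ++g_compile_calls;
    spirv.assign(16, 0x07230203u);
    return 0;
}

// Returns the cached module, compiling it on first use only.
// The mutex keeps the check-and-compile step safe if several layers
// create their pipelines from different threads.
const std::vector<uint32_t>& get_cached_spirv()
{
    static std::vector<uint32_t> spirv;
    static std::mutex lock;

    std::lock_guard<std::mutex> guard(lock);
    if (spirv.empty())
    {
        int ret = fake_compile(spirv);
        assert(ret == 0 && !spirv.empty());
        (void)ret;
    }
    return spirv;
}

// The module is stored as 32-bit words; a byte count is words * 4.
size_t spirv_size_in_bytes(const std::vector<uint32_t>& spirv)
{
    return spirv.size() * sizeof(uint32_t);
}
```

Calling `get_cached_spirv()` from every layer instance returns the same vector, so a model with many Convolution1D layers compiles the shader once rather than once per layer.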

magicse commented 1 year ago

I have only one question. In the shaders I saw "afp v0", "sfp", "psc(c)" and "afpvec4". I don't understand what these types are: afp, psc, sfp, afpvec4? I couldn't find any information about them.

| GLSL data type | C data type | Description |
|---|---|---|
| bool | int | A conditional type, taking on values of true or false. |
| int | int | Signed integer. |
| float | float | Single floating-point scalar. |
| vec2 | float [2] | Two component floating-point vector. |
| vec3 | float [3] | Three component floating-point vector. |
| vec4 | float [4] | Four component floating-point vector. |
| bvec2 | int [2] | Two component Boolean vector. |
| bvec3 | int [3] | Three component Boolean vector. |
| bvec4 | int [4] | Four component Boolean vector. |
| ivec2 | int [2] | Two component signed integer vector. |
| ivec3 | int [3] | Three component signed integer vector. |
| ivec4 | int [4] | Four component signed integer vector. |
| mat2 | float [4] | 2×2 floating-point matrix. |
| mat3 | float [9] | 3×3 floating-point matrix. |
| mat4 | float [16] | 4×4 floating-point matrix. |
| sampler1D | int | Handle for accessing a 1D texture. |
| sampler2D | int | Handle for accessing a 2D texture. |
| sampler3D | int | Handle for accessing a 3D texture. |
| samplerCube | int | Handle for accessing a cubemap texture. |
| sampler1DShadow | int | A handle for accessing a 1D depth texture with comparison. |
| sampler2DShadow | int | A handle for accessing a 2D depth texture with comparison. |

magicse commented 1 year ago

I found the declarations in gpu.cpp.

nihui commented 1 year ago

> I have only one question. In the shaders I saw "afp v0", "sfp", "psc(c)" and "afpvec4". I don't understand what these types are: afp, psc, sfp, afpvec4? I couldn't find any information about them.

under construction ... https://github.com/nihui/ncnn/blob/doc-glsl-ext/docs/developer-guide/glsl-extension.md

magicse commented 1 year ago

Hi @nihui, thank you for the link and for helping. I am now trying to write the conv1d shader. Maybe it will be ready soon )))

nihui commented 1 year ago

> Hi @nihui, thank you for the link and for helping. I am now trying to write the conv1d shader. Maybe it will be ready soon )))

Hi, if you use QQ you can join the ncnn QQ group (see the ncnn README), where I can help more promptly.

nihui commented 1 year ago

https://github.com/Tencent/ncnn/wiki/glsl-extension
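
For readers who land here with the same question about `psc`: as far as I understand the linked page, `psc(x)` picks the specialization constant `x` when it was set to a non-zero value at pipeline creation and otherwise falls back to the push constant `p.x` supplied at dispatch time. `afp` and `sfp` are likewise described there as the arithmetic and storage floating-point types, which become fp16 or fp32 depending on the options. The C++ function below is only an analogue of the `psc` behaviour, written to make the fallback rule concrete; `psc_like` is a made-up name and not part of ncnn.

```cpp
#include <cassert>

// C++ analogue of the psc(x) selection rule as understood from the
// glsl-extension page (an assumption, not ncnn code):
//   spec_value - shape value baked in when the pipeline was created,
//                0 meaning "unknown at creation time"
//   push_value - shape value passed with each dispatch
inline int psc_like(int spec_value, int push_value)
{
    assert(spec_value >= 0 && push_value >= 0);
    return spec_value == 0 ? push_value : spec_value;
}
```

This would also explain why the shaders in this thread declare every shape field twice, once as a `constant_id` and once inside the `push_constant` block: with empty shape hints all the constants are 0 and the push constants are used.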

magicse commented 1 year ago

Work in progress

convolution1d.comp for kernel_w > 1 and elempack 1

#version 450

#if NCNN_fp16_storage
#extension GL_EXT_shader_16bit_storage: require
#endif
#if NCNN_fp16_arithmetic
#extension GL_EXT_shader_explicit_arithmetic_types_float16: require
#endif

#extension GL_EXT_debug_printf : enable
#extension GL_GOOGLE_include_directive: enable
#include "vulkan_activation.comp"

layout (constant_id = 0) const int kernel_w = 1;
layout (constant_id = 1) const int dilation_w = 1;
layout (constant_id = 2) const int stride_w = 1;
layout (constant_id = 3) const int bias_term = 0;
layout (constant_id = 4) const int activation_type = 0;
layout (constant_id = 5) const float activation_param_0 = 0;
layout (constant_id = 6) const float activation_param_1 = 0;

#define shape_constant_id_offset 7
layout (constant_id = shape_constant_id_offset + 0) const int dims = 0;
layout (constant_id = shape_constant_id_offset + 1) const int w = 0;
layout (constant_id = shape_constant_id_offset + 2) const int h = 0;
layout (constant_id = shape_constant_id_offset + 3) const int c = 0;
layout (constant_id = shape_constant_id_offset + 4) const int cstep = 0;

layout (constant_id = shape_constant_id_offset + 5) const int outdims = 0;
layout (constant_id = shape_constant_id_offset + 6) const int outw = 0;
layout (constant_id = shape_constant_id_offset + 7) const int outh = 0;
layout (constant_id = shape_constant_id_offset + 8) const int outc = 0;
layout (constant_id = shape_constant_id_offset + 9) const int outcstep = 0;

#if NCNN_image_shader
layout (binding = 0) uniform unfp sampler2D bottom_blob;
layout (binding = 1, imfmtc1) writeonly uniform unfp image2D top_blob;
layout (binding = 2) uniform unfp sampler3D weight_blob;
layout (binding = 3) uniform unfp sampler3D bias_blob;
#else
layout (binding = 0) readonly buffer bottom_blob { sfp bottom_blob_data[]; };
layout (binding = 1) writeonly buffer top_blob { sfp top_blob_data[]; };
layout (binding = 2) readonly buffer weight_blob { sfp weight_data[]; };
layout (binding = 3) readonly buffer bias_blob { sfp bias_data[]; };
#endif

layout (push_constant) uniform parameter
{
   int dims;
   int w;
   int h;
   int c;
   int cstep;

   int outdims;
   int outw;
   int outh;
   int outc;
   int outcstep;
} p;

void print_bottom_blob()
{    
    int gx = int(gl_GlobalInvocationID.x);
    int gy = int(gl_GlobalInvocationID.y);
    int gz = int(gl_GlobalInvocationID.z);
    if (gx >= 1 || gy >= 1)
            return;
    debugPrintfEXT("Hello %i, %i\n", gx, gy);
    for (int i = 0; i < psc(w); ++i) {
        for (int j = 0; j < psc(h); ++j) {
        debugPrintfEXT("Elem %d %d: %f ", i, j, bottom_blob_data[i*psc(h)+j]);
        }
        debugPrintfEXT("\n");
    }
}

void main()
{

    int gx = int(gl_GlobalInvocationID.x) * 2;
    int gy = int(gl_GlobalInvocationID.y) * 2;
    int gz = int(gl_GlobalInvocationID.z) * 2;
    //print_bottom_blob();

    if (gx >= psc(outw) || gy >= psc(outh) || gz >= psc(outc))
        return;

    const ivec2 gx2 = gx + ivec2(0, 1);
    const ivec2 gy2 = gy + ivec2(0, 1);

    afp sum0 = afp(0.0f);
    afp sum1 = afp(0.0f);
    afp sum2 = afp(0.0f);
    afp sum3 = afp(0.0f);   

    if (bias_term == 1)
    {
#if NCNN_image_shader
        //sum = image2d_ld1(bias_blob, ivec2(gx, 0));
#else
    sum0 = buffer_ld1(bias_data, gy2.x);
    sum2 = buffer_ld1(bias_data, gy2.y);
    sum1 = sum0;
    sum3 = sum2;
#endif
    }

#if NCNN_image_shader
  //
#else
    ivec2 w_offsetv = kernel_w * psc(h) * gy2; //  weight offset
    for (int iny = 0; iny < psc(h); iny++)
    {
        ivec2 v_offsetv = iny * psc(w) + gx2 * stride_w; // value offset
        for (int x = 0; x < kernel_w; x++)
        {
            afp v0 = buffer_ld1(bottom_blob_data, v_offsetv.x + x * dilation_w); // Load the value +0
            afp v1 = buffer_ld1(bottom_blob_data, v_offsetv.y + x * dilation_w); // Load the value +1
            afp k0 = buffer_ld1(weight_data, w_offsetv.x + x); // Load the weight value +0
            afp k1 = buffer_ld1(weight_data, w_offsetv.y + x); // Load the weight value +1

            sum0 += v0 * k0;
            sum1 += v1 * k0;
            sum2 += v0 * k1;
            sum3 += v1 * k1;
        }
        w_offsetv += kernel_w; // Move to the next set of weights
    }
#endif  
    sum0 = activation_afp(sum0, activation_type, activation_param_0, activation_param_1);
    sum1 = activation_afp(sum1, activation_type, activation_param_0, activation_param_1);
    sum2 = activation_afp(sum2, activation_type, activation_param_0, activation_param_1);
    sum3 = activation_afp(sum3, activation_type, activation_param_0, activation_param_1);

#if NCNN_image_shader
    //image2d_st1(top_blob, ivec3(gx2.x, gy2.x, gz2.x), sum0);
    //image2d_st1(top_blob, ivec3(gx2.y, gy2.x, gz2.x), sum1);
    //image2d_st1(top_blob, ivec3(gx2.x, gy2.y, gz2.x), sum2);
    //image2d_st1(top_blob, ivec3(gx2.y, gy2.y, gz2.x), sum3);
#else
    // (gx, gy) is in range after the early return above; guard only the +1
    // neighbours so the last column/row is still written when outw or outh is odd
    buffer_st1(top_blob_data, gy2.x * psc(outw) + gx2.x, sum0);
    if (gx + 1 < psc(outw)) buffer_st1(top_blob_data, gy2.x * psc(outw) + gx2.y, sum1);
    if (gy + 1 < psc(outh)) buffer_st1(top_blob_data, gy2.y * psc(outw) + gx2.x, sum2);
    if (gy + 1 < psc(outh) && gx + 1 < psc(outw)) buffer_st1(top_blob_data, gy2.y * psc(outw) + gx2.y, sum3);
#endif
}
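
To check a shader like this against known values, a plain CPU reference with the same indexing is handy. The sketch below is my own reading of the buffer path above, not ncnn's CPU layer: it assumes the input is already padded and stored as `h` rows of `w` samples, weights are laid out as `[outh][h][kernel_w]`, and activation is left out. The function name and signature are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the unpacked (elempack 1) buffer path, mirroring the
// shader's offsets:
//   value  index = iny * w + ox * stride_w + x * dilation_w
//   weight index = (oy * h + iny) * kernel_w + x
//   output index = oy * outw + ox
std::vector<float> conv1d_reference(const std::vector<float>& bottom, int w, int h,
                                    const std::vector<float>& weight,
                                    const std::vector<float>& bias, // empty = no bias
                                    int outh, int kernel_w, int dilation_w, int stride_w)
{
    const int kernel_extent = dilation_w * (kernel_w - 1) + 1;
    assert(stride_w >= 1 && w >= kernel_extent);
    assert((int)bottom.size() == w * h);
    assert((int)weight.size() == outh * h * kernel_w);
    assert(bias.empty() || (int)bias.size() == outh);

    const int outw = (w - kernel_extent) / stride_w + 1;
    std::vector<float> top((size_t)outw * outh, 0.f);

    for (int oy = 0; oy < outh; oy++)
    {
        for (int ox = 0; ox < outw; ox++)
        {
            float sum = bias.empty() ? 0.f : bias[oy];
            for (int iny = 0; iny < h; iny++)
            {
                for (int x = 0; x < kernel_w; x++)
                {
                    float v = bottom[iny * w + ox * stride_w + x * dilation_w];
                    float k = weight[(oy * h + iny) * kernel_w + x];
                    sum += v * k;
                }
            }
            top[oy * outw + ox] = sum;
        }
    }
    return top;
}
```

Feeding the same small input to this function and to the GPU layer, then comparing element by element, makes indexing mistakes in the shader show up quickly.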

magicse commented 1 year ago

My convolution1d.comp for kernel_w > 1 and elempack 1 (unpacked float32) works and produces correct results. Convolution1D_vulkan.cpp is set up as a custom layer, and everything works. But I have one problem: Convolution1D_vulkan::create_pipeline(const Option& _opt) receives all parameters correctly except bottom_shapes and top_shapes, which are always empty.

Main.cpp

#include "Convolution1D_vulkan.h"
DEFINE_LAYER_CREATOR(Convolution1D_vulkan)
....
ncnn::Net HIFIVOICE;
HIFIVOICE.opt.use_fp16_packed = false;
HIFIVOICE.opt.use_fp16_storage = false;
HIFIVOICE.opt.use_fp16_arithmetic = false;
HIFIVOICE.opt.use_int8_storage = false;
HIFIVOICE.opt.use_int8_arithmetic = false;
HIFIVOICE.opt.use_int8_packed = false;
HIFIVOICE.opt.use_vulkan_compute = true;
HIFIVOICE.register_custom_layer("Convolution1D_vulkan", Convolution1D_vulkan_layer_creator);
....
int Convolution1D_vulkan::create_pipeline(const Option& _opt)
{
    std::cout << "=== Create Pipeline: ===" << std::endl;
    if (dynamic_weight)
    {
        support_vulkan = false;
        support_image_storage = false;
        return 0;
    }

    // Create a convolution pipeline using Vulkan
    Option opt = _opt;
    const Mat& shape = bottom_shapes.empty() ? Mat() : bottom_shapes[0];
    const Mat& out_shape = top_shapes.empty() ? Mat() : top_shapes[0];
    std::cout << "=== Create Pipeline: ===" << std::endl;
    std::cout << "=== Create Pipeline: New shape from bottom_shapes WxHxC x Dims ===" << shape.w << " " << shape.h << " " << shape.c << " " << shape.d <<std::endl;
    std::cout << "=== Create Pipeline: New out_shape from top_shapes WxHxC x Dims ===" << out_shape.w << " " << out_shape.h << " " << out_shape.c << " " << out_shape.d <<std::endl;

Output

=== Create Pipeline: ===
=== Create Pipeline: New shape from bottom_shapes WxHxC x Dims ===0 0 0 0
=== Create Pipeline: New out_shape from top_shapes WxHxC x Dims ===0 0 0 0

=== Create Pipeline: padding pipeline pad_left pad_right : ===3 3
=== Create Pipeline: data_packed.create maxk : ===7
=== Create Pipeline: data_packed.create num_input : ===5120
=== Create Pipeline: data_packed.create elempack : ===4
=== Create Pipeline: data_packed.create num_input / elempack : ===1280
=== Create Pipeline: data_packed.create num_output : ===8
=== Create Pipeline: data_packed.create out_elempack : ===4
=== Create Pipeline: data_packed.create num_output / out_elempack : ===2
=== Create Pipeline: data_packed.create (size_t)4 * elempack * out_elempack : ===64
=== Create Pipeline: data_packed.create elempack * out_elempack : ===16
=== Create Pipeline: weight_data WxHxC : === 286720 x 1 x 1
=== Create Pipeline: weight_data WxHxC reshaped : === 7 x 5120 x 8
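
The numbers in this log are consistent with each other, which can be checked with a few lines of arithmetic. The sketch below only restates the log (kernel width 7, 5120 input channels, 8 output channels, elempack 4); the struct and function names are made up, and the `[num_output][num_input][kernel_w]` ordering is an assumption based on the "7 x 5120 x 8" reshape.

```cpp
#include <cassert>
#include <cstddef>

struct Conv1DWeightShape
{
    int kernel_w;   // "maxk" in the log
    int num_input;
    int num_output;
};

// Total number of weights: one kernel_w-tap filter per (input, output) channel pair.
inline size_t weight_count(const Conv1DWeightShape& s)
{
    assert(s.kernel_w > 0 && s.num_input > 0 && s.num_output > 0);
    return (size_t)s.kernel_w * s.num_input * s.num_output;
}

// Flat index of tap k for the given channel pair in an unpacked
// [num_output][num_input][kernel_w] layout (assumed ordering).
inline size_t weight_index(const Conv1DWeightShape& s, int out_ch, int in_ch, int k)
{
    assert(out_ch >= 0 && out_ch < s.num_output);
    assert(in_ch >= 0 && in_ch < s.num_input);
    assert(k >= 0 && k < s.kernel_w);
    return ((size_t)out_ch * s.num_input + in_ch) * s.kernel_w + k;
}

// Number of channel groups when channels are packed elempack at a time.
inline int packed_channels(int channels, int elempack)
{
    assert(elempack > 0 && channels % elempack == 0);
    return channels / elempack;
}
```

With these values, 7 * 5120 * 8 gives the 286720 reported for weight_data, and 5120 / 4 and 8 / 4 give the 1280 and 2 packed groups shown in the log.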

magicse commented 1 year ago

I made Convolution1D for the GPU (for float32 pack1 and pack4 blobs with unpacked weights), and now a 7-second voice phrase is generated in 7 seconds. Good results.

[image: melgram_flipped]

7-second voice phrase without GPU:

------------------------
Inference duration: 177 seconds
Out matrix size W x H = 173312 x 1 number of channels 1

7-second voice phrase with GPU:

------------------------
Inference duration: 7 seconds
Out matrix size W x H = 173312 x 1 number of channels 1

This is realtime.
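
To put the timings above on one scale, a real-time factor (processing time divided by audio duration) is a common way to express them. The helper below is a small sketch with made-up names; the inputs are the whole-second figures reported in this thread, so the results are approximate.

```cpp
#include <cassert>

// Real-time factor: inference time divided by audio duration.
// A value of 1.0 or less means audio is produced at least as fast as it plays.
inline double real_time_factor(double inference_seconds, double audio_seconds)
{
    assert(audio_seconds > 0.0 && inference_seconds >= 0.0);
    return inference_seconds / audio_seconds;
}

// How many times faster the accelerated run is than the baseline run.
inline double speedup(double baseline_seconds, double accelerated_seconds)
{
    assert(accelerated_seconds > 0.0 && baseline_seconds >= 0.0);
    return baseline_seconds / accelerated_seconds;
}
```

With the figures above, the CPU run has a real-time factor of about 25 (177 s for 7 s of audio) and the GPU run about 1, a speedup of roughly 25 times.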

magicse commented 1 year ago

convolution1d_pack4.comp (float 32 pack4 blobs and input unpacked weights)

#version 450

#if NCNN_fp16_storage
#extension GL_EXT_shader_16bit_storage: require
#endif
#if NCNN_fp16_arithmetic
#extension GL_EXT_shader_explicit_arithmetic_types_float16: require
#endif

//#extension GL_EXT_debug_printf : enable
#extension GL_GOOGLE_include_directive: enable
#include "vulkan_activation.comp"

layout (constant_id = 0) const int kernel_w = 1;
layout (constant_id = 1) const int dilation_w = 1;
layout (constant_id = 2) const int stride_w = 1;
layout (constant_id = 3) const int bias_term = 0;
layout (constant_id = 4) const int activation_type = 0;
layout (constant_id = 5) const float activation_param_0 = 0;
layout (constant_id = 6) const float activation_param_1 = 0;

#define shape_constant_id_offset 7
layout (constant_id = shape_constant_id_offset + 0) const int dims = 0;
layout (constant_id = shape_constant_id_offset + 1) const int w = 0;
layout (constant_id = shape_constant_id_offset + 2) const int h = 0;
layout (constant_id = shape_constant_id_offset + 3) const int c = 0;
layout (constant_id = shape_constant_id_offset + 4) const int cstep = 0;

layout (constant_id = shape_constant_id_offset + 5) const int outdims = 0;
layout (constant_id = shape_constant_id_offset + 6) const int outw = 0;
layout (constant_id = shape_constant_id_offset + 7) const int outh = 0;
layout (constant_id = shape_constant_id_offset + 8) const int outc = 0;
layout (constant_id = shape_constant_id_offset + 9) const int outcstep = 0;

#if NCNN_image_shader
layout (binding = 0) uniform unfp sampler3D bottom_blob;
layout (binding = 1, imfmtc4) writeonly uniform unfp image3D top_blob;
layout (binding = 2) uniform unfp sampler3D weight_blob;
layout (binding = 3) uniform unfp sampler3D bias_blob;
#else
//layout (binding = 0) readonly buffer bottom_blob { sfp bottom_blob_data[]; };
layout (binding = 0) readonly buffer bottom_blob { sfpvec4 bottom_blob_data[]; };

//layout (binding = 1) writeonly buffer top_blob { sfp top_blob_data[]; };
layout (binding = 1) writeonly buffer top_blob { sfpvec4 top_blob_data[]; };

//layout (binding = 2) readonly buffer weight_blob { sfp weight_data[]; };
//layout (binding = 3) readonly buffer bias_blob { sfp bias_data[]; };
#if NCNN_fp16_packed || (NCNN_fp16_storage && !NCNN_fp16_arithmetic)
layout (binding = 2) readonly buffer weight_blob { sfpvec4 weight_data[]; };
#else
//layout (binding = 2) readonly buffer weight_blob { sfpmat4 weight_data[]; };
//layout (binding = 2) readonly buffer weight_blob { sfpvec4 weight_data[]; };
layout (binding = 2) readonly buffer weight_blob { sfp weight_data[]; };

#endif
layout (binding = 3) readonly buffer bias_blob { sfpvec4 bias_data[]; };

#endif

layout (push_constant) uniform parameter
{
    int dims;
    int w;
    int h;
    int c;
    int cstep;

    int outdims;
    int outw;
    int outh;
    int outc;
    int outcstep;
} p;

/*
void print_bottblob()
{    
    int gx = int(gl_GlobalInvocationID.x);
    int gy = int(gl_GlobalInvocationID.y);
    int gz = int(gl_GlobalInvocationID.z);
    if (gx >= 1 || gy >= 1 || gz >= 1)
            return;
    //debugPrintfEXT("Hello %i, %i\n", gx, gy);
    for (int i = 0; i < psc(h)/4; ++i) {
        for (int j = 0; j < psc(w); ++j) {
        //for (int j = 0; j < psc(h); ++j) {
        //afp v = buffer_ld1(bottom_blob_data, 3);
        //debugPrintfEXT("Elem %d %d: %f ", i, j, v);

        //debugPrintfEXT("Bot_Blob %d %d: %f ", i, j, bottom_blob_data[i*psc(h)+j]);

        afpvec4 test = buffer_ld4(bottom_blob_data, i*psc(w)+j);
        debugPrintfEXT(" Top_Blob %d %d: %v4f ", i, j, test);

        //afpvec4 value;
        //value = buffer_ld4(bottom_blob_data, i*psc(h)+j );        
        //debugPrintfEXT("Bot_Blob %d %d: %f ", i, j, value);

        }
        debugPrintfEXT("\n");
    }
}

void print_weight()
{    
    int gx = int(gl_GlobalInvocationID.x);
    int gy = int(gl_GlobalInvocationID.y);
    int gz = int(gl_GlobalInvocationID.z);
    if (gx >= 1 || gy >= 1 || gz >= 1)
            return;
    debugPrintfEXT("Hello %i, %i\n", gx, gy);
    for (int i = 0; i < psc(outh)*4; ++i) {
        for (int j = 0; j < psc(outw)*kernel_w; ++j) {
        //afp v = buffer_ld1(bottom_blob_data, 3);
        //debugPrintfEXT("Elem %d %d: %f ", i, j, v);
        debugPrintfEXT("Weight %d %d: %f ", i, j, weight_data[i*psc(outw)*kernel_w+j]);
        //afpvec4 test = buffer_ld4(weight_data, i*psc(outw)+j);
        //debugPrintfEXT(" Weight %d %d: %v4f ", i, j, test);
        }
        debugPrintfEXT("\n");
    }
}

*/

void main()
{

    int gx = int(gl_GlobalInvocationID.x) * 2;
    int gy = int(gl_GlobalInvocationID.y) * 2;
    int gz = int(gl_GlobalInvocationID.z) * 2;

    //print_bottblob();
    //print_weight();

    if (gx >= psc(outw) || gy >= psc(outh) || gz >= psc(outc))
        return;

    const ivec2 gx2 = gx + ivec2(0, 1);
    const ivec2 gy2 = gy + ivec2(0, 1);
    const ivec2 gy4 = gy*4 + ivec2(0, 4);
    const ivec2 gz2 = gz + ivec2(0, 1);

    afpvec4 sum0 = afpvec4(0.0f);
    afpvec4 sum1 = afpvec4(0.0f);
    afpvec4 sum2 = afpvec4(0.0f);
    afpvec4 sum3 = afpvec4(0.0f);   

    afpvec4 sum4 = afpvec4(0.0f);
    afpvec4 sum5 = afpvec4(0.0f);

    afpvec4 sum6 = afpvec4(0.0f);
    afpvec4 sum7 = afpvec4(0.0f);
    afpvec4 sum8 = afpvec4(0.0f);
    afpvec4 sum9 = afpvec4(0.0f);   

    afpvec4 sum10 = afpvec4(0.0f);
    afpvec4 sum11 = afpvec4(0.0f);
    afpvec4 sum12 = afpvec4(0.0f);
    afpvec4 sum13 = afpvec4(0.0f);

    afpvec4 sum14 = afpvec4(0.0f);
    afpvec4 sum15 = afpvec4(0.0f);

    afpvec4 sum16 = afpvec4(0.0f);
    afpvec4 sum17 = afpvec4(0.0f);
    afpvec4 sum18 = afpvec4(0.0f);
    afpvec4 sum19 = afpvec4(0.0f);

    if (bias_term == 1)
    {
#if NCNN_image_shader
        sum = image2d_ld1(bias_blob, ivec2(gx, 0));
#else
        sum4 = buffer_ld4(bias_data, gy2.x);
        sum5 = sum4;
        sum14 = buffer_ld4(bias_data, gy2.y);
        sum15 = sum14;

#endif
    }

#if NCNN_image_shader
    //
#else

            ivec4 gy4_0 = gy4.x + ivec4(0, 1, 2, 3);
            ivec4 gy4_1 = gy4.y + ivec4(0, 1, 2, 3);

            ivec4 w_offsetv4_0;
            ivec4 w_offsetv4_1;
            w_offsetv4_0 = kernel_w * psc(h) * 4 * gy4_0;
            w_offsetv4_1 = kernel_w * psc(h) * 4 * gy4_1;

            for (int iny = 0; iny < psc(h); iny++)
            {

                ivec2 v_offsetv = iny * psc(w) + gx2 * stride_w;

                for (int x = 0; x < kernel_w; x++)
                {

                    afpvec4 v0 = buffer_ld4(bottom_blob_data, v_offsetv.x + x * dilation_w);
                    afpvec4 v1 = buffer_ld4(bottom_blob_data, v_offsetv.y + x * dilation_w);

                    afp k0 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 0); // Load the weight value
                    afp k1 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 1); // Load the weight value
                    afp k2 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 2); // Load the weight value
                    afp k3 = buffer_ld1(weight_data, (w_offsetv4_0.x + x) + kernel_w * 3); // Load the weight value

                    afp k4 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 0); // Load the weight value
                    afp k5 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 1); // Load the weight value
                    afp k6 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 2); // Load the weight value
                    afp k7 = buffer_ld1(weight_data, (w_offsetv4_0.y + x) + kernel_w * 3); // Load the weight value

                    afp k8 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 0); // Load the weight value
                    afp k9 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 1); // Load the weight value
                    afp k10 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 2); // Load the weight value
                    afp k11 = buffer_ld1(weight_data, (w_offsetv4_0.z + x) + kernel_w * 3); // Load the weight value

                    afp k12 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 0); // Load the weight value
                    afp k13 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 1); // Load the weight value
                    afp k14 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 2); // Load the weight value
                    afp k15 = buffer_ld1(weight_data, (w_offsetv4_0.w + x) + kernel_w * 3); // Load the weight value

                    afp k16 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 0); // Load the weight value
                    afp k17 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 1); // Load the weight value
                    afp k18 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 2); // Load the weight value
                    afp k19 = buffer_ld1(weight_data, (w_offsetv4_1.x + x) + kernel_w * 3); // Load the weight value

                    afp k20 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 0); // Load the weight value
                    afp k21 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 1); // Load the weight value
                    afp k22 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 2); // Load the weight value
                    afp k23 = buffer_ld1(weight_data, (w_offsetv4_1.y + x) + kernel_w * 3); // Load the weight value

                    afp k24 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 0); // Load the weight value
                    afp k25 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 1); // Load the weight value
                    afp k26 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 2); // Load the weight value
                    afp k27 = buffer_ld1(weight_data, (w_offsetv4_1.z + x) + kernel_w * 3); // Load the weight value

                    afp k28 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 0); // Load the weight value
                    afp k29 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 1); // Load the weight value
                    afp k30 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 2); // Load the weight value
                    afp k31 = buffer_ld1(weight_data, (w_offsetv4_1.w + x) + kernel_w * 3); // Load the weight value

#if NCNN_fp16_packed || (NCNN_fp16_storage && !NCNN_fp16_arithmetic)
                // GL_EXT_shader_16bit_storage does not define f16mat4 type :(
                afpmat4 k0 = afpmat4(
                    buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 0),
                    buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 1),
                    buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 2),
                    buffer_ld4(weight_data, (w_offsetv.x + x) * 4 + 3)
                );
                afpmat4 k1 = afpmat4(
                    buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 0),
                    buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 1),
                    buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 2),
                    buffer_ld4(weight_data, (w_offsetv.y + x) * 4 + 3)
                );
#else

#endif
                    //debugPrintfEXT(" k0, k1, k2, k3 %f, %f, %f, %f \n", k0, k1, k2, k3);
                    //debugPrintfEXT(" k4, k5, k6, k7 %f, %f, %f, %f \n", k4, k5, k6, k7);
                    sum0 += v0 * afpvec4(k0, k1, k2, k3); //* k0;
                    sum1 += v1 * afpvec4(k0, k1, k2, k3); //* k0;
                    sum2 += v0 * afpvec4(k4, k5, k6, k7); //* k1;
                    sum3 += v1 * afpvec4(k4, k5, k6, k7); //* k1;

                    sum6 += v0 * afpvec4(k8, k9, k10, k11); //* k0;
                    sum7 += v1 * afpvec4(k8, k9, k10, k11); //* k0;
                    sum8 += v0 * afpvec4(k12, k13, k14, k15); //* k1;
                    sum9 += v1 * afpvec4(k12, k13, k14, k15); //* k1;

                    sum10 += v0 * afpvec4(k16, k17, k18, k19); //* k0;
                    sum11 += v1 * afpvec4(k16, k17, k18, k19); //* k0;
                    sum12 += v0 * afpvec4(k20, k21, k22, k23); //* k1;
                    sum13 += v1 * afpvec4(k20, k21, k22, k23); //* k1;

                    sum16 += v0 * afpvec4(k24, k25, k26, k27); //* k0;
                    sum17 += v1 * afpvec4(k24, k25, k26, k27); //* k0;
                    sum18 += v0 * afpvec4(k28, k29, k30, k31); //* k1;
                    sum19 += v1 * afpvec4(k28, k29, k30, k31); //* k1;

                }

                w_offsetv4_0 += kernel_w*4;
                w_offsetv4_1 += kernel_w*4;
            }

            sum4.x += sum0.x + sum0.y + sum0.z + sum0.w;
            sum4.y += sum2.x + sum2.y + sum2.z + sum2.w;
            sum4.z += sum6.x + sum6.y + sum6.z + sum6.w;
            sum4.w += sum8.x + sum8.y + sum8.z + sum8.w;

            sum5.x += sum1.x + sum1.y + sum1.z + sum1.w;
            sum5.y += sum3.x + sum3.y + sum3.z + sum3.w;
            sum5.z += sum7.x + sum7.y + sum7.z + sum7.w;
            sum5.w += sum9.x + sum9.y + sum9.z + sum9.w;

            sum14.x += sum10.x + sum10.y + sum10.z + sum10.w;
            sum14.y += sum12.x + sum12.y + sum12.z + sum12.w;
            sum14.z += sum16.x + sum16.y + sum16.z + sum16.w;
            sum14.w += sum18.x + sum18.y + sum18.z + sum18.w;

            sum15.x += sum11.x + sum11.y + sum11.z + sum11.w;
            sum15.y += sum13.x + sum13.y + sum13.z + sum13.w;
            sum15.z += sum17.x + sum17.y + sum17.z + sum17.w;
            sum15.w += sum19.x + sum19.y + sum19.z + sum19.w;           

#endif  
    sum4 = activation_afpvec4(sum4, activation_type, activation_param_0, activation_param_1);
    sum5 = activation_afpvec4(sum5, activation_type, activation_param_0, activation_param_1);
    sum14 = activation_afpvec4(sum14, activation_type, activation_param_0, activation_param_1);
    sum15 = activation_afpvec4(sum15, activation_type, activation_param_0, activation_param_1);
#if NCNN_image_shader
    image2d_st1(top_blob, ivec3(gx2.x, gy2.x, gz2.x), sum0);
    image2d_st1(top_blob, ivec3(gx2.y, gy2.x, gz2.x), sum1);
    image2d_st1(top_blob, ivec3(gx2.x, gy2.y, gz2.x), sum2);
    image2d_st1(top_blob, ivec3(gx2.y, gy2.y, gz2.x), sum3);
#else
    // (gx, gy) is in range after the early return above; guard only the +1
    // neighbours so the last column/row is still written when outw or outh is odd
    buffer_st4(top_blob_data, gy2.x * psc(outw) + gx2.x, sum4);
    if (gx + 1 < psc(outw)) buffer_st4(top_blob_data, gy2.x * psc(outw) + gx2.y, sum5);
    if (gy + 1 < psc(outh)) buffer_st4(top_blob_data, gy2.y * psc(outw) + gx2.x, sum14);
    if (gy + 1 < psc(outh) && gx + 1 < psc(outw)) buffer_st4(top_blob_data, gy2.y * psc(outw) + gx2.y, sum15);
#endif

}
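
The pack4 shader reads its input as `sfpvec4` elements, so it helps to see what a pack4 blob looks like next to the unpacked one. The sketch below shows the layout as I understand it for a 2-D blob: every 4 consecutive rows are interleaved into one row of 4-float elements. It is an illustration with a made-up function name, not ncnn's packing layer.

```cpp
#include <cassert>
#include <vector>

// Repack an unpacked 2-D blob (h rows of w floats) into a pack4 layout:
// h / 4 rows of w elements, where each element holds 4 floats taken from
// 4 consecutive source rows at the same column.
std::vector<float> pack1_to_pack4(const std::vector<float>& src, int w, int h)
{
    assert(w > 0 && h > 0 && h % 4 == 0);
    assert((int)src.size() == w * h);

    std::vector<float> dst(src.size());
    for (int q = 0; q < h / 4; q++)
    {
        for (int x = 0; x < w; x++)
        {
            for (int i = 0; i < 4; i++)
            {
                // lane i of packed element (q, x) comes from source row q * 4 + i
                dst[(q * w + x) * 4 + i] = src[(q * 4 + i) * w + x];
            }
        }
    }
    return dst;
}
```

Under this layout, one `buffer_ld4(bottom_blob_data, iny * w + x)` in the shader fetches the same sample position from 4 input channels at once, which is why the shader sums the 4 lanes of each accumulator at the end.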

magicse commented 1 year ago

I have finished a working convolution1d_vulkan for fp32, with these options:

opt.use_fp16_packed = false;
opt.use_fp16_storage = false;
opt.use_fp16_arithmetic = false;
opt.use_int8_storage = false;
opt.use_int8_arithmetic = false;
opt.use_int8_packed = false;

Shaders: convolution1d.comp, convolution1d_pack1to4.comp, convolution1d_pack4.comp, convolution1d_pack4to1.comp (in progress)

Inference duration for this mel spectrogram: 5 seconds

melgram_flipped


nihui commented 1 year ago

vulkan conv1d https://github.com/Tencent/ncnn/pull/5060

magicse commented 1 year ago

Hi @nihui, I tried the new conv1d comp shaders and layer (convolution1d_vulkan.cpp, convolution1d_vulkan.h) from https://github.com/Tencent/ncnn/pull/5060/files, and they don't work correctly for me: the output is just noise instead of sound. Try my model ncnn-hifi-GAN: with opt.use_vulkan_compute = true I get noise, and with opt.use_vulkan_compute = false I get sound. With my own fp32 shaders it worked correctly and I got sound: Convolution1D_vulkan.cpp Convolution1D_vulkan.h convolution1d.comp convolution1d_pack1to4.comp convolution1d_pack4.comp

nihui commented 1 year ago

Try disabling fp16.

The following test prints the same result on CPU and GPU:

int main()
{
    ncnn::Net net;

    net.opt.use_vulkan_compute = true;
    // net.opt.use_vulkan_compute = false;

    net.opt.use_fp16_packed = false;
    net.opt.use_fp16_storage = false;
    net.opt.use_fp16_arithmetic = false;

    net.load_param("/home/nihui/osd/ncnn-nihui/mytools/hifivoice.ncnn.param");
    net.load_model("/home/nihui/osd/ncnn-nihui/mytools/hifivoice.ncnn.bin");

    {
        ncnn::Extractor ex = net.create_extractor();
        ex.set_vulkan_compute(false);

        ncnn::Mat in0 = RandomMat(64, 80);

        ex.input("in0", in0);

        ncnn::Mat out0;
        ex.extract("out0", out0);

        fprintf(stderr, "out0 %d %d %d %d %d\n", out0.dims, out0.w, out0.h, out0.d, out0.c);

        fprintf(stderr, "out0 %f %f %f %f %f %f\n", out0[0], out0[1], out0[10], out0[20], out0[200], out0[1020]);
    }

    {
        ncnn::Extractor ex = net.create_extractor();
        ex.set_vulkan_compute(true);

        ncnn::Mat in0 = RandomMat(64, 80);

        ex.input("in0", in0);

        ncnn::Mat out0;
        ex.extract("out0", out0);

        fprintf(stderr, "out0 %d %d %d %d %d\n", out0.dims, out0.w, out0.h, out0.d, out0.c);

        fprintf(stderr, "out0 %f %f %f %f %f %f\n", out0[0], out0[1], out0[10], out0[20], out0[200], out0[1020]);
    }

    return 0;
}
[nihui@nihuini-LC2 mytools]$ ./testnet 
[0 AMD Radeon Graphics (RADV NAVI14)]  queueC=1[4]  queueG=0[1]  queueT=0[1]
[0 AMD Radeon Graphics (RADV NAVI14)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[0 AMD Radeon Graphics (RADV NAVI14)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1
[0 AMD Radeon Graphics (RADV NAVI14)]  subgroup=64  basic/vote/ballot/shuffle=1/1/1/1
[0 AMD Radeon Graphics (RADV NAVI14)]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  queueC=0[1]  queueG=0[1]  queueT=0[1]
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  subgroup=8  basic/vote/ballot/shuffle=1/1/1/1
[1 llvmpipe (LLVM 16.0.6, 256 bits)]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0
out0 2 16384 1 1 1
out0 0.048641 0.074993 -0.038795 0.100711 -0.048371 0.117223
out0 2 16384 1 1 1
out0 0.048641 0.074993 -0.038795 0.100711 -0.048371 0.117224
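
The two runs above agree except in the last printed digit of one sample, so an exact equality check would report a false mismatch. Below is a minimal sketch of a tolerance-based comparison. The helper names `max_abs_diff` and `outputs_match` are hypothetical and not part of ncnn, and the tolerance is an assumption to tune per model.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ncnn: largest absolute element-wise
// difference between two outputs of the same length.
static float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    float worst = 0.f;
    for (std::size_t i = 0; i < a.size(); i++)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}

// True when both outputs have the same length and every element agrees
// within tol (an assumed tolerance, not a value taken from ncnn).
static bool outputs_match(const std::vector<float>& a, const std::vector<float>& b, float tol)
{
    return a.size() == b.size() && max_abs_diff(a, b) <= tol;
}

// The six samples printed in the log above (CPU run first, then GPU run).
static const std::vector<float> cpu_samples = {0.048641f, 0.074993f, -0.038795f, 0.100711f, -0.048371f, 0.117223f};
static const std::vector<float> gpu_samples = {0.048641f, 0.074993f, -0.038795f, 0.100711f, -0.048371f, 0.117224f};
```

Calling `outputs_match(cpu_samples, gpu_samples, 1e-4f)` accepts the two runs as equivalent, while a real failure such as the noise output reported earlier would differ by far more than any reasonable tolerance.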
magicse commented 1 year ago

Hi @nihui, thank you for your work. ncnn is now open to new directions such as sound synthesis, voice conversion, music synthesis and TTS. I checked your code and, as expected, I get correct results.

Z:\AI_SDK\VAE-GAN\HIFIVoice_cpp>hifivoice.exe -i melgram_flipped.jpg

Input option value=melgram_flipped.jpg
path = melgram_flipped.jpgimagepath0: melgram_flipped.jpg
argv[0]: mel
argv[1]: melgram_flipped.jpg
[0 NVIDIA GeForce RTX 3060]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[0 NVIDIA GeForce RTX 3060]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[0 NVIDIA GeForce RTX 3060]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1
[0 NVIDIA GeForce RTX 3060]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1
in0 2 64 80 1 1
out0 2 16384 1 1 1
out0 0.048641 0.074994 -0.038795 0.100711 -0.048369 0.117221
out0 2 16384 1 1 1
out0 0.048641 0.074994 -0.038795 0.100711 -0.048372 0.117223
Max mel magnitude val: 1.89804
Min mel magnitude val: -11
[677 x 80]; ch: 1
MelIn 3 677 80 1 1
Inference duration: 6 seconds
Out matrix size W x H = 173312 x 1 number of channels 1

Final
Z:\AI_SDK\VAE-GAN\HIFIVoice_cpp>

I also found what the problem was: convolution1d with vulkan=true and convolution1d with vulkan=false handle an ncnn::Mat with the wrong number of dimensions differently.

For example, convolution1d expects a 2-dimensional input, and I passed an ncnn::Mat with 3 dimensions.

With vulkan=false, convolution1d treats the 3-dimensional ncnn::Mat as if it were 2-dimensional and gives the correct result, but with vulkan=true it produces the wrong result. My code was like this:

     ncnn::Mat MelIn(melscpectro.cols, melscpectro.rows, 1, (void*)melscpectro.data);
     fprintf(stderr, "MelIn %d %d %d %d %d\n", MelIn.dims, MelIn.w, MelIn.h, MelIn.d, MelIn.c);

and I was getting an erroneous result with vulkan=true because dims=3:

MelIn 3 677 80 1 1

Now I have changed the code to:

     ncnn::Mat MelIn(melscpectro.cols, melscpectro.rows, (void*)melscpectro.data);
     fprintf(stderr, "MelIn %d %d %d %d %d\n", MelIn.dims, MelIn.w, MelIn.h, MelIn.d, MelIn.c);

Output:

MelIn 2 677 80 1 1

and I get the correct result with convolution1d vulkan=true. Thank you again, @nihui, for your work!
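
The dims mismatch described above can be caught before inference with a small shape guard. The sketch below uses `BlobShape`, a hypothetical stand-in that only mimics how the two `ncnn::Mat` constructors in this comment set `dims`; it is not the real class. The helpers `is_conv1d_input` and `squeeze_single_channel` are likewise illustrative names.

```cpp
#include <cassert>

// Hypothetical stand-in for the shape fields of ncnn::Mat; not the real class.
// It mirrors what the logs above show: (w, h) yields dims=2, (w, h, c) yields dims=3.
struct BlobShape
{
    int dims, w, h, c;

    BlobShape(int _w, int _h) : dims(2), w(_w), h(_h), c(1)
    {
        assert(_w > 0 && _h > 0);
    }

    BlobShape(int _w, int _h, int _c) : dims(3), w(_w), h(_h), c(_c)
    {
        assert(_w > 0 && _h > 0 && _c > 0);
    }
};

// Hypothetical guard: the conv1d input in this thread is a 2-D blob.
static bool is_conv1d_input(const BlobShape& s)
{
    return s.dims == 2;
}

// Collapse a 3-D blob with a single channel into the equivalent 2-D shape;
// any other shape is returned unchanged.
static BlobShape squeeze_single_channel(const BlobShape& s)
{
    if (s.dims == 3 && s.c == 1)
        return BlobShape(s.w, s.h);
    return s;
}
```

For the mel input above, `BlobShape(677, 80, 1)` fails the guard and `squeeze_single_channel` turns it into the 2-D shape that passes. With a real `ncnn::Mat`, the equivalent step would be checking `dims` before `ex.input(...)` and rebuilding or reshaping the blob as 2-D; verify the exact call against your ncnn version.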