davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

QUESTION: is yolov3 possible in DLIB #2211

Closed pfeatherstone closed 3 years ago

pfeatherstone commented 3 years ago

I am trying to define yolov3 using dlib's dnn module. I'm stuck on the darknet53 backbone, as I want it to output the outputs of the last three layers. So far I have this:

using namespace dlib;

template <int outc, int kern, int stride, typename SUBNET> 
using conv_block = leaky_relu<affine<con<outc,kern,kern,stride,stride,SUBNET>>>;

template <int inc, typename SUBNET>
using resblock = add_prev1<conv_block<inc,3,1,conv_block<inc/2,1,1,tag1<SUBNET>>>>;

template<int nblocks, int outc, typename SUBNET>
using conv_resblock = repeat<nblocks, resblock<outc,
                      conv_block<outc, 3, 2, SUBNET>>>;

template<typename SUBNET>
using darknet53 = tag3<conv_resblock<4, 1024,
                  tag2<conv_resblock<8, 512,
                  tag1<conv_resblock<8, 256,
                  conv_resblock<2, 128,
                  conv_resblock<1, 64,
                  conv_block<32, 3, SUBNET
                  >>>>>>>>>;

Is it possible for darknet53 to output tag1, tag2 and tag3?

pfeatherstone commented 3 years ago

Actually, having the tags there is enough to get me going and define the rest of the network. BUT, yolov3 has three yolo layers, so unless I apply grid offsets and anchors, permute dimensions, and concatenate everything at the end, the network will have to output three tensors anyway.

arrufat commented 3 years ago

I have never tried it, but if you don't want to do all this reshaping and concatenation madness, and you know the tags of the layers you're interested in, I guess you can always access them directly from the loss layer by doing something like this:

template <
    typename const_label_iterator,
    typename SUBNET
    >
double compute_loss_value_and_gradient (
    const tensor& input_tensor,
    const_label_iterator truth, 
    SUBNET& sub
) const
{
    const tensor& out1 = layer<tag1>(sub).get_output();
    const tensor& out2 = layer<tag2>(sub).get_output();
    const tensor& out3 = layer<tag3>(sub).get_output();
    // ...
}

And then apply the yolo layer to each output.

pfeatherstone commented 3 years ago

Do I need a loss layer if I'm only interested in inference? My goal is to port the weights from darknet to a dlib-defined yolov3 network. If not, can I just tag the output layers I want, forward the input through the network, and then get the outputs I want using layer<tagx>(sub).get_output()?

davisking commented 3 years ago

Right. You would write your loss function so it goes and grabs the tags you are interested in.

But if you don't want to train then yeah. Just access the layer you want and look at its outputs.
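
For inference only, something like this should do (a minimal untested sketch, assuming a net_type alias for the darknet53 definition above):

using net_type = darknet53<input_rgb_image>;

net_type net;
matrix<rgb_pixel> img;
load_image(img, "dog.jpg"); // any test image

resizable_tensor x;
net.to_tensor(&img, &img + 1, x); // convert the input to a tensor
net.forward(x);                   // run the network; no loss layer needed

const tensor& out1 = layer<tag1>(net).get_output();
const tensor& out2 = layer<tag2>(net).get_output();
const tensor& out3 = layer<tag3>(net).get_output();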

pfeatherstone commented 3 years ago

Just noticed that repeat only takes a template<typename> class as the repeated layer. So it's not letting me use it with resblock, as that has template <int inc, typename SUBNET> as its template signature. Have I missed something?

pfeatherstone commented 3 years ago

All the examples that use repeat have the template<typename> class signature

arrufat commented 3 years ago

Yes, the repeat layer only takes a template <typename SUBNET> class. You can have a look at my definition of the Darknet53 backbone here, where I predefine some things to be able to use them with the repeat layer.

pfeatherstone commented 3 years ago

So I have this so far:

using namespace dlib;

template <template <typename> class BN>
struct yolo
{
    template <int outc, int kern, int stride, typename SUBNET> 
    using conv_block = leaky_relu<BN<con<outc,kern,kern,stride,stride,SUBNET>>>;

    template <int outc, typename SUBNET>
    using resblock = add_prev1<conv_block<outc,3,1,conv_block<outc/2,1,1,tag1<SUBNET>>>>;

    template <typename SUBNET> using res1024 = resblock<1024,SUBNET>;
    template <typename SUBNET> using res512  = resblock<512,SUBNET>;
    template <typename SUBNET> using res256  = resblock<256,SUBNET>;
    template <typename SUBNET> using res128  = resblock<128,SUBNET>;

    template <typename SUBNET> using block5 = repeat<4,res1024, conv_block<1024,3,2,SUBNET>>;
    template <typename SUBNET> using block4 = repeat<8,res512,  conv_block<512,3,2,SUBNET>>;
    template <typename SUBNET> using block3 = repeat<8,res256,  conv_block<256,3,2,SUBNET>>;
    template <typename SUBNET> using block2 = repeat<2,res128,  conv_block<128,3,2,SUBNET>>;
    template <typename SUBNET> using block1 = resblock<64,conv_block<64,3,2,SUBNET>>;

    using darknet53 = tag1<block5<
                      tag2<block4<
                      tag3<block3<
                      block2<
                      block1<
                      conv_block<32,3,1, 
                      input_rgb_image
                      >>>>>>>>>;    

    template<int outc, int nclasses, int tag, int yolo_tag, typename SUBNET>
    using detection_block = add_tag_layer<yolo_tag, con<3*(nclasses + 5), 1, 1, 1, 1,   //conv7 - yolo output
                            conv_block<outc,   3, 1,                                    //conv6
                            add_tag_layer<tag, conv_block<outc/2, 1, 1,                 //conv5 - branch output
                            conv_block<outc,   3, 1,                                    //conv4
                            conv_block<outc/2, 1, 1,                                    //conv3
                            conv_block<outc,   3, 1,                                    //conv2
                            conv_block<outc/2, 1, 1,                                    //conv1
                            SUBNET
                            >>>>>>>>>;

    template<int nclasses>
    using yolov3 =
            detection_block<256,nclasses,8,12,  //8 is the branch tag (don't care here), 12 is a yolo tag
            concat2<skip7, skip3,                 //concat last layer with tag3 from darknet backbone
            tag7<upsample<2,
            conv_block<128, 1, 1,
            skip6<
            detection_block<512,nclasses,6,11,  //6 is the branch tag, 11 is a yolo tag
            concat2<skip5, skip2,                 //concat last layer with tag2 from darknet backbone
            tag5<upsample<2,
            conv_block<256, 1, 1,
            skip4<                              //pick branch with tag 4
            detection_block<1024,nclasses,4,10, //4 is the branch tag, 10 is a yolo_tag
            skip1<
            darknet53
            >>>>>>>>>>>>>>;
};

This compiles. That's progress. The API is hurting my brain a bit though.

pfeatherstone commented 3 years ago

@arrufat @davisking Is there a way to turn bias off in conv_block? Since conv_block has a batch normalisation layer, which already has a bias term, we don't want double biases.

arrufat commented 3 years ago

@arrufat @davisking Is there a way to turn bias off in conv_block? Since conv_block has a batch normalisation layer, which already has a bias term, we don't want double biases.

Yes! That feature was added not that long ago in #2156, you just do:

set_all_bn_inputs_no_bias(net);

And it will do it automatically for the whole network.

pfeatherstone commented 3 years ago

Will it do the same to affine layers?

pfeatherstone commented 3 years ago

I'm not training, simply porting weights from darknet. So I don't need to use bn_con layers.

arrufat commented 3 years ago

concat_ layers need tags as inputs. It compiles for me with this change:

concat2<tag7, tag3, SUBNET

pfeatherstone commented 3 years ago

Cheers, thank you. Getting closer to working.

arrufat commented 3 years ago

Will it do the same to affine layers?

No, that visitor only works with bn_ layers that have either con_ or fc_ layers as inputs.

arrufat commented 3 years ago

Cheers, thank you. Getting closer to working.

I am very interested in this if you manage to deserialize the darknet weights and make them work with dlib.

Also, check the paddings of the 3x3 convolutions with a stride of 2. They are 0 by default in dlib, but they need to be 1 in yolo. That is why I defined this.
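
For reference, a padded stride-2 block can be written with con_'s full template parameter list (the last two arguments are padding_y and padding_x), something like this sketch (con3d is just an illustrative name):

// 3x3 stride-2 convolution with explicit padding of 1; the con alias does
// not expose the padding parameters, so use add_layer with con_ directly
template <long num_filters, typename SUBNET>
using con3d = add_layer<con_<num_filters, 3, 3, 2, 2, 1, 1>, SUBNET>;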

pfeatherstone commented 3 years ago

Presumably, to port the weights, I will have to use a visitor?

pfeatherstone commented 3 years ago

Is there a layer for permuting dimensions? Can extract be used? I need to go from a tensor of shape 1x255x13x13 to 1x3x85x13x13, then to 1x13x13x3x85.

pfeatherstone commented 3 years ago

Also to get the exact same results as darknet, we need a layer similar to upsample that uses a "nearest" method, not bilinear interpolation.

arrufat commented 3 years ago

Presumably, to port the weights, i will have to use a visitor?

Yes, at least that's how I would approach it, in particular I would use visit_layers_backwards.

Is there a layer for permuting dimensions?

You can try with extract_ + some extra manipulation.
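
For instance (an untested sketch): the 1x255x13x13 output is contiguous in [channel][row][col] order with channel = anchor*85 + attribute, so reinterpreting it as 3x85x169 is a pure reshape, which extract can express. The final permute to 1x13x13x3x85 would still need a manual copy.

// reinterpret 1 x 255 x 13 x 13 as 3 x 85 x (13*13): extract copies the
// contiguous range starting at the given offset into a k x nr x nc tensor
template <typename SUBNET>
using reshape_yolo13 = extract<0, 3, 85, 13 * 13, SUBNET>;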

arrufat commented 3 years ago

I have a WIP project where I try to implement YOLOv1 (as a start) but haven't been very active lately. You can check it out: https://github.com/arrufat/yolo-dlib

EDIT: it's still WIP and it doesn't work, although the training runs...

pfeatherstone commented 3 years ago

I can do the reshaping, grid offsets, and anchor post-processing using pointer arithmetic and the like (see the sketch below). The only thing left to do is porting the weights. This is all an experiment to benchmark yolov3 with dlib. Defining a loss function for yolov3 in dlib is going to be too hard, and you can train in darknet or pytorch anyway.
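
For the decode step, I mean roughly this kind of thing (a sketch; anchors, net_w and net_h stand for the usual darknet values at each scale):

// decode one 13x13 scale: out has shape 1 x 3*85 x 13 x 13, memory order
// [channel][row][col] with channel = anchor*85 + attribute
const float* p = out.host();
const long A = 3, C = 85, N = 13;
const auto sigmoid = [](float v) { return 1.f / (1.f + std::exp(-v)); };
const auto at = [&](long a, long c, long y, long x) {
    return p[((a * C + c) * N + y) * N + x];
};
for (long a = 0; a < A; ++a)
for (long y = 0; y < N; ++y)
for (long x = 0; x < N; ++x)
{
    const float bx = (x + sigmoid(at(a, 0, y, x))) / N;               // grid offset
    const float by = (y + sigmoid(at(a, 1, y, x))) / N;
    const float bw = anchors[a].w * std::exp(at(a, 2, y, x)) / net_w; // anchor scaling
    const float bh = anchors[a].h * std::exp(at(a, 3, y, x)) / net_h;
    const float obj = sigmoid(at(a, 4, y, x));                        // objectness
    // class scores: sigmoid(at(a, 5 + k, y, x)) for each class k
}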

pfeatherstone commented 3 years ago

Oh, and there is the disabling of biases in affine layers, and implementing a "nearest" method for the upsample layer. So 3 things to do.

arrufat commented 3 years ago

You can disable bias for affine layers easily using the new style visitor with a lambda.
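
Something along these lines (an untested sketch; the is_affine trait is my own helper, not dlib API):

// a small trait to recognize affine layers inside a generic lambda
template <typename T> struct is_affine : std::false_type {};
template <typename SUBNET>
struct is_affine<add_layer<affine_, SUBNET>> : std::true_type {};

visit_layers(net, [](size_t, auto& l) {
    if constexpr (is_affine<std::decay_t<decltype(l)>>::value)
    {
        // adjust l.layer_details() here, e.g. rebuild the affine_ from a
        // bn_ whose beta has been zeroed (affine_ is constructible from bn_)
    }
});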

pfeatherstone commented 3 years ago

Ok. Is there an example of this? Also is there a way of setting avg_red, avg_green and avg_blue for input_rgb_image layer?

pfeatherstone commented 3 years ago

Actually, I've just seen the code for bn_con; it looks fine.

arrufat commented 3 years ago

Ok. Is there an example of this? Also is there a way of setting avg_red, avg_green and avg_blue for input_rgb_image layer?

https://github.com/davisking/dlib/blob/a1f158379e2f328e8697b63ad653926594c8a771/examples/dnn_dcgan_train_ex.cpp#L135

You should read the documentation of the input layers.

pfeatherstone commented 3 years ago

It's possible i've missed something in the docs for the input layers. I've just used this instead:

struct input_rgb_image_zero_means : input_rgb_image
{
    input_rgb_image_zero_means() : input_rgb_image(0, 0, 0) {}
};

arrufat commented 3 years ago

You should also read dnn_introduction2_ex. You will learn that you can initialize the layers of a network by passing them when constructing the network, like this:

net_type net(input_rgb_image(0, 0, 0));

pfeatherstone commented 3 years ago

Ah, ok, fair enough. Though using input_rgb_image_zero_means makes it impossible to use incorrectly.

pfeatherstone commented 3 years ago

@arrufat The API doesn't expose the layer parameters reliably. Indeed, the get_layer_params function for the affine_ layer spits back empty_params. So I can't set the weights using get_layer_params. It looks like I have to serialize some weights to a temporary stream, then call deserialize on that layer using that stream. What do you think?

I have the following visitor:

struct darknet_visitor
{
    darknet_visitor(const char* darknet_weights)
    :   w(darknet_weights, std::ios::binary)
    {
        assert(w.is_open());
        // darknet .weights header: int32 major, minor, revision, then a
        // 64-bit "images seen" counter for versions >= 0.2, else 32-bit
        int32_t major, minor, dummy;
        int64_t dummy2;
        w.read(reinterpret_cast<char*>(&major), sizeof(major));
        w.read(reinterpret_cast<char*>(&minor), sizeof(minor));
        w.read(reinterpret_cast<char*>(&dummy), sizeof(dummy));
        if ((major * 10 + minor) >= 2 && major < 1000 && minor < 1000)
            w.read(reinterpret_cast<char*>(&dummy2), sizeof(dummy2));
        else
            w.read(reinterpret_cast<char*>(&dummy), sizeof(dummy));
        cout << "weights file major " << major << " minor " << minor << endl;
    }

    // catch-all for layers with nothing to port
    template<typename T>
    void operator()(size_t idx, T& t)
    {
    }

    template <typename SUBNET>
    void operator()(size_t idx, add_layer<affine_, SUBNET>& l)
    {
        cout << "affine layer " << idx << endl;
        auto& bn    = l.layer_details();
        auto& conv  = l.subnet().layer_details();
        // darknet stores, in order:
        // 1. bn bias
        // 2. bn weight
        // 3. bn running mean
        // 4. bn running var
        // 5. conv weight
//        tensor& bn_t = bn.get_layer_params(); // THIS IS EMPTY BECAUSE affine_ spits back empty_params
        stringstream ss;
        ss << ...;
        deserialize(bn.get_layer_params(), ss);
        ss << ...;
        deserialize(conv.get_layer_params(), ss);
    }

    template <
        long outc,
        long nr,
        long nc,
        int sy,
        int sx,
        int py,
        int px,
        typename SUBNET
        >
    void operator()(size_t idx, add_layer<con_<outc,nr,nc,sy,sx,py,px>,SUBNET>& l)
    {
        auto& conv = l.layer_details();

        if (!conv.bias_is_disabled())
        {
            cout << "con layer " << idx << endl;
            // darknet stores:
            // 1. conv bias
            // 2. conv weight
            stringstream ss;
            ss << ...;
            deserialize(conv.get_layer_params(), ss);
        }
    }

    std::ifstream w;
};

which I call using:

visit_layers_backwards(net, darknet_visitor("yolov3.weights"));

pfeatherstone commented 3 years ago

It looks like get_layer_params() for bn_ and con_ returns params. So maybe I have to first define a model using bn_con, then do the porting of weights, then replace all the bn_con layers with affine. Hmm, getting complicated.

arrufat commented 3 years ago

After declaring the network, you can forward some dummy input to initialize the params of the layers, then run the visitor.
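
E.g. (sketch):

// one dummy forward pass allocates and initializes every layer's params
matrix<rgb_pixel> dummy(416, 416);
resizable_tensor x;
net.to_tensor(&dummy, &dummy + 1, x);
net.forward(x);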

pfeatherstone commented 3 years ago

But that still doesn't solve the problem with affine_. Do you have to use bn_ first to port the weights? Also, since all the weights are alias_tensor types that use params for storage, and get_layer_params returns params, it's not entirely obvious how to port the weights into params. Do you suggest using serialize and deserialize? Or maybe there should be new functionality added to the dnn module to make all this possible. For example, have a port_weights visitor type, designed for this use case, which is made a friend type for all layers. Then we would have access to all the underlying tensors.

arrufat commented 3 years ago

I would not use serialize/deserialize for this. I would do something like:

auto& params = l.get_layer_params();
float* p = params.host();

And then read the weights from the yolo file and store them in p. However, I did not check in which order the weights are stored in darknet; you have to check that and skip or reshape to your needs.
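
That is, roughly (a sketch; weights_file is the std::ifstream over the .weights file):

// one binary read straight into the layer's parameter tensor; this assumes
// the file layout matches dlib's params layout at this point, which you
// have to verify
auto& params = l.layer_details().get_layer_params();
float* p = params.host();
weights_file.read(reinterpret_cast<char*>(p), params.size() * sizeof(float));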

pfeatherstone commented 3 years ago

Yep, so I know exactly how the weights are stored in darknet format. The problem with what you suggest is that using auto& params = l.get_layer_params(); for the affine_ layer will not work, since it returns an empty tensor that is never used.

pfeatherstone commented 3 years ago

Furthermore, params is used as a storage tensor. The actual weights inside the layer classes are all alias_tensor types. So setting params correctly is very difficult.

arrufat commented 3 years ago

alias_tensor is just a view into the tensor, for example to access the weights of the convolution kernel and the biases more easily. But everything is stored in the same tensor returned by get_layer_params(), as far as I know. If you initialize the network with a dummy input, then get_layer_params() for the affine layer should not be empty, and if you print its values, you will see some ones (gamma) followed by some zeros (beta). https://github.com/davisking/dlib/blob/a1f158379e2f328e8697b63ad653926594c8a771/dlib/dnn/layers.h#L2166-L2185
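
For illustration, the views work like this (num_channels stands for whatever the layer's channel count is):

// alias_tensor instances are just shaped views at an offset into params
alias_tensor gamma(1, num_channels);
alias_tensor beta(1, num_channels);
auto g = gamma(params, 0);             // gamma occupies the front of params
auto b = beta(params, gamma.size());   // beta follows immediately after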

pfeatherstone commented 3 years ago

If you look inside layers.h, you will see this:

const tensor& get_layer_params() const { return empty_params; }
tensor& get_layer_params()             { return empty_params; }

for affine_

pfeatherstone commented 3 years ago

And empty_params is never set

arrufat commented 3 years ago

Oh, I skipped that. So then you need to define yolo with the template parameter set to bn_con, load the weights, and then assign it to the yolo model declared with affine:

yolo<bn_con>::yolov3 net;

// visitor

yolo<affine>::yolov3 net2(net);

And in the visitor you should initialize the missing values from yolo to something sensible.

pfeatherstone commented 3 years ago

Ok, thought so. That's what I was going on about a few comments ago. Thank you.

arrufat commented 3 years ago

In my head affine had a learnable gamma and beta, but it turns out it doesn't, sorry about that.

pfeatherstone commented 3 years ago

This is what I have so far. It compiles but doesn't work. Porting the weights shows that the correct number of bytes is read from the file, so it looks like the network structure is correct and the interpretation of the weights is correct. But I could be wrong. Maybe there's an error with endianness. Not sure. Please try it and see if you can spot the errors.

main.cpp.txt

pfeatherstone commented 3 years ago

Warning: it takes roughly 60 seconds to compile main.cpp. Sigh...

arrufat commented 3 years ago

I have been able to build it and run it, but at first glance I didn't see anything odd... I'll have another look later. Thanks for sharing :)

pfeatherstone commented 3 years ago

The detections are all wrong, so I'm a bit stuck as to where the errors are. The model size is correct, and the visitor is reading the correct number of bytes. @arrufat if you find a fix, please post it.

pfeatherstone commented 3 years ago

Possibly need to inspect the output of every layer and compare side by side with either darknet or pytorch implementation.
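
E.g. forward the same image through both implementations and dump a few values per tagged output:

// print the first few floats of a tagged output to diff against darknet
const tensor& t = layer<tag1>(net).get_output();
const float* v = t.host();
for (size_t i = 0; i < 10; ++i)
    cout << v[i] << ' ';
cout << '\n';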

pfeatherstone commented 3 years ago

template <
    layer_mode bnmode
    >
affine_(
    const bn_<bnmode>& item
)
{
    gamma = item.gamma;
    beta = item.beta;
    mode = bnmode;

    params.copy_size(item.params);

    auto g = gamma(params,0);
    auto b = beta(params,gamma.size());

    resizable_tensor temp(item.params);
    auto sg = gamma(temp,0);
    auto sb = beta(temp,gamma.size());

    g = pointwise_divide(mat(sg), sqrt(mat(item.running_variances)+item.get_eps()));
    b = mat(sb) - pointwise_multiply(mat(g), mat(item.running_means));
}

Why is this happening:

g = pointwise_divide(mat(sg), sqrt(mat(item.running_variances)+item.get_eps()));
b = mat(sb) - pointwise_multiply(mat(g), mat(item.running_means));

??

This could be my problem. I can't set running_variances or running_means, since get_layer_params in bn_con only gives me gamma and beta.

pfeatherstone commented 3 years ago

Ok, fixed it. I had to manually adjust gamma and beta using running_variances and running_means. All works now. Here is the code:

main.cpp.txt
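
For reference, the fix is the standard batchnorm folding that the affine_ constructor above performs; after reading the darknet bn parameters, apply this per channel (the arrays here are illustrative):

// at inference, y = gamma*(x - mean)/sqrt(var + eps) + beta collapses to
// y = g*x + b, so fold the running statistics into the affine parameters:
for (long i = 0; i < num_channels; ++i)
{
    g[i] = gamma[i] / std::sqrt(running_var[i] + eps);
    b[i] = beta[i] - g[i] * running_mean[i];
}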

Now if someone could write the training code with a loss function that uses GIOU, DIOU and CIOU losses, that would be great :) :) (@arrufat ??) Implementing GIOU and company in a framework that supports auto-grad is trivial. Since in dlib you have to write the backward passes manually, I'm likely to make some mistakes with all those derivatives.