facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

how to convert ssd/yolo to caffe2 #509

Open 408550969 opened 7 years ago

408550969 commented 7 years ago

Can SSD be converted from Caffe to Caffe2? Are there any specific tutorials?

KleinYuan commented 7 years ago

@408550969

I actually managed to do tiny-yolo via translating tiny-yolo's caffemodel to caffe2 pb files last night.

First, someone has already published Caffe models of the YOLO networks here. Then, you probably also need to resolve this issue in the translator, as done in the PR. Last, run the translator.

I haven't had a chance to test how accurate the converted file is, since I am still at the stage of trying to fit those models into Caffe2-iOS.

408550969 commented 7 years ago

Thanks!

408550969 commented 7 years ago

@KleinYuan When I convert tiny-yolo.voc.weights from Darknet to Caffe, there is a problem. The output is:

Transpose fc layers:False
converting conv1
converting conv2
converting conv3
converting conv4
converting conv5
converting conv6
converting conv7
converting conv8
converting conv9
Converted 15867885 weights.

and nothing is generated. Have you met this problem?

KleinYuan commented 7 years ago

@408550969 hey man, I haven't had time to do the Darknet weights --> Caffe model part on my own; instead I directly used these ones.

KeyKy commented 7 years ago

I ported NormOp and PriorBoxOp from SSD Caffe to Caffe2 (forward pass only). These two operators have only been tested lightly and may have bugs somewhere; test examples are in the folder. Now I am struggling to implement the detection_layer.

408550969 commented 7 years ago

@KleinYuan Sorry to disturb you again. I have solved the earlier problem and can now run yolo-tiny on Caffe. When I convert YOLO to Caffe2, I see this error: ImportError: No module named caffe2.python. I know it is a PYTHONPATH issue. I copied D:\caffe2-master\caffe2 to E:\anaconda\Lib\site-packages, but it still can't run, and I couldn't find caffe2_pybind11_state.dll. How is it generated?

408550969 commented 7 years ago

@KeyKy Thank you, looking forward to your progress!

KeyKy commented 7 years ago

I have finished the detection_output_op and the caffe_translator for translating VGG_VOC0712_SSD_300x300_ft_iter_120000.caffemodel to Caffe2; there may be bugs somewhere. I have visualized one example of the Caffe2 output. If you find new bugs, please fix them and let me know. This code is just for my own interest in Caffe2, and sorry that it is messy. I will continue to make updates in my spare time.


Update: everything has moved to here.

tianyangma commented 7 years ago

@408550969 caffe2_pybind11_state.dll can be located at $ROOT\build\caffe2\python\Release

408550969 commented 7 years ago

@keyky When I convert SSD from Caffe to Caffe2, there is a problem: google.protobuf.text_format.ParseError: 814:3 : Message type "caffe.LayerParameter" has no field named "norm_param".

408550969 commented 7 years ago

@tianyangma Thanks

KeyKy commented 7 years ago

@408550969 Did you make install caffe2?

408550969 commented 7 years ago

@KeyKy I've installed caffe2 on Ubuntu

KeyKy commented 7 years ago

@408550969 Because I modified caffe.proto, you should git clone my caffe2 and run make install. If you want to use the latest caffe2, you can check my commits and make the modifications yourself.

408550969 commented 7 years ago

@KeyKy Sorry, I was a bit busy over the past few days. The problem has been successfully resolved. Thanks for your help! In addition, I have a presumptuous request: can you give me some guidance on how to modify the network? Do you have tutorials in this field? I want to write a paper on this subject, but I don't know how to modify it, and I can't find a similar tutorial on the internet. Right now I can only change the deploy file.

hudongloop commented 7 years ago

@KeyKy I rebuilt your "convert ssd/yolo to caffe2" code on a Raspberry Pi, and it works well. But when I convert to the Caffe2 deploy.pb and model.pb, the log shows:

WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: No module named caffe2_pybind11_state_gpu
INFO:caffe_translator:Translate layer conv1_1
INFO:caffe_translator:Translate layer relu1_1
INFO:caffe_translator:Translate layer conv1_2
INFO:caffe_translator:Translate layer relu1_2
INFO:caffe_translator:Translate layer pool1
INFO:caffe_translator:Translate layer conv2_1
INFO:caffe_translator:Translate layer relu2_1
INFO:caffe_translator:Translate layer conv2_2
INFO:caffe_translator:Translate layer relu2_2
INFO:caffe_translator:Translate layer pool2
INFO:caffe_translator:Translate layer conv3_1
INFO:caffe_translator:Translate layer relu3_1
INFO:caffe_translator:Translate layer conv3_2
INFO:caffe_translator:Translate layer relu3_2
INFO:caffe_translator:Translate layer conv3_3
INFO:caffe_translator:Translate layer relu3_3
INFO:caffe_translator:Translate layer pool3
INFO:caffe_translator:Translate layer conv4_1
INFO:caffe_translator:Translate layer relu4_1
INFO:caffe_translator:Translate layer conv4_2
INFO:caffe_translator:Translate layer relu4_2
INFO:caffe_translator:Translate layer conv4_3
INFO:caffe_translator:Translate layer relu4_3
INFO:caffe_translator:Translate layer pool4
INFO:caffe_translator:Translate layer conv5_1
INFO:caffe_translator:Translate layer relu5_1
INFO:caffe_translator:Translate layer conv5_2
Killed

Is this a GPU problem? Must it be built with GPU support? Thanks!

KeyKy commented 7 years ago

@huodng Yes. For now my code only supports Caffe2 with GPU. It seems that you did not build Caffe2 with GPU support.

gjm441 commented 7 years ago

@408550969 I have run into the same problem. How did you resolve it? google.protobuf.text_format.ParseError: 814:3 : Message type "caffe.LayerParameter" has no field named "norm_param".

fbadaud commented 7 years ago

Hello @keyky, I am looking at your solution in order to add the missing elements to my Caffe2 installation, so that I can convert the SSD model on CPU only. Can you advise me a bit on what I should do to make this update?

For the moment I have just copied your operators/norm_op. and prior_box_op to my caffe2/operators/ and then rebuilt with the make command.

When I try to convert again, it still does not understand prior_box and norm from the deploy.prototxt.

KeyKy commented 7 years ago

@fbadaud maybe you should add something in caffe.proto

fbadaud commented 7 years ago

Thanks a lot @Keyky, I added the modifications from your caffe/proto/caffe.proto and recompiled Caffe2. Now the error on prior_box_op is solved, but a new one occurs on nms_threshold. Anyway, your advice and your work on the missing files were very useful; I am looking at the new error now.

fbadaud commented 7 years ago

Hi, I found a solution to this last issue by modifying deploy.prototxt, adding the following to detection_output_param:

nms_param { nms_threshold: 0.45 top_k: 400 }

So, to summarize, in addition to this:

Now I am able to convert the VGG-based SSD model from Caffe to Caffe2.

KeyKy commented 7 years ago

@fbadaud You are heading in the right direction. However, I cannot solve the problem of running SSD on both CPU and GPU.

RailWar commented 6 years ago

I made some changes in the code, and now the whole model runs in the CUDA context. Only one thing remains: the output of DetectionOutputOp stays in the CPU context, and part of the code uses bbox_util.cc (bbox_util.cu is empty). My work is continuing.

RailWar commented 6 years ago

Here are my units (some changes are needed for MSVC). There are some problems with the formatting: I used "Insert code", but the first lines are not in the box.

prior_box_op.h

`#ifndef PRIOR_BOX_OPH
#define PRIOR_BOX_OPH

#include "caffe2/core/context.h"
#include "caffe2/core/logging.h"
#include "caffe2/core/operator.h"
#include "caffe2/utils/math.h"

namespace caffe2 {

template <typename T, class Context>
class PriorBoxOp final : public Operator<Context> {
public:
    USE_OPERATOR_CONTEXT_FUNCTIONS;
    PriorBoxOp(const OperatorDef& operator_def, Workspace* ws)
        : Operator<Context>(operator_def, ws),
        min_sizes_(OperatorBase::GetRepeatedArgument<float>("min_sizes")),
        max_sizes_(OperatorBase::GetRepeatedArgument<float>("max_sizes")),
        aspect_ratios_(OperatorBase::GetRepeatedArgument<float>("aspect_ratios")),
        flip_(OperatorBase::GetSingleArgument<bool>("flip", true)),
        clip_(OperatorBase::GetSingleArgument<bool>("clip", false)),
        variance_(OperatorBase::GetRepeatedArgument<float>("variance")),
        img_size_(OperatorBase::GetSingleArgument<int>("img_size", 0)),
        img_w_(OperatorBase::GetSingleArgument<int>("img_w", 0)),
        img_h_(OperatorBase::GetSingleArgument<int>("img_h", 0)),
        step_(OperatorBase::GetSingleArgument<float>("step", 0.)),
        step_h_(OperatorBase::GetSingleArgument<float>("step_h", 0.)),
        step_w_(OperatorBase::GetSingleArgument<float>("step_w", 0.)),
        offset_(OperatorBase::GetSingleArgument<float>("offset", 0.5)),
        order_(StringToStorageOrder(
            OperatorBase::GetSingleArgument<string>("order", "NCHW"))) {

        // check order
        CAFFE_ENFORCE_EQ(order_,
            StorageOrder::NCHW, "Only NCHW order is supported right now.");

        // check min_sizes
        CAFFE_ENFORCE_GT(min_sizes_.size(), 0);
        for (int i = 0; i < min_sizes_.size(); i++) {
            CAFFE_ENFORCE_GT(min_sizes_[i], 0);
        }

        // check max_sizes
        if (max_sizes_.size() > 0)
            CAFFE_ENFORCE_EQ(min_sizes_.size(), max_sizes_.size());

        // set aspts_
        aspts_.clear();
        aspts_.push_back(1.);
        for (int i = 0; i < aspect_ratios_.size(); i++) {
            float ar = aspect_ratios_[i];
            bool already_exist = false;
            for (int j = 0; j < aspts_.size(); j++) {
                if (fabs(ar - aspts_[j]) < 1e-6) {
                    already_exist = true;
                    break;
                }
            }
            if (!already_exist) {
                aspts_.push_back(ar);
                if (flip_) {
                    aspts_.push_back(1. / ar);
                }
            }
        }

        // set num_priors and check max_sizes
        // (aspect_ratios_.size + 1) * min_sizes_.size + max_sizes_.size
        num_priors_ = aspts_.size() * min_sizes_.size();
        if (max_sizes_.size() > 0) {
            CAFFE_ENFORCE_EQ(max_sizes_.size(), min_sizes_.size());
            for (int i = 0; i < max_sizes_.size(); i++) {
                CAFFE_ENFORCE_GT(max_sizes_[i], min_sizes_[i]);
                num_priors_ += 1;
            }
        }

        // check and set variance
        if (variance_.size() > 1) {
            CAFFE_ENFORCE_EQ(variance_.size(), 4);
            for (int i = 0; i < variance_.size(); i++) {
                CAFFE_ENFORCE_GT(variance_[i], 0);
            }
        }
        else if (variance_.size() == 1) {
            CAFFE_ENFORCE_GT(variance_[0], 0);
        }
        else {
            variance_.push_back(0.1f);
        }

        // check img_size and set img_h_,img_w_
        if (img_h_ != 0 || img_w_ != 0) {
            CAFFE_ENFORCE_EQ(img_size_, 0);
            CAFFE_ENFORCE_GT(img_h_, 0);
            CAFFE_ENFORCE_GT(img_w_, 0);
        }
        else if (img_size_ != 0) {
            CAFFE_ENFORCE_GT(img_size_, 0);
            img_h_ = img_size_;
            img_w_ = img_size_;
        }
        else {
            img_h_ = 0;
            img_w_ = 0;
        }

        // check step and set step_h_, step_w_
        if (step_h_ != 0. || step_w_ != 0.) {
            CAFFE_ENFORCE_EQ(step_, 0.);
            CAFFE_ENFORCE_GT(step_h_, 0.);
            CAFFE_ENFORCE_GT(step_w_, 0.);
        }
        else if (step_ != 0.) {
            CAFFE_ENFORCE_GT(step_, 0.);
            step_h_ = step_;
            step_w_ = step_;
        }
        else {
            step_h_ = 0.;
            step_w_ = 0.;
        }

    }
    bool RunOnDevice() override;

protected:
    vector<float> min_sizes_;
    vector<float> max_sizes_;
    vector<float> aspect_ratios_;
    vector<float> aspts_;

    bool flip_;
    int num_priors_;
    bool clip_;
    vector<float> variance_;

    int img_size_;
    int img_w_;
    int img_h_;
    float step_;
    float step_w_;
    float step_h_;
    float offset_;
    StorageOrder order_;
};

} // namespace caffe2

#endif // PRIOR_BOX_OPH

`
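The constructor above derives `num_priors_` from the expanded aspect-ratio list. As a rough standalone illustration of that counting rule (plain C++, no Caffe2 dependency), the sketch below mirrors the logic. The helper names `ExpandAspectRatios` and `CountPriorsPerCell` are hypothetical and are not part of the posted operator.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the posted operator: expands the configured
// aspect ratios the same way the constructor fills aspts_.
std::vector<float> ExpandAspectRatios(const std::vector<float>& aspect_ratios,
                                      bool flip) {
    std::vector<float> aspts{1.0f};  // ratio 1 is always present
    for (float ar : aspect_ratios) {
        bool already_exist = false;
        for (float seen : aspts) {
            if (std::fabs(ar - seen) < 1e-6f) {
                already_exist = true;
                break;
            }
        }
        if (!already_exist) {
            aspts.push_back(ar);
            if (flip) {
                aspts.push_back(1.0f / ar);  // flipped ratio, e.g. 2 -> 0.5
            }
        }
    }
    return aspts;
}

// Hypothetical helper: number of prior boxes generated per feature-map cell,
// i.e. aspts.size() * min_sizes.size() + max_sizes.size().
int CountPriorsPerCell(const std::vector<float>& min_sizes,
                       const std::vector<float>& max_sizes,
                       const std::vector<float>& aspect_ratios,
                       bool flip) {
    assert(!min_sizes.empty());
    assert(max_sizes.empty() || max_sizes.size() == min_sizes.size());
    const std::vector<float> aspts = ExpandAspectRatios(aspect_ratios, flip);
    return static_cast<int>(aspts.size() * min_sizes.size() + max_sizes.size());
}
```

For example, one min_size, one max_size, and aspect ratio 2 with flip enabled expand to the ratios {1, 2, 0.5}, which gives 3 + 1 = 4 priors per cell.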

RailWar commented 6 years ago

prior_box_op.cc

`#include "caffe2/operators/prior_box_op.h"

namespace caffe2 {

template <typename T, class CPUContext>
bool PriorBoxOp<T, CPUContext>::RunOnDevice() {
    const auto& X = Input(0);
    const auto& data = Input(1);
    auto* Y = Output(0);
    //auto* Y = OperatorBase::Output<TensorCPU>(0);

    const int layer_height = X.dim(2);
    const int layer_width = X.dim(3);
    int img_height, img_width;
    if (img_h_ == 0 || img_w_ == 0) {
        img_height = data.dim(2);
        img_width = data.dim(3);
    }
    else {
        img_height = img_h_;
        img_width = img_w_;
    }
    float step_w, step_h;
    if (step_w_ == 0 || step_h_ == 0) {
        step_h = static_cast<float>(img_height) / layer_height;
        step_w = static_cast<float>(img_width) / layer_width;
    }
    else {
        step_h = step_h_;
        step_w = step_w_;
    }
    int dim = layer_height * layer_width * num_priors_ * 4;
    Y->Resize(1, 2, dim);

    T* top_data = Y->template mutable_data<T>();

    int idx = 0;
    float center_x, center_y;
    float box_width, box_height;
    int min_size_, max_size_;
    float ar;
    for (int h = 0; h < layer_height; ++h) {
        for (int w = 0; w < layer_width; ++w) {
            center_x = (w + offset_) * step_w;
            center_y = (h + offset_) * step_h;
            for (int s = 0; s < min_sizes_.size(); ++s) {
                min_size_ = min_sizes_[s];
                box_width = box_height = min_size_ / 2.;
                top_data[idx++] = (center_x - box_width) / img_width;
                top_data[idx++] = (center_y - box_height) / img_height;
                top_data[idx++] = (center_x + box_width) / img_width;
                top_data[idx++] = (center_y + box_height) / img_height;

                if (max_sizes_.size() > 0) {
                    //CAFFE_ENFORCE_EQ(min_sizes_.size(), max_sizes_.size());
                    max_size_ = max_sizes_[s];
                    box_width = box_height = sqrt(min_size_ * max_size_) / 2.;
                    top_data[idx++] = (center_x - box_width) / img_width;
                    top_data[idx++] = (center_y - box_height) / img_height;
                    top_data[idx++] = (center_x + box_width) / img_width;
                    top_data[idx++] = (center_y + box_height) / img_height;
                }

                for (int r = 1; r < aspts_.size(); ++r) {
                    ar = aspts_[r];
                    box_width = min_size_ * sqrt(ar) / 2.;
                    box_height = min_size_ / sqrt(ar) / 2.;
                    top_data[idx++] = (center_x - box_width) / img_width;
                    top_data[idx++] = (center_y - box_height) / img_height;
                    top_data[idx++] = (center_x + box_width) / img_width;
                    top_data[idx++] = (center_y + box_height) / img_height;
                }
            }
        }
    }

    if (clip_) {
        for (int d = 0; d < dim; ++d) {
            top_data[d] = std::min<T>(std::max<T>(top_data[d], 0.), 1.);
        }
    }

    // second part
    top_data += dim;
    if (variance_.size() == 1) {
        math::Set<T, CPUContext>(dim, float(variance_[0]), top_data, &context_);
    }
    else {
        int count = 0;
        for (int i = 0; i < layer_height*layer_width*num_priors_; ++i) {
            for (int j = 0; j < 4; ++j) {
                top_data[count] = static_cast<T>(variance_[j]);
                ++count;
            }
        }
    }

    return true;
}

    REGISTER_CPU_OPERATOR(PriorBox, PriorBoxOp<float, CPUContext>);
    NO_GRADIENT(PriorBox);

    OPERATOR_SCHEMA(PriorBox)
        .NumInputs(2)
        .NumOutputs(1)
        .TensorInferenceFunction([](const OperatorDef& def, const vector<TensorShape>& in) {
        ArgumentHelper helper(def);
        const StorageOrder order = StringToStorageOrder(
            helper.GetSingleArgument<string>("order", "NCHW"));
        int layer_height = in[0].dims(2); int layer_width = in[0].dims(3);

        vector<float> aspects = helper.GetRepeatedArgument<float>("aspect_ratios");
        vector<float> min_sizes = helper.GetRepeatedArgument<float>("min_sizes");
        vector<float> max_sizes = helper.GetRepeatedArgument<float>("max_sizes");
        bool flip = helper.GetSingleArgument<bool>("flip", true);
        vector<float> aspts;

        aspts.push_back(1);
        for (int i = 0; i < aspects.size(); i++) {
            float ar = aspects[i];
            bool already_exist = false;
            for (int j = 0; j < aspts.size(); j++) {
                if (fabs(ar - aspts[j]) < 1e-6) {
                    already_exist = true;
                    break;
                }
            }
            if (!already_exist) {
                aspts.push_back(ar);
                if (flip) {
                    aspts.push_back(1. / ar);
                }
            }
        }

        int num_priors_ = aspts.size() * min_sizes.size() + max_sizes.size();

        return vector<TensorShape>{
            CreateTensorShape(vector<int>{1, 2, layer_width * layer_height * num_priors_ * 4},
                in[0].data_type()
            )};
    })
        .SetDoc(R"DOC(PriorBoxLayer in SSD)DOC")
        .Arg("min_sizes", "repeated min_sizes")
        .Arg("max_sizes", "optional max_sizes")
        .Arg("aspect_ratios", "repeated aspect ratios")
        .Arg("flip", "1 / ar")
        .Arg("clip", "clip box")
        .Arg("variance", "prior variance")
        .Arg("img_size", "image size")
        .Arg("img_w", "image width")
        .Arg("img_h", "image height")
        .Arg("step", "feature stride")
        .Arg("step_h", "step height")
        .Arg("step_w", "step width")
        .Arg("offset", "offset")
        .Arg("order", "NCHW")
        .Input(0, "X", "NCHW tensor")
        .Input(1, "data", "NCHW input tensor")
        .Output(0, "Y", "prior boxes");

} // namespace caffe2 `
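To make the box arithmetic in the CPU `RunOnDevice` easier to follow, here is a minimal standalone sketch of the boxes generated for a single feature-map cell and a single min_size. `Box`, `MakeBox`, and `PriorsForCell` are hypothetical names introduced only for illustration. Unlike the posted code, the sketch keeps the sizes as floats instead of truncating them to `int`, and it treats `max_size <= 0` as "no max_size"; both are assumptions of this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Box {
    float xmin, ymin, xmax, ymax;  // normalized by the image size
};

// Hypothetical helper: box of size box_w x box_h centred on (cx, cy).
Box MakeBox(float cx, float cy, float box_w, float box_h,
            float img_w, float img_h) {
    return Box{(cx - box_w / 2) / img_w, (cy - box_h / 2) / img_h,
               (cx + box_w / 2) / img_w, (cy + box_h / 2) / img_h};
}

// Hypothetical helper: priors for cell (w, h) in the order used by the
// operator: min_size box, optional sqrt(min*max) box, then aspts[1..].
// aspts must start with 1, as in the operator.
std::vector<Box> PriorsForCell(int w, int h, float step_w, float step_h,
                               float offset, float img_w, float img_h,
                               float min_size, float max_size,
                               const std::vector<float>& aspts) {
    assert(!aspts.empty());
    const float cx = (w + offset) * step_w;
    const float cy = (h + offset) * step_h;
    std::vector<Box> boxes;
    boxes.push_back(MakeBox(cx, cy, min_size, min_size, img_w, img_h));
    if (max_size > 0) {
        const float side = std::sqrt(min_size * max_size);
        boxes.push_back(MakeBox(cx, cy, side, side, img_w, img_h));
    }
    for (std::size_t r = 1; r < aspts.size(); ++r) {
        const float root = std::sqrt(aspts[r]);
        boxes.push_back(
            MakeBox(cx, cy, min_size * root, min_size / root, img_w, img_h));
    }
    return boxes;
}
```

With a 320x320 image, a step of 8, and an offset of 0.5, cell (0, 0) is centred at (4, 4), so a min_size of 32 yields the box (-0.0375, -0.0375, 0.0625, 0.0625) before any clipping.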

RailWar commented 6 years ago

prior_box_op.cu

`#include "caffe2/core/context_gpu.h"
#include "caffe2/operators/prior_box_op.h"

#include <thrust/device_vector.h>

namespace caffe2 {

namespace {

    // Template structure to pass to kernel
    template <typename T>
    struct KernelArray
    {
        T*  _array;
        int _size;
    };

    // Function to convert device_vector to structure
    template <typename T>
    KernelArray<T> convertToKernel(thrust::device_vector<T>& dVec)
    {
        KernelArray<T> kArray;
        kArray._array = thrust::raw_pointer_cast(&dVec[0]);
        kArray._size = (int)dVec.size();

        return kArray;
    }

    template <typename T>
    __global__ void fill1Kernel(
        const float offset_,
        const float step_w,
        const float step_h,
        const int layer_height,
        const int layer_width,
        const int img_height,
        const int img_width,
        KernelArray<float> min_sizes,
        KernelArray<float> max_sizes,
        KernelArray<float> aspts,
        KernelArray<float> variance,
        bool clip,
        T* top_data) {
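        // Grid/block layout:
        //   blockIdx.x  - output part (0: box coordinates, 1: variances)
        //   blockIdx.y  - row of the feature map (0..layer_height-1)
        //   blockIdx.z  - column of the feature map (0..layer_width-1)
        //   threadIdx.x - index into min_sizes
        //   threadIdx.y - prior within one min_size:
        //                 1 + (max_sizes._size > 0 ? 1 : 0) + (aspts._size - 1) values
        //   threadIdx.z - box coordinate (xmin, ymin, xmax, ymax)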
        int idx = threadIdx.z + blockDim.z * 
                 (threadIdx.y + blockDim.y * 
                 (threadIdx.x + blockDim.x * 
                 ( blockIdx.z +  gridDim.z *
                 ( blockIdx.y +  gridDim.y * blockIdx.x))));

        if (blockIdx.x == 0) {
            float center_x, center_y;
            float box_width, box_height;
            T value;

            center_x = (blockIdx.z + offset_) * step_w;
            center_y = (blockIdx.y + offset_) * step_h;
            if (threadIdx.y == 0) {  // min_size
                box_width = box_height = min_sizes._array[threadIdx.x] / 2;
            } 
            else if ((threadIdx.y == 1) && (max_sizes._size > 0)) { // max_size
                box_width = box_height = sqrt(min_sizes._array[threadIdx.x] * max_sizes._array[threadIdx.x]) / 2;
            } 
            else { // aspts[1..aspts._size-1]
                float ar = aspts._array[threadIdx.y - (max_sizes._size > 0 ? 1 : 0)];
                box_width = min_sizes._array[threadIdx.x] * sqrt(ar) / 2;
                box_height = min_sizes._array[threadIdx.x] / sqrt(ar) / 2;
            }
            switch (threadIdx.z) {
            case 0: value = (center_x - box_width) / img_width;   break;
            case 1: value = (center_y - box_height) / img_height; break;
            case 2: value = (center_x + box_width) / img_width;   break;
            case 3: value = (center_y + box_height) / img_height; break;
            }
            top_data[idx] = clip ? (value < 0 ? 0 : (value > 1 ? 1 : value)) : value;
        }
        else {
            if (variance._size == 1)
                top_data[idx] = variance._array[0];
            else
                top_data[idx] = variance._array[threadIdx.z];
        }
    }

}// namespace

template <typename T, class CUDAContext>
bool PriorBoxOp<T, CUDAContext>::RunOnDevice() {
    const auto& X = Input(0);
    const auto& data = Input(1);
    auto* Y = Output(0);
    //auto* Y = OperatorBase::Output<TensorCPU>(0);

    const int layer_height = X.dim(2);
    const int layer_width = X.dim(3);
    int img_height, img_width;
    if (img_h_ == 0 || img_w_ == 0) {
        img_height = data.dim(2);
        img_width = data.dim(3);
    }
    else {
        img_height = img_h_;
        img_width = img_w_;
    }
    float step_w, step_h;
    if (step_w_ == 0 || step_h_ == 0) {
        step_h = static_cast<float>(img_height) / layer_height;
        step_w = static_cast<float>(img_width) / layer_width;
    }
    else {
        step_h = step_h_;
        step_w = step_w_;
    }
    int dim = layer_height * layer_width * num_priors_ * 4;
    Y->Resize(1, 2, dim);

    // auto* top_data = Y->template mutable_data<T>();
    T* top_data = Y->template mutable_data<T>();

    thrust::device_vector<float> d_min_sizes(min_sizes_);
    thrust::device_vector<float> d_max_sizes(max_sizes_);
    thrust::device_vector<float> d_aspts(aspts_);
    thrust::device_vector<float> d_variance(variance_);
    dim3 blockPerGrid(2, layer_height, layer_width);
    dim3 threadsPerBlock(min_sizes_.size(), num_priors_/ min_sizes_.size(), 4);
    fill1Kernel<T><<<
        blockPerGrid,
        threadsPerBlock,
        0,
        context_.cuda_stream() >>> (
            offset_,
            step_w,
            step_h,
            layer_height,
            layer_width,
            img_height,
            img_width,
            convertToKernel(d_min_sizes),
            convertToKernel(d_max_sizes),
            convertToKernel(d_aspts),
            convertToKernel(d_variance),
            clip_,
            top_data
            );

    return true;
}

REGISTER_CUDA_OPERATOR(PriorBox, PriorBoxOp<float, CUDAContext>);

}

`
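The kernel above relies on its flattened `idx` reproducing the layout that the CPU loop writes sequentially (part, row, column, min_size, prior, coordinate). A small host-side check of that mapping might look like the following; `FlatIndex` and `CoversOutputExactlyOnce` are hypothetical helpers written in plain C++, not CUDA, and only emulate the index arithmetic.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side mirror of the idx computation in fill1Kernel.
// part <-> blockIdx.x, h <-> blockIdx.y, w <-> blockIdx.z,
// s <-> threadIdx.x, p <-> threadIdx.y, c <-> threadIdx.z.
// H, W: feature-map size; S: number of min_sizes; P: priors per min_size.
int FlatIndex(int part, int h, int w, int s, int p, int c,
              int H, int W, int S, int P) {
    return c + 4 * (p + P * (s + S * (w + W * (h + H * part))));
}

// Returns true when every element of the (1, 2, H*W*S*P*4) output is written
// exactly once by the emulated launch.
bool CoversOutputExactlyOnce(int H, int W, int S, int P) {
    const int total = 2 * H * W * S * P * 4;
    std::vector<int> hits(total, 0);
    for (int part = 0; part < 2; ++part)
        for (int h = 0; h < H; ++h)
            for (int w = 0; w < W; ++w)
                for (int s = 0; s < S; ++s)
                    for (int p = 0; p < P; ++p)
                        for (int c = 0; c < 4; ++c) {
                            const int idx =
                                FlatIndex(part, h, w, s, p, c, H, W, S, P);
                            if (idx < 0 || idx >= total) return false;
                            ++hits[idx];
                        }
    for (int n : hits) {
        if (n != 1) return false;
    }
    return true;
}
```

Note that this launch shape puts S * P * 4 threads in each block and maps the feature-map height and width to gridDim.y and gridDim.z, so it presumably only works while those values stay within the device's block and grid limits.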

RailWar commented 6 years ago

detection_output_op.cu

`#include "caffe2/core/context_gpu.h"
#include "caffe2/operators/detection_output_op.h"

namespace caffe2 {

template <typename Dtype>
__global__ void DecodeBBoxesKernel(const int nthreads,
    const Dtype* loc_data, const Dtype* prior_data,
    const CodeType code_type, const bool variance_encoded_in_target,
    const int num_priors, const bool share_location,
    const int num_loc_classes, const int background_label_id,
    const bool clip_bbox, Dtype* bbox_data) {
    CUDA_1D_KERNEL_LOOP(index, nthreads) {
        const int i = index % 4;
        const int c = (index / 4) % num_loc_classes;
        const int d = (index / 4 / num_loc_classes) % num_priors;
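        if (!share_location && c == background_label_id) {
            // Ignore background class if not share_location. Use continue rather
            // than return so this thread still handles its remaining indices in
            // the grid-stride loop.
            continue;
        }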
        const int pi = d * 4;
        const int vi = pi + num_priors * 4;
        if (code_type == ::caffe::PriorBoxParameter_CodeType_CORNER) {
            if (variance_encoded_in_target) {
                // variance is encoded in target, we simply need to add the offset
                // predictions.
                bbox_data[index] = prior_data[pi + i] + loc_data[index];
            }
            else {
                // variance is encoded in bbox, we need to scale the offset accordingly.
                bbox_data[index] =
                    prior_data[pi + i] + loc_data[index] * prior_data[vi + i];
            }
        }
        else if (code_type == ::caffe::PriorBoxParameter_CodeType_CENTER_SIZE) {
            const Dtype p_xmin = prior_data[pi];
            const Dtype p_ymin = prior_data[pi + 1];
            const Dtype p_xmax = prior_data[pi + 2];
            const Dtype p_ymax = prior_data[pi + 3];
            const Dtype prior_width = p_xmax - p_xmin;
            const Dtype prior_height = p_ymax - p_ymin;
            const Dtype prior_center_x = (p_xmin + p_xmax) / 2.;
            const Dtype prior_center_y = (p_ymin + p_ymax) / 2.;

            const Dtype xmin = loc_data[index - i];
            const Dtype ymin = loc_data[index - i + 1];
            const Dtype xmax = loc_data[index - i + 2];
            const Dtype ymax = loc_data[index - i + 3];

            Dtype decode_bbox_center_x, decode_bbox_center_y;
            Dtype decode_bbox_width, decode_bbox_height;
            if (variance_encoded_in_target) {
                // variance is encoded in target, we simply need to restore the offset
                // predictions.
                decode_bbox_center_x = xmin * prior_width + prior_center_x;
                decode_bbox_center_y = ymin * prior_height + prior_center_y;
                decode_bbox_width = exp(xmax) * prior_width;
                decode_bbox_height = exp(ymax) * prior_height;
            }
            else {
                // variance is encoded in bbox, we need to scale the offset accordingly.
                decode_bbox_center_x =
                    prior_data[vi] * xmin * prior_width + prior_center_x;
                decode_bbox_center_y =
                    prior_data[vi + 1] * ymin * prior_height + prior_center_y;
                decode_bbox_width =
                    exp(prior_data[vi + 2] * xmax) * prior_width;
                decode_bbox_height =
                    exp(prior_data[vi + 3] * ymax) * prior_height;
            }

            switch (i) {
            case 0:
                bbox_data[index] = decode_bbox_center_x - decode_bbox_width / 2.;
                break;
            case 1:
                bbox_data[index] = decode_bbox_center_y - decode_bbox_height / 2.;
                break;
            case 2:
                bbox_data[index] = decode_bbox_center_x + decode_bbox_width / 2.;
                break;
            case 3:
                bbox_data[index] = decode_bbox_center_y + decode_bbox_height / 2.;
                break;
            }
        }
        else if (code_type == ::caffe::PriorBoxParameter_CodeType_CORNER_SIZE) {
            const Dtype p_xmin = prior_data[pi];
            const Dtype p_ymin = prior_data[pi + 1];
            const Dtype p_xmax = prior_data[pi + 2];
            const Dtype p_ymax = prior_data[pi + 3];
            const Dtype prior_width = p_xmax - p_xmin;
            const Dtype prior_height = p_ymax - p_ymin;
            Dtype p_size;
            if (i == 0 || i == 2) {
                p_size = prior_width;
            }
            else {
                p_size = prior_height;
            }
            if (variance_encoded_in_target) {
                // variance is encoded in target, we simply need to add the offset
                // predictions.
                bbox_data[index] = prior_data[pi + i] + loc_data[index] * p_size;
            }
            else {
                // variance is encoded in bbox, we need to scale the offset accordingly.
                bbox_data[index] =
                    prior_data[pi + i] + loc_data[index] * prior_data[vi + i] * p_size;
            }
        }
        else {
            // Unknown code type.
        }
        if (clip_bbox) {
            bbox_data[index] = max(min(bbox_data[index], Dtype(1.)), Dtype(0.));
        }
    }
}

template <typename Dtype>
__global__ void PermuteDataKernel(const int nthreads,
    const Dtype* data, const int num_classes, const int num_data,
    const int num_dim, Dtype* new_data) {
    CUDA_1D_KERNEL_LOOP(index, nthreads) {
        const int i = index % num_dim;
        const int c = (index / num_dim) % num_classes;
        const int d = (index / num_dim / num_classes) % num_data;
        const int n = index / num_dim / num_classes / num_data;
        const int new_index = ((n * num_classes + c) * num_data + d) * num_dim + i;
        new_data[new_index] = data[index];
    }
}

template<>
bool DetectionOutputOp<float, CUDAContext>::RunOnDevice() {
    auto& loc = Input(0);
    auto& conf = Input(1);
    auto& prior = Input(2);
    auto* Y = OperatorBase::Output<TensorCPU>(0);

    const float* loc_data = loc.data<float>();
    const float* conf_data = conf.data<float>();
    const float* prior_data = prior.data<float>();
    const int num = loc.dim(0);

    bbox_preds_.ResizeLike(loc);
    if (!share_location_) {
        bbox_permute_.ResizeLike(loc);
    }
    conf_permute_.ResizeLike(conf);

    float* bbox_data = bbox_preds_.mutable_data<float>();
    const int loc_count = bbox_preds_.size();
    const bool clip_bbox = false;
    const int num_priors_ = prior.dim(2) / 4;
    CAFFE_ENFORCE_EQ(num_priors_ * num_loc_classes_ * 4, loc.dim(1));
    CAFFE_ENFORCE_EQ(num_priors_ * num_classes_, conf.dim(1));

    DecodeBBoxesKernel<float> <<<
        CAFFE_GET_BLOCKS(loc_count),
        CAFFE_CUDA_NUM_THREADS,
        0,
        context_.cuda_stream() >>>
        (loc_count, loc_data, prior_data, code_type_, variance_encoded_in_target_,
            num_priors_, share_location_, num_loc_classes_, background_label_id_,
            clip_bbox, bbox_data);

    // Assigned in both branches below; no allocation is needed here (the
    // original malloc'ed buffer was never used or freed).
    const float* bbox_cpu_data = nullptr;
    Tensor<CPUContext> bbox_device_data;
    if (!share_location_) {
        float* bbox_permute_data = bbox_permute_.mutable_data<float>();
        PermuteDataKernel<float> <<<
            CAFFE_GET_BLOCKS(loc_count),
            CAFFE_CUDA_NUM_THREADS, 0,
            context_.cuda_stream() >>>
            (loc_count, bbox_data, num_loc_classes_, num_priors_, 4, bbox_permute_data);
        bbox_device_data.CopyFrom(bbox_permute_, &context_);
        bbox_cpu_data = bbox_device_data.data<float>();
    }
    else {
        bbox_device_data.CopyFrom(bbox_preds_, &context_);
        bbox_cpu_data = bbox_device_data.data<float>();
    }

    // Retrieve all confidences.
    float* conf_permute_data = conf_permute_.mutable_data<float>();
    const int conf_count = conf.size();
    PermuteDataKernel<float> <<<
        CAFFE_GET_BLOCKS(conf_count),
        CAFFE_CUDA_NUM_THREADS, 0,
        context_.cuda_stream() >>>
        (conf_count, conf_data, num_classes_, num_priors_, 1, conf_permute_data);
    Tensor<CPUContext> conf_device_data;
    conf_device_data.CopyFrom(conf_permute_, &context_);
    const float* conf_cpu_data = conf_device_data.data<float>();

    // CopyFrom uses cudaMemcpyAsync, so wait here until the kernels and the
    // copies have finished before reading the data on the host.
    context_.FinishDeviceComputation();

    int num_kept = 0;
    vector<map<int, vector<int> > > all_indices;
    for (int i = 0; i < num; ++i) {
        map<int, vector<int> > indices;
        int num_det = 0;
        const int conf_idx = i * num_classes_ * num_priors_;
        int bbox_idx;
        if (share_location_) {
            bbox_idx = i * num_priors_ * 4;
        }
        else {
            bbox_idx = conf_idx * 4;
        }
        for (int c = 0; c < num_classes_; ++c) {
            if (c == background_label_id_) {
                // Ignore background class.
                continue;
            }
            const float* cur_conf_data = conf_cpu_data + conf_idx + c * num_priors_;
            const float* cur_bbox_data = bbox_cpu_data + bbox_idx;
            if (!share_location_) {
                cur_bbox_data += c * num_priors_ * 4;
            }
            ApplyNMSFast(cur_bbox_data, cur_conf_data, num_priors_,
                confidence_threshold_, nms_threshold_, eta_, top_k_, &(indices[c])); //==
            num_det += indices[c].size();
        }
        if (keep_top_k_ > -1 && num_det > keep_top_k_) {
            vector<pair<float, pair<int, int> > > score_index_pairs;
            for (map<int, vector<int> >::iterator it = indices.begin();
                it != indices.end(); ++it) {
                int label = it->first;
                const vector<int>& label_indices = it->second;
                for (int j = 0; j < label_indices.size(); ++j) {
                    int idx = label_indices[j];
                    float score = conf_cpu_data[conf_idx + label * num_priors_ + idx];
                    score_index_pairs.push_back(std::make_pair(
                        score, std::make_pair(label, idx)));
                }
            }
            // Keep top k results per image.
            std::sort(score_index_pairs.begin(), score_index_pairs.end(),
                SortScorePairDescend<pair<int, int> >); //==
            score_index_pairs.resize(keep_top_k_);
            // Store the new indices.
            map<int, vector<int> > new_indices;
            for (int j = 0; j < score_index_pairs.size(); ++j) {
                int label = score_index_pairs[j].second.first;
                int idx = score_index_pairs[j].second.second;
                new_indices[label].push_back(idx);
            }
            all_indices.push_back(new_indices);
            num_kept += keep_top_k_;
        }
        else {
            all_indices.push_back(indices);
            num_kept += num_det;
        }
    }
    float* top_data;
    if (num_kept == 0) {
        LOG(INFO) << "Couldn't find any detections";
        Y->Resize(1, 1, num, 7);
        top_data = Y->mutable_data<float>();
        // context_ is a CUDAContext, but math::Set<float, CPUContext> expects
        // a CPUContext*, so 0 (a null context pointer) is passed instead.
        math::Set<float, CPUContext>(num * 7, float(-1), top_data, 0);
        for (int i = 0; i < num; ++i) {
            top_data[0] = i;
            top_data += 7;
        }
    }
    else {
        Y->Resize(1, 1, num_kept, 7);
        top_data = Y->mutable_data<float>();
    }

    int count = 0;
    for (int i = 0; i < num; ++i) {
        const int conf_idx = i * num_classes_ * num_priors_;
        int bbox_idx;
        if (share_location_) {
            bbox_idx = i * num_priors_ * 4;
        }
        else {
            bbox_idx = conf_idx * 4;
        }
        for (map<int, vector<int> >::iterator it = all_indices[i].begin();
            it != all_indices[i].end(); ++it) {
            int label = it->first;
            vector<int>& indices = it->second;
            const float* cur_conf_data = conf_cpu_data + conf_idx + label * num_priors_;
            const float* cur_bbox_data = bbox_cpu_data + bbox_idx;
            if (!share_location_) {
                cur_bbox_data += label * num_priors_ * 4;
            }
            for (int j = 0; j < indices.size(); ++j) {
                int idx = indices[j];
                top_data[count * 7] = i;
                top_data[count * 7 + 1] = label;
                top_data[count * 7 + 2] = cur_conf_data[idx];
                for (int k = 0; k < 4; ++k) {
                    top_data[count * 7 + 3 + k] = cur_bbox_data[idx * 4 + k];
                }
                count++;
            }
        }
    }

    return true;
}

REGISTER_CUDA_OPERATOR(DetectionOutput, DetectionOutputOp<float, CUDAContext>);

} // namespace caffe2 `
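
For readers following the `keep_top_k_` branch above, here is a minimal standalone sketch of that step on the CPU, assuming the same score layout (`label * num_priors + idx`). `KeepTopK` and `ScoredDetection` are hypothetical names for illustration only, not part of caffe2_SSD, and the sketch uses `std::stable_sort` with a lambda in place of `SortScorePairDescend`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One surviving detection: its score plus (label, prior index).
using ScoredDetection = std::pair<float, std::pair<int, int>>;

// Hypothetical helper (not part of caffe2_SSD): keeps the keep_top_k
// highest-scoring detections across all classes of one image.
// `indices` maps label -> prior indices that survived NMS;
// `scores` is laid out as [label * num_priors + prior_index].
std::map<int, std::vector<int>> KeepTopK(
    const std::map<int, std::vector<int>>& indices,
    const std::vector<float>& scores,
    int num_priors,
    int keep_top_k) {
  assert(keep_top_k >= 0 && num_priors > 0);
  std::vector<ScoredDetection> pairs;
  for (const auto& entry : indices) {
    const int label = entry.first;
    for (int idx : entry.second) {
      pairs.push_back(std::make_pair(
          scores[label * num_priors + idx], std::make_pair(label, idx)));
    }
  }
  // Sort by descending score; stable_sort keeps ties in insertion order.
  std::stable_sort(pairs.begin(), pairs.end(),
      [](const ScoredDetection& a, const ScoredDetection& b) {
        return a.first > b.first;
      });
  if (pairs.size() > static_cast<std::size_t>(keep_top_k)) {
    pairs.resize(keep_top_k);
  }
  // Regroup the kept detections by label.
  std::map<int, std::vector<int>> kept;
  for (const auto& p : pairs) {
    kept[p.second.first].push_back(p.second.second);
  }
  return kept;
}
```

Calling `KeepTopK(indices, scores, num_priors, keep_top_k)` once per image gives the same kind of `label -> indices` map that the operator pushes into `all_indices`.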

z0yuu commented 6 years ago

@KeyKy Hello! I cloned your code and ran make install, but the build fails with this error:
[ 80%] Linking CXX executable ../bin/operator_test ../lib/libcaffe2_gpu.so:undefined refrence to symbol caffe2::CAFFE2_PLEASE_ADD_OPERATOR_SCHEMA_FOR_DetectionOutput()’ collect2: error: ld returned 1 exit status caffe2/CMakeFiles/fixed_divisor_test.dir/b

This has troubled me for a long time. Could you tell me how to deal with it? Thank you.
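
A note on the symbol in that error: as far as I understand Caffe2's macros, `REGISTER_CPU_OPERATOR` and `REGISTER_CUDA_OPERATOR` only declare and reference `CAFFE2_PLEASE_ADD_OPERATOR_SCHEMA_FOR_<name>()`, while `OPERATOR_SCHEMA(<name>)` is what defines it. So an undefined reference usually means no `OPERATOR_SCHEMA(DetectionOutput)` was compiled into the library. The standalone sketch below imitates that mechanism with hypothetical `DEMO_*` macros; it is an illustration, not Caffe2's actual code.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for Caffe2's schema registry (illustration only).
std::set<std::string>& DemoSchemaRegistry() {
  static std::set<std::string> registry;
  return registry;
}

bool DemoAddSchema(const std::string& name) {
  return DemoSchemaRegistry().insert(name).second;
}

// Returns true when a schema was recorded for `name`.
bool DemoHasSchema(const std::string& name) {
  return DemoSchemaRegistry().count(name) == 1;
}

// Schema macro: DEFINES the marker function and records the schema.
#define DEMO_OPERATOR_SCHEMA(name)                     \
  void DEMO_PLEASE_ADD_OPERATOR_SCHEMA_FOR_##name() {} \
  static const bool demo_schema_added_##name = DemoAddSchema(#name)

// Registration macro: only DECLARES the marker function and calls it.
// Without a matching DEMO_OPERATOR_SCHEMA the linker cannot resolve it.
#define DEMO_REGISTER_OPERATOR(name)                        \
  extern void DEMO_PLEASE_ADD_OPERATOR_SCHEMA_FOR_##name(); \
  void DemoTouchSchema_##name() {                           \
    DEMO_PLEASE_ADD_OPERATOR_SCHEMA_FOR_##name();           \
  }

// Removing the next line reproduces an "undefined reference" at link time.
DEMO_OPERATOR_SCHEMA(DetectionOutput);
DEMO_REGISTER_OPERATOR(DetectionOutput)
```

In the real project, the equivalent would be a line such as `OPERATOR_SCHEMA(DetectionOutput).NumInputs(3).NumOutputs(1);` in a `.cc` file that is built into the library. The input and output counts there are my assumption from the operator code above.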

zhangw864680355 commented 6 years ago

@KeyKy Hi, I tried the code you wrote in caffe2_SSD/caffe2/python/examples/ssd/visualize_caffe2_implementation_det.ipynb (https://github.com/KeyKy/caffe2_SSD/blob/master/caffe2/python/examples/ssd/visualize_caffe2_implementation_det.ipynb), but with a YOLO model in Caffe2. Unfortunately, I do not see the result or the boxes in the picture. I tried adding plt.imshow(image), plt.savefig("test.jpg") and plt.show() at the end, but that does nothing. How do you get the result and boxes to show in the picture? Maybe something is wrong on my side. Can you help me? Thanks!