facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

workspace.Predictor() getting only features #1476

Open ilkarman opened 6 years ago

ilkarman commented 6 years ago

I'm trying to generate a 2048-d feature vector from the penultimate layer of a trained ResNet50 model, like this Keras example:

model = ResNet50(include_top=False)

However, I can't see a way to do this in Caffe2 without running the full forward pass (including the final softmax) and then extracting the relevant blob.
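For reference, a minimal sketch of the Keras side I'm comparing against (the pooling='avg' argument and the input name fake_input_data_cl are my additions, used to get the flat 2048-d vector directly):

from keras.applications.resnet50 import ResNet50

# include_top=False drops the fc/softmax head; pooling='avg' applies
# global average pooling, so predict() returns a flat (N, 2048) array.
model = ResNet50(include_top=False, pooling='avg')
features = model.predict(fake_input_data_cl)  # channels-last input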

#%%bash
#wget https://github.com/leonardvandriel/caffe2_models/raw/master/model/resnet50_init_net.pb
#wget https://github.com/leonardvandriel/caffe2_models/raw/master/model/resnet50_predict_net.pb
import numpy as np
from caffe2.python import workspace

# Load the serialized protobufs
with open("resnet50_init_net.pb", "rb") as f:
    init_net = f.read()
with open("resnet50_predict_net.pb", "rb") as f:
    predict_net = f.read()
workspace.RunNetOnce(init_net)
workspace.CreateNet(predict_net)
p = workspace.Predictor(init_net, predict_net)
def predict_fn(classifier, data, batchsize):
    """ Return features from classifier """
    # RESNET_FEATURES = 2048; yield_mb yields (batch_index, batch) pairs
    out = np.zeros((len(data), RESNET_FEATURES), np.float32)
    for idx, dta in yield_mb(data, batchsize):
        results = classifier.run([dta])
        # Last feature layer should be the average-pooling blob:
        # I think "pool5", per http://ethereon.github.io/netscope/#/gist/db945b393d40bfa26006
        # Still runs the softmax, so this seems wasteful
        out[idx*batchsize:(idx+1)*batchsize] = workspace.FetchBlob('pool5').squeeze()
    return out
%%time
# 48.5s ???? Something must be wrong
# Forward-pass batches
features = predict_fn(p, fake_input_data_cf, BATCH_SIZE)

My issue is that running the full network seems to hurt inference speed badly: my test example (on CPU) takes the same time as the Keras model I tried (and is slower than CNTK or TensorFlow). Is there a way to kill the last few layers?

I'm basically trying to get the fastest inference time possible (up to the feature vector) on CPU.

Edit:

I've tried another method of loading and running inference, but it is still no faster:

from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

def load_net(INIT_NET, PREDICT_NET, device_opts):
    # Run the init net once to load the weights
    init_def = caffe2_pb2.NetDef()
    with open(INIT_NET, 'rb') as f:
        init_def.ParseFromString(f.read())
        init_def.device_option.CopyFrom(device_opts)
        workspace.RunNetOnce(init_def.SerializeToString())
    # Create (but don't run) the predict net
    net_def = caffe2_pb2.NetDef()
    with open(PREDICT_NET, 'rb') as f:
        net_def.ParseFromString(f.read())
        net_def.device_option.CopyFrom(device_opts)
        workspace.CreateNet(net_def.SerializeToString(), overwrite=True)
    return net_def.name

device_opts = core.DeviceOption(caffe2_pb2.CPU, 0)
test_net = load_net('resnet50_init_net.pb', 'resnet50_predict_net.pb',
                    device_opts=device_opts)
def predict_fn(classifier, data, batchsize):
    """ Return features from classifier """
    out = np.zeros((len(data), RESNET_FEATURES), np.float32)
    for idx, dta in yield_mb(data, batchsize):
        # Feed the input blob, run one iteration, then fetch 'pool5'
        workspace.FeedBlob("data", dta, device_option=device_opts)
        workspace.RunNet(classifier, 1)
        out[idx*batchsize:(idx+1)*batchsize] = workspace.FetchBlob('pool5').squeeze()
    return out

features = predict_fn(test_net, fake_input_data_cf, BATCH_SIZE)
Yangqing commented 6 years ago

Yeah, so Caffe2 right now computes all the operators as instructed, and one would need to manually remove the unneeded operators. There are a few ad hoc scripts we use to do so for common product use cases.

For proper tooling, we are prioritizing this in the upcoming months.

This could also be a possible good path for predictor auto-optimization - cc @salexspb .

Thanks for the detailed note!

salexspb commented 6 years ago

Yeah, we could probably just omit computing things that are not needed for the requested outputs.
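For what it's worth, a rough sketch of that kind of dead-op pruning (my own illustration, not an existing Caffe2 utility): walk the ops back-to-front from the requested outputs and keep only those whose outputs are actually consumed.

from caffe2.proto import caffe2_pb2

def prune_to_outputs(net_def, output_blobs):
    # Keep an op only if it produces a blob needed by a kept op
    # or by the requested outputs; iterate in reverse topological order.
    needed = set(output_blobs)
    kept = []
    for op in reversed(net_def.op):
        if any(o in needed for o in op.output):
            kept.append(op)
            needed.update(op.input)
    pruned = caffe2_pb2.NetDef()
    pruned.CopyFrom(net_def)
    del pruned.op[:]
    pruned.op.extend(reversed(kept))  # restore original op order
    return pruned  # external_output is left untouched for brevity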

akyrola commented 6 years ago

model_helper.ExtractPredictorNet() would allow you to extract only the relevant portion of the net. You define the inputs and outputs, so in this case the output would be 'pool5'.
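A minimal sketch of what that might look like here (blob names 'data' and 'pool5' per the netscope gist above; the exact ExtractPredictorNet signature and return value may differ across Caffe2 versions):

from caffe2.proto import caffe2_pb2
from caffe2.python import model_helper, workspace

net_def = caffe2_pb2.NetDef()
with open('resnet50_predict_net.pb', 'rb') as f:
    net_def.ParseFromString(f.read())

# Keep only the ops needed to compute 'pool5' from 'data'; the fc
# and softmax ops after the average pool should be dropped.
pred_net, _ = model_helper.ExtractPredictorNet(
    net_proto=net_def,
    input_blobs=['data'],
    output_blobs=['pool5'],
)
workspace.CreateNet(pred_net, overwrite=True)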

ilkarman commented 6 years ago

Thanks for the comments, I will try model_helper (and also adding an arg_scope to the model). @Yangqing I'm not sure if this is a different topic, but I haven't been able to get good inference speeds using Caffe2 on CPU or GPU. For example, with MXNet I get 150 images/s on GPU and 12/s on CPU (forward passes to the avg_pooling layer in ResNet50). However, with Caffe2 I get 69/s and 5/s respectively.

I don't think this is because of computing an additional softmax layer. My ideas were:

For me, ResNet50 training on Caffe2 is the fastest of all the frameworks, so I'm not sure what I'm doing wrong with inference.

dlwtojd26 commented 6 years ago

@ilkarman Hi, I have the same problem. In my case OpenMP does not work properly. Did you check whether OpenMP is working? I think it is not.
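One crude way to check (my own suggestion, not a Caffe2 API): time the same forward pass under different OMP_NUM_THREADS values; if throughput doesn't change, the build is probably not using OpenMP.

import os
# Must be set before caffe2 initializes its OpenMP runtime;
# rerun the whole script with e.g. '1' vs '8' and compare.
os.environ['OMP_NUM_THREADS'] = '8'

import time
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

device_opts = core.DeviceOption(caffe2_pb2.CPU, 0)
# load_net as defined earlier in this thread
test_net = load_net('resnet50_init_net.pb', 'resnet50_predict_net.pb',
                    device_opts=device_opts)

x = np.random.rand(32, 3, 224, 224).astype(np.float32)
workspace.FeedBlob('data', x, device_option=device_opts)

start = time.time()
for _ in range(10):
    workspace.RunNet(test_net, 1)
print('%.1f images/s' % (10 * 32 / (time.time() - start)))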

ilkarman commented 6 years ago

What's interesting for me is that when I run inference on Caffe2 using onnx_caffe2.backend, Caffe2 is quite a bit faster (but still slower than MXNet):

DL Library     Images/s (GPU)   Images/s (CPU)
ONNX_Caffe2    74.9             6.4
Caffe2         68.1             5.8
MXNet          139.1            35.0
pietern commented 6 years ago

@ilkarman Do you have the notebooks for those tests somewhere?