GoogleCloudPlatform / python-docs-samples


Version creation failed: user-provided package test_code-0.1.tar.gz failed to install #2186

Closed leoninekev closed 4 years ago

leoninekev commented 5 years ago

I'm implementing a Keras model for object detection. Training it on ml-engine successfully produced a model_weights.hdf5 file. To get online predictions for test images, I'm following the custom prediction routine suggested in the GC AI Platform documentation here https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/ml_engine/custom-prediction-routines/tensorflow-predictor.py to serve the model and its artifact code in the cloud.

To do so, I modified the MyPredictor class in the predictor.py module as follows:

class MyPredictor(object):
  def __init__(self, config, model_rpn, model_classifier):
    self._config =config
    self._model_rpn= model_rpn
    self._model_classifier= model_classifier

  def format_img_size(self, img):
    img_min_side = float(self._config['im_size'])
    (height,width,_) = img.shape
    if width <= height:
        ratio = img_min_side/width
        new_height = int(ratio * height)
        new_width = int(img_min_side)
    else:            
        ratio = img_min_side/height
        new_width = int(ratio * width)
        new_height = int(img_min_side)
    img = cv2.resize(img, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
    return img, ratio

  def format_img_channels(self, img):
    img = img[:, :, (2, 1, 0)]
    img = img.astype(np.float32)
    img[:, :, 0] -= self._config['img_channel_mean'][0]
    img[:, :, 1] -= self._config['img_channel_mean'][1]
    img[:, :, 2] -= self._config['img_channel_mean'][2]
    img /= self._config['img_scaling_factor']
    img = np.transpose(img, (2, 0, 1))
    img = np.expand_dims(img, axis=0)
    return img

  def format_img(self, img):        
    img, ratio = self.format_img_size(img)
    img = self.format_img_channels(img)
    return img, ratio

  def get_real_coordinates(self, ratio, x1, y1, x2, y2):
    real_x1 = int(round(x1 // ratio))
    real_y1 = int(round(y1 // ratio))
    real_x2 = int(round(x2 // ratio))
    real_y2 = int(round(y2 // ratio))
    return (real_x1, real_y1, real_x2 ,real_y2)

  def preprocess(self, inputs):
    X, ratio = self.format_img(inputs)        
    if K.image_dim_ordering() == 'tf':
        X = np.transpose(X, (0, 2, 3, 1))            
    [Y1, Y2, F] = self._model_rpn.predict(X)
    R = roi_helpers.rpn_to_roi(Y1, Y2, self._config, K.image_dim_ordering(), overlap_thresh=0.7)

    R[:, 2] -= R[:, 0]
    R[:, 3] -= R[:, 1]

    bboxes = {}
    probs = {}
    bbox_threshold = 0.8

    class_mapping= self._config['class_mapping']

    for jk in range(R.shape[0]//self._config['num_rois'] + 1):
        ROIs = np.expand_dims(R[self._config['num_rois']*jk:self._config['num_rois']*(jk+1), :], axis=0)
        if ROIs.shape[1] == 0:
            break
        if jk == R.shape[0]//self._config['num_rois']:                
            curr_shape = ROIs.shape
            target_shape = (curr_shape[0],self._config['num_rois'],curr_shape[2])
            ROIs_padded = np.zeros(target_shape).astype(ROIs.dtype)
            ROIs_padded[:, :curr_shape[1], :] = ROIs
            ROIs_padded[0, curr_shape[1]:, :] = ROIs[0, 0, :]
            ROIs = ROIs_padded
        [P_cls, P_regr] = self._model_classifier.predict([F, ROIs])            
        for ii in range(P_cls.shape[1]):
            if np.max(P_cls[0, ii, :]) < bbox_threshold or np.argmax(P_cls[0, ii, :]) == (P_cls.shape[2] - 1):
                continue                
            cls_name = class_mapping[np.argmax(P_cls[0, ii, :])]                
            if cls_name not in bboxes:
                bboxes[cls_name] = []
                probs[cls_name] = []                    
            (x, y, w, h) = ROIs[0, ii, :]                
            cls_num = np.argmax(P_cls[0, ii, :])                
            try:
                (tx, ty, tw, th) = P_regr[0, ii, 4*cls_num:4*(cls_num+1)]
                tx /= self._config['classifier_regr_std'][0]
                ty /= self._config['classifier_regr_std'][1]
                tw /= self._config['classifier_regr_std'][2]
                th /= self._config['classifier_regr_std'][3]
                x, y, w, h = roi_helpers.apply_regr(x, y, w, h, tx, ty, tw, th)
            except:
                pass
            bboxes[cls_name].append([self._config['rpn_stride']*x, self._config['rpn_stride']*y, self._config['rpn_stride']*(x+w), self._config['rpn_stride']*(y+h)])
            probs[cls_name].append(np.max(P_cls[0, ii, :]))
    return [bboxes, probs]

  def postprocess(self, bounding_boxes, probabilities):
    all_dets=[]
    bboxes=bounding_boxes
    probs=probabilities        
    for key in bboxes:
        bbox = np.array(bboxes[key])
        new_boxes, new_probs = roi_helpers.non_max_suppression_fast(bbox, np.array(probs[key]), overlap_thresh=0.5)
        for jk in range(new_boxes.shape[0]):
            (x1, y1, x2, y2) = new_boxes[jk,:]
            coord_list= list(self.get_real_coordinates(ratio, x1, y1, x2, y2))
            all_dets.append((key,100*new_probs[jk],coord_list))
    return all_dets

  def predict(self, instances):
    inputs = np.asarray(instances)
    [bboxes, probs]= self.preprocess(inputs)        
    results = self.postprocess(bboxes, probs)
    return results.tolist()

  @classmethod
  def from_path(cls, model_dir):        
    model_path= os.path.join(model_dir,'model_frcnn.hdf5')
    num_features = 1024       
    config ={'verbose': True, 'network': 'resnet50', 'use_horizontal_flips': False, 'use_vertical_flips': False, 'rot_90': False, 'anchor_box_scales': [128, 256, 512], 'anchor_box_ratios': [[1, 1], [0.7071067811865475, 1.414213562373095], [1.414213562373095, 0.7071067811865475]], 'im_size': 600, 'img_channel_mean': [103.939, 116.779, 123.68], 'img_scaling_factor': 1.0, 'num_rois': 32, 'rpn_stride': 16, 'balanced_classes': False, 'std_scaling': 4.0, 'classifier_regr_std': [8.0, 8.0, 4.0, 4.0], 'rpn_min_overlap': 0.3, 'rpn_max_overlap': 0.7, 'classifier_min_overlap': 0.1, 'classifier_max_overlap': 0.5, 'class_mapping': {'cake': 0, 'donuts': 1, 'dosa': 2, 'bg': 3}, 'model_path': './model_frcnn.hdf5', 'base_net_weights': 'resnet50_weights_tf_dim_ordering_tf_kernels.h5'}        
    class_mapping = config['class_mapping']
    if 'bg' not in class_mapping:
        class_mapping['bg'] = len(class_mapping)        
    class_mapping = {v: k for k, v in class_mapping.items()}
    config['class_mapping']= class_mapping        
    input_shape_img = (None, None, 3)
    input_shape_features = (None, None, num_features)        
    img_input = Input(shape=input_shape_img)
    roi_input = Input(shape=(config['num_rois'], 4))
    feature_map_input = Input(shape=input_shape_features)       
    shared_layers = nn.nn_base(img_input, trainable=True)        
    num_anchors = len(config['anchor_box_scales']) * len(config['anchor_box_ratios'])
    rpn_layers = nn.rpn(shared_layers, num_anchors)
    classifier = nn.classifier(feature_map_input, roi_input, config['num_rois'], nb_classes=len(class_mapping), trainable=True)
    model_rpn = Model(img_input, rpn_layers)        
    model_classifier = Model([feature_map_input, roi_input], classifier)        
    model_rpn.load_weights(model_path, by_name=True)
    model_classifier.load_weights(model_path, by_name=True)
    model_rpn.compile(optimizer='sgd', loss='mse')
    model_classifier.compile(optimizer='sgd', loss='mse')

    return cls(config, model_rpn, model_classifier)

If I create the version with the default Python 2.7, the version is created successfully, but when I test it with a JSON request containing a NumPy array converted to a list, it returns:

{
  "error": "Prediction failed: unknown error."
}

It outputs a list of bboxes and labels when run locally.
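One difference between a local run and the hosted service is that the service has to serialize whatever predict() returns to JSON. This may or may not be the cause of the failure reported here. A minimal sketch, assuming the detections carry NumPy scalars such as np.float32 scores (the helper name to_jsonable and the sample detection are hypothetical, not part of the original code):

```python
import json

import numpy as np


def to_jsonable(detections):
    """Convert (label, score, box) tuples to plain Python types.

    Hypothetical helper: NumPy scalars such as np.float32 are not
    accepted by the standard json module, so they are cast explicitly.
    """
    return [
        [str(label), float(score), [int(v) for v in box]]
        for label, score, box in detections
    ]


# Sample detection shaped like the tuples built in postprocess() (made-up values).
raw = [("cake", np.float32(97.5), [np.int64(10), np.int64(20), np.int64(110), np.int64(220)])]

# The raw structure cannot be serialized because of the NumPy scalars.
try:
    json.dumps(raw)
    raw_is_serializable = True
except TypeError:
    raw_is_serializable = False

# After conversion, serialization succeeds.
clean = to_jsonable(raw)
encoded = json.dumps(clean)
```

Returning the converted list from predict() keeps the response independent of NumPy types.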

Also, when I attempt to create another version with the --python-version flag set to 3.5, version creation itself fails with this error:

Create Version failed. Bad model detected with error: "Failed to load model: User-provided package test_code-0.1.tar.gz failed to install: Command '['python-default', '-m', 'pip', 'install', '--target=/tmp/custom_lib', '--no-cache-dir', '-b', '/tmp/pip_builds', '/tmp/custom_code/test_code-0.1.tar.gz']' returned non-zero exit status 1 (Error code: 0)"

I'd really appreciate any help or corrections to work around this and serve online predictions using this Keras weights file.

htappen commented 5 years ago

Can you please share the setup.py file you used to create the package?

leoninekev commented 5 years ago

Yes, here's my setup.py code:

from setuptools import setup, find_packages

NAME = 'test_code'
VERSION = '0.1'
REQUIRED_PACKAGES = ['keras','h5py']

setup(
    name=NAME,
    version=VERSION,
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,    
    scripts=['predictor.py','roi_helpers.py','resnet.py','FixedBatchNormalization.py','RoiPoolingConv.py'])

htappen commented 5 years ago

1) You should get stderr logging. Make sure to set onlinePredictionLogging = True and onlinePredictionConsoleLogging = True when you create the model.
2) On the Python setup, my suspicion is that you need to recompile the package with Python 3.5 to get all the right packages for version 3.5.
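The first suggestion can be sketched with the same Python API client used for prediction requests later in this thread. This is only a sketch: the helper names build_model_body and create_model are hypothetical, and the service object is assumed to come from googleapiclient.discovery.build('ml', 'v1').

```python
def build_model_body(model_name, enable_logging=True):
    """Build the request body for creating a model with logging enabled.

    Hypothetical helper; the two logging field names are the ones
    mentioned in the comment above.
    """
    return {
        "name": model_name,
        "onlinePredictionLogging": enable_logging,
        "onlinePredictionConsoleLogging": enable_logging,
    }


def create_model(service, project_id, body):
    """Send the create request.

    Assumes `service` was built elsewhere, for example with
    googleapiclient.discovery.build('ml', 'v1').
    """
    parent = "projects/{}".format(project_id)
    return service.projects().models().create(parent=parent, body=body).execute()


# Model name taken from the logs later in this thread.
body = build_model_body("Mod_050519")
```

Logging options are set when the model resource is created, so an existing model would need to be re-created for them to take effect, as the next comment does.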

leoninekev commented 5 years ago

I re-created the model resource with the --enable-console-logging flag this time, and I'm now getting stderr logs like those during training job submission. But version creation with --python-version 3.5, after the package is collected and installed successfully, is interrupted with this log: Failed to load model: Unexpected error when loading the model: Shape must be rank 1 but is rank 0 for 'bn_conv1/Reshape_4' (op: 'Reshape') with input shapes: [1,1,1,64], []. (Error code: 0)

Omitting --python-version 3.5 creates a version as before, but the last few logs output: "I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA". I recall Cloud Shell printing the same message when I ran my package locally. What does that mean?

puneith commented 5 years ago

@andrewferlitsch can you take a look at this Keras issue.

leoninekev commented 5 years ago

The latter part, the version creation failure, was resolved by changing the TF runtime version from 1.13 to 1.5. But the former prediction failure persists, despite following the custom prediction routine documentation. Any help on that? To work around this for now, I tried running my application with Flask as an alternative. It runs well, and I'm now looking at hosting it on a production server as a Flask API, maybe using Google App Engine. Would App Engine be recommended over Google AI Platform (the one I was using previously) in this case? What are the downsides?

andrewferlitsch commented 5 years ago

The error message (shape must be rank 1 but is rank 0) means that the layer expected a 1D vector, but got a scalar value. I googled the exact error message and found several references in Japanese and one in English. The layer 'bn_conv1/Reshape_4' matches the Keras Faster R-CNN model. The one English answer to this problem I could find was dated April 19, 2019:

I had this same error. I seem to have gotten the program to start learning by editing the keras source. In file keras/backend/tensorflow_backend.py I found 4 reshape functions near each other, one of which was involved in the error. I changed the second argument of each of these from (-1) to [(-1)]. This allowed the program to run. Unfortunately, this is a dangerous change since I don’t actually know everything that will be affected.
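The change described in the quoted answer, from (-1) to [(-1)], is the difference between a rank-0 (scalar) and a rank-1 (vector) shape argument. A small illustration, using NumPy as a stand-in for the TensorFlow shapes involved (NumPy itself accepts both forms, so this only shows the rank distinction and does not reproduce the error):

```python
import numpy as np

# In Python, (-1) is just the integer -1, a scalar with rank 0.
scalar_shape = (-1)
# Wrapping it in a list gives a one-element vector with rank 1.
vector_shape = [(-1)]

rank_of_scalar = np.ndim(scalar_shape)
rank_of_vector = np.ndim(vector_shape)

# A tensor with the input shape reported in the error message.
gamma = np.ones((1, 1, 1, 64))

# Reshaping with the rank-1 shape flattens it to a vector of length 64.
flat = np.reshape(gamma, vector_shape)
```

Here rank_of_scalar is 0 and rank_of_vector is 1, which matches the "rank 1 but is rank 0" wording of the error.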

This is another answer, translated from Japanese, from a posting dated Dec 10, 2018:

When calling the Faster R-CNN line shared_layers = nn.nn_base(img_input, trainable=True), this error is raised:

InvalidArgumentError: Shape must be rank 1 but is rank 0 for 'bn_conv1_1/Reshape_4' (op: 'Reshape') with input shapes: [1,1,1,64], [].

After reviewing, it was found to be a problem with BatchNormalization. The following code reports a similar error:

from keras.layers import BatchNormalization, Input

x = Input(shape=(1, 2, 2))

BatchNormalization(axis=1)(x)

Error: InvalidArgumentError: Shape must be rank 1 but is rank 0 for 'batch_normalization_1/cond/Reshape_4' (op: 'Reshape') with input shapes: [1,1,1,1], [].

There is no problem on the CPU version of keras 2.2.0; there is a problem with the GPU version of keras. So the keras version is reduced to 2.1.6:

pip3 uninstall keras

pip3 install keras==2.1.6 -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

After this change, no error is reported.


Author: small white north Source: CSDN Original: https://blog.csdn.net/weixin_40755306/article/details/84944008 Copyright statement: This article is the original article of the blogger, please attach the blog post link!

leoninekev commented 5 years ago

Yes, I added exact version numbers for the dependencies in my setup.py file, where I used keras==2.2.0. That resolved the version creation error I was getting. But going by the Stackdriver logs, the earlier error ("error": "Prediction failed: unknown error.") is preceded by this error: Prediction failed: predict() got an unexpected keyword argument 'stats'

I made a few modifications to the predict method in the MyPredictor class above to process a b64-encoded image sent in the JSON request:

def predict(self, instances):
        inputs= base64.b64decode(instances['image_bytes']['b64'])
        inputs= scipy.misc.imread(io.BytesIO(inputs))
        inputs= inputs[...,::-1]
        [bboxes, probs, ratio]= self.preprocess(inputs)        
        results = self.postprocess(bboxes, probs, ratio)
        return results

From the Cloud SDK, I'm requesting a prediction from that model version as follows:

with open('3.jpg','rb') as image:
      img= base64.b64encode(image.read())
      instances= {'image_bytes': {'b64': base64.b64encode(img).decode()}}

name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, VERSION_NAME)
response = service.projects().predict(name=name,body={'instances': instances}).execute()

It outputs:

>>>response
{"error": "Prediction failed: unknown error."}
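Two details of the request above may be worth checking, independent of the server-side error: the image bytes are base64-encoded twice, while the predict method decodes only once, and the value passed as instances is a dict, whereas the prediction API expects a list of instances. A minimal sketch, assuming the same image_bytes/b64 layout (the helper names and the placeholder bytes are hypothetical):

```python
import base64


def encode_image_bytes(raw_bytes):
    """Base64-encode raw bytes exactly once and return ASCII text."""
    return base64.b64encode(raw_bytes).decode("ascii")


def build_request_body(raw_bytes):
    """Wrap one encoded image in a list under the 'instances' key."""
    return {"instances": [{"image_bytes": {"b64": encode_image_bytes(raw_bytes)}}]}


# Placeholder bytes standing in for the contents of an image file.
fake_jpeg = b"\xff\xd8\xff\xe0 not a real image"

body = build_request_body(fake_jpeg)

# One decode recovers the original bytes when the data was encoded once.
decoded_once = base64.b64decode(body["instances"][0]["image_bytes"]["b64"])

# With double encoding, one decode yields base64 text, not image bytes.
double_encoded = base64.b64encode(base64.b64encode(fake_jpeg)).decode("ascii")
decoded_double_once = base64.b64decode(double_encoded)
```

With this body, predict() would receive a list and would need to read instances[0]['image_bytes']['b64'] rather than index the list with a string key.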

But I never passed a keyword argument 'stats' anywhere, neither in the prediction request nor in the MyPredictor class. Is there something I'm missing here?

Following are the extended logs of the above error:

{
 insertId:  "5d231514000b3916750b34e9"  
 logName:  "projects/project-281612/logs/ml.googleapis.com%2Fprimary.stderr"  
 receiveTimestamp:  "2019-07-08T10:04:04.865116250Z"  
 resource: {
  labels: {
   model_id:  "Mod_050519"    
   project_id:  "project-281612"    
   region:  ""    
   version_id:  "v5_a"    
  }
  type:  "cloudml_model_version"   
 }
 textPayload:  "(07/08/2019 10:04:04 AM Prediction failed: predict() got an unexpected keyword argument 'stats'"  
 timestamp:  "2019-07-08T10:04:04.735510Z"  
}
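The log above indicates that the service passes extra keyword arguments, such as stats, when it calls predict(). The Predictor interface shown in the AI Platform custom prediction routine documentation declares predict(self, instances, **kwargs), whereas the predict methods posted in this thread accept only instances. A minimal sketch of that signature, using a stand-in model (EchoPredictor and its doubling "model" are hypothetical and only illustrate the signature):

```python
class EchoPredictor(object):
    """Hypothetical stand-in for MyPredictor, reduced to the interface."""

    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # **kwargs absorbs any extra keyword arguments the caller
        # supplies (for example 'stats'), so they no longer raise
        # "got an unexpected keyword argument".
        return [self._model(instance) for instance in instances]

    @classmethod
    def from_path(cls, model_dir):
        # A real implementation would load weights from model_dir;
        # here a trivial function stands in for the model.
        return cls(lambda value: value * 2)


predictor = EchoPredictor.from_path("unused/model/dir")

# Simulate a caller that passes an extra keyword argument.
result = predictor.predict([1, 2, 3], stats={"hypothetical": True})
```

Applied to the class in this thread, the change would be limited to the predict signature; the body can ignore kwargs entirely.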

Please take a look?

Dana-Farber commented 5 years ago

+1 we too are blocked by this bug.

andrewferlitsch commented 5 years ago

@leoninekev @Dana-Farber Did you try this recommendation from a blog poster who had a similar problem:

There is no problem on the CPU version of keras 2.2.0, there is a problem with the GPU version of keras. So the keras version is reduced to 2.1.6:

pip3 uninstall keras

pip3 install keras==2.1.6 -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Dana-Farber commented 5 years ago

There are two bugs -- one dealing with keras and one with gcp version creation. The gcp version creation is our issue since we aren’t using keras.

andrewferlitsch commented 4 years ago

@dizcology reassigning per Dana-Farber comment that this is not Keras but GCP issue.

czahedi commented 4 years ago

Hey Yu-Han can you take a look at this? Thanks!

dizcology commented 4 years ago

@leoninekev Apologies for following this up so late - are you still experiencing the issues as mentioned above?

czahedi commented 4 years ago

Hey @dizcology have you heard from the user?

dizcology commented 4 years ago

No updates. Closing this for now.

@leoninekev @Dana-Farber please reopen this thread if you are still experiencing the issues.

didiallo commented 1 year ago

Still have the same issue in 2023. Can someone help please?