databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0

ValueError: Cannot extract any feature from dataset! #195

Closed innat closed 5 years ago

innat commented 5 years ago

We've got a dataset of images (PNG) loaded into a Spark DataFrame. It has been transformed and has the following columns: image URI, label, one-hot label.


train.show()

+--------------------+-----+-------------+
|               image|label|one_hot_label|
+--------------------+-----+-------------+
|     sample_0_URI...|    9|    (9,[],[])|
|     sample_0_URI...|    9|    (9,[],[])|
|     sample_0_URI...|    6|(9,[6],[1.0])|
|     sample_0_URI...|    6|(9,[6],[1.0])|
|     sample_0_URI...|    6|(9,[6],[1.0])|
|     sample_0_URI...|    6|(9,[6],[1.0])|
|     sample_0_URI...|    6|(9,[6],[1.0])|
|     sample_0_URI...|    6|(9,[6],[1.0])|
+--------------------+-----+-------------+
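As an aside, the `(9,[],[])` rows show that label `9` one-hot encodes to an all-zero vector of size 9, which is consistent with ten label values (0-9) and Spark's `OneHotEncoder` dropping the last category by default. A minimal NumPy sketch of that drop-last behaviour (the helper name is ours, not Spark's):

```python
import numpy as np

def one_hot_drop_last(label, num_labels):
    """One-hot encode `label` into a vector of length num_labels - 1,
    mapping the last category to the all-zero vector (drop-last)."""
    vec = np.zeros(num_labels - 1, dtype=np.float64)
    if label < num_labels - 1:
        vec[label] = 1.0
    return vec

print(one_hot_drop_last(6, 10))  # index 6 set -> matches (9,[6],[1.0])
print(one_hot_drop_last(9, 10))  # all zeros -> matches (9,[],[])
```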

We're trying to implement distributed hyperparameter tuning with a custom Keras model, using `KerasImageFileEstimator`:

import numpy as np
# preprocess_input should match the model family in use,
# e.g. from keras.applications.inception_v3 import preprocess_input

def load_image_from_uri(local_uri):
  img = ...  # load the image at local_uri, e.g. with PIL
  img_arr = np.array(img).astype(np.float32)
  img_tnsr = preprocess_input(img_arr[np.newaxis, :])
  return img_tnsr
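For completeness, `img_arr[np.newaxis, :]` just prepends a batch dimension so the loader returns a 4-D tensor; a quick NumPy check (the 180x180x3 shape comes from the dataset rows shown below):

```python
import numpy as np

img_arr = np.zeros((180, 180, 3), dtype=np.float32)  # stand-in for a decoded PNG
img_tnsr = img_arr[np.newaxis, :]                    # prepend batch dimension
print(img_tnsr.shape)  # (1, 180, 180, 3)
```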

estimator = KerasImageFileEstimator( inputCol="image",
                                     outputCol="prediction",
                                     labelCol="one_hot_label",
                                     imageLoader=load_image_from_uri,
                                     kerasOptimizer='adam',
                                     kerasLoss='categorical_crossentropy',
                                     modelFile='/tmp/model-full-tmp.h5' # local file path for model
                                   ) 

Then we used it for hyperparameter tuning, doing a grid search with `CrossValidator`.

paramGrid = (
...
)

evaluator = MulticlassClassificationEvaluator(predictionCol='prediction',
                                              labelCol='label')
cv = CrossValidator(estimator=estimator, estimatorParamMaps=paramGrid,
                    evaluator=evaluator, numFolds=2)

cvModel = cv.fit(train)

But it throws an error:

ValueError: Cannot extract any feature from dataset!

Now, the docs only cover transfer learning, i.e. InceptionV3 with the model saved in HDF5 format. The dataset is well structured for that demo, though. Each row represents image metadata; let's look:


train.collect()[0]

# output
Row(image=Row(origin='images/9/a00378.png',
              height=180, width=180, nChannels=3,
              mode=16, data=bytearray(b'...')),
    label=9, one_hot_label=SparseVector(9, {}))

Using `cv.fit(train)` we pass the whole training set to the Keras estimator, where the `inputCol` parameter picks the `image` column (the actual image URI) and the related `labelCol` picks the one-hot vector.

KerasImageFileEstimator(inputCol="image",
                        outputCol="prediction",
                        labelCol="one_hot_label",
                        imageLoader=load_image_from_uri, ...)

It seems that the `_getNumpyFeaturesAndLabels` function in `keras_image_file_estimator` is unable to pick up the image URI.

def _getNumpyFeaturesAndLabels(self, dataset):
    image_uri_col = self.getInputCol()
    label_col = None
    if self.isDefined(self.labelCol) and self.getLabelCol() != "":
        label_col = self.getLabelCol()
    tmp_image_col = self._loadedImageCol()
    image_df = self.loadImagesInternal(dataset, image_uri_col).dropna(subset=[tmp_image_col])

    # Extract features
    localFeatures = []
    rows = image_df.collect()
    for row in rows:
        spimg = row[tmp_image_col]
        features = imageStructToArray(spimg)  # <- struct schema to array
        localFeatures.append(features)

    if not localFeatures:
        raise ValueError("Cannot extract any feature from dataset!")
    X = np.stack(localFeatures, axis=0)
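So the error fires only when `localFeatures` stays empty, i.e. when every row is dropped by `dropna(subset=[tmp_image_col])` before feature extraction. A plain-Python sketch of that failure path (function and names are ours, not sparkdl's):

```python
import numpy as np

def extract_features(rows):
    """Mimic the tail of _getNumpyFeaturesAndLabels: collect per-row
    arrays and raise if nothing survived the image-loading step."""
    local_features = [np.asarray(r) for r in rows if r is not None]
    if not local_features:
        raise ValueError("Cannot extract any feature from dataset!")
    return np.stack(local_features, axis=0)

print(extract_features([[1.0], [2.0]]).shape)  # (2, 1): two rows survive
try:
    extract_features([None, None])  # every row dropped -> the reported error
except ValueError as e:
    print(e)
```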

And the feature extractor `imageStructToArray` seems unable to decode it here.

def imageStructToArray(imageRow):
    """
    Convert an image to a numpy array.
    :param imageRow: Row, must use imageSchema.
    :return: ndarray, image data.
    """
    imType = imageTypeByOrdinal(imageRow.mode)
    shape = (imageRow.height, imageRow.width, imageRow.nChannels)
    return np.ndarray(shape, imType.dtype, imageRow.data)
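The struct-to-array conversion itself is easy to verify in isolation. With a synthetic 2x2 RGB byte buffer, and assuming mode 16 is OpenCV's CV_8UC3 (uint8, 3 channels), the same `np.ndarray(shape, dtype, buffer)` call round-trips cleanly (a sketch with our own stand-in for the image Row):

```python
import numpy as np
from collections import namedtuple

# Stand-in for a Spark image Row; field names match the image schema.
ImageRow = namedtuple("ImageRow", "mode height width nChannels data")

orig = np.arange(2 * 2 * 3, dtype=np.uint8).reshape(2, 2, 3)
row = ImageRow(mode=16, height=2, width=2, nChannels=3,
               data=bytearray(orig.tobytes()))

# Same call imageStructToArray makes, with the dtype resolved by hand
# (mode 16 assumed to be OpenCV CV_8UC3, i.e. uint8 x 3 channels).
decoded = np.ndarray((row.height, row.width, row.nChannels), np.uint8, row.data)
print(np.array_equal(decoded, orig))  # True
```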

This is very straightforward and should work like a charm. A key catch is that the docs explain this with InceptionV3 and its pretrained ImageNet weights, whereas we try a custom Keras model with no pretrained weights and expect it to train in a distributed manner. However, we also tried InceptionV3 with its ImageNet weights, but no luck.


@MrBago @mateiz @ahirreddy @marmbrus @pmangg Thank You.