databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0
1.99k stars · 494 forks

KerasImageFileEstimator API cannot work with dataset as explained in keras_image_file_estimator.py #107

Open · demetsude opened this issue 6 years ago

demetsude commented 6 years ago

Hi, I am using the sparkdl module from Databricks and am trying to run an application using `KerasImageFileEstimator`. Following the example in keras_image_file_estimator.py, I create the dataset like this:

```python
stringIndexer = StringIndexer(inputCol="imageLabel", outputCol="categoryIndex")
indexed_dataset = stringIndexer.fit(original_dataset).transform(original_dataset)
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
image_dataset = encoder.transform(indexed_dataset)
```

When I run `transformers = estimator.fit(image_dataset)`, it fails with:

```
_keras_label = row[label_col].array
AttributeError: 'SparseVector' object has no attribute 'array'
```

As far as I understand, the problem is that `OneHotEncoder` returns a `SparseVector` (`categoryVec`), and `SparseVector`, which is what `row[label_col]` is here, has no attribute called `array`. The error is raised from the `_getNumpyFeaturesAndLabels` function in keras_image_file_estimator.py.

I could not find a solution to this, so any help would be appreciated.

yogeshg commented 6 years ago

Thanks for raising this issue! Based on the context and error message, we think that https://github.com/databricks/spark-deep-learning/pull/125 should fix it. If you need an immediate workaround, you might have to ensure that the column referenced by the `labelCol` param of your estimator (`categoryVec`?) is a `DenseVector`, by applying a custom UDF to it.