databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0

When training on a large dataset: PicklingError? #163

Open MrDataPsycho opened 6 years ago

MrDataPsycho commented 6 years ago

I was trying to train a model I built myself on a GPU cluster using PySpark. With a smaller sample the training was successful, but when I run it on 30954 images I get the following error:

# this does not run for 30954 images
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# `estimator` is the sparkdl KerasImageFileEstimator built earlier from my own Keras model (definition not shown)
paramGrid = (
  ParamGridBuilder()
  .addGrid(estimator.kerasFitParams, [{"batch_size": 16, "verbose": 0},
                                      {"batch_size": 32, "verbose": 0}])
  .build()
)
mc = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label")
cv = CrossValidator(estimator=estimator, estimatorParamMaps=paramGrid, evaluator=mc, numFolds=2)
cvModel = cv.fit(train_df)

... ...
INFO:tensorflow:Froze 0 variables. Converted 0 variables to const ops.
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/broadcast.py", line 83, in dump
    pickle.dump(value, f, 2)
OverflowError: cannot serialize a string larger than 4GiB
PicklingError: Could not serialize broadcast: OverflowError: cannot serialize a string larger than 4GiB
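As far as I can tell, the limit comes from pickle protocol 2, which pyspark/broadcast.py uses in the line above and which cannot serialize a single object larger than 4 GiB. A minimal sketch outside Spark (purely illustrative, and it needs more than 4 GiB of RAM to run) reproduces the same OverflowError:

import pickle

# Illustration only: protocol 2 cannot serialize one object larger than 4 GiB;
# protocol 4 would handle it, but PySpark's broadcast here pins protocol 2.
big = b"x" * (4 * 1024 ** 3 + 1)   # just over 4 GiB
try:
    pickle.dumps(big, protocol=2)
except OverflowError as err:
    print(err)   # cannot serialize a bytes object larger than 4 GiB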

But when I run the same large sample through transfer learning with logistic regression, it works:

# this runs for 30954 images
from sparkdl import DeepImageFeaturizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)
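For completeness, a sketch of how the fitted pipeline can be checked afterwards, assuming a held-out test_df with the same "image" and "label" columns as train_df (test_df is my own split, not part of the snippet above):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate the fitted pipeline on the held-out split; LogisticRegression
# writes its scores to the default "rawPrediction" column.
predictions = p_model.transform(test_df)
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
print("areaUnderROC:", evaluator.evaluate(predictions))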