Closed: bensums closed this issue 7 years ago
Did you set the dataframe persistence to MEMORY_AND_DISK? By default Spark tries to fit everything into memory (which includes the JVM overhead), so spilling to disk should resolve the issue. Remember that it doesn't copy each DataFrame row as a whole, only the references plus the new column and value. Also, what is the dimensionality of your one-hot encoded vector?
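For context, a minimal sketch of that persistence suggestion in PySpark, assuming a plain SparkSession; `df` here is just a placeholder for the transformed DataFrame, not the library's own object:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-example").getOrCreate()

# Placeholder DataFrame standing in for the transformer's output.
df = spark.range(1_000_000).toDF("id")

# MEMORY_AND_DISK keeps partitions that fit in memory and spills the
# rest to disk, instead of failing when the encoded vectors don't fit.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # an action to actually materialize the cache
```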
I think this is due to `utils.new_dataframe_row` copying whole rows. Why not use a UDF in the `transform` method? I observe this while doing a count on a dataset of only 6 million rows after transforming it with `OneHotTransformer`.
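A hedged sketch of that UDF idea, assuming the category is already an integer index; `category_index`, `one_hot`, and `vocab_size` are hypothetical names for illustration, not part of the library's `transform`:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

spark = SparkSession.builder.getOrCreate()

# Toy input: one integer category index per row (hypothetical column name).
df = spark.createDataFrame([(0,), (2,), (1,)], ["category_index"])

vocab_size = 3  # assumed number of distinct categories

@F.udf(returnType=VectorUDT())
def one_hot(index):
    # Build only the new sparse vector; the rest of the row is untouched,
    # so nothing gets copied wholesale.
    return SparseVector(vocab_size, {int(index): 1.0})

encoded = df.withColumn("one_hot", one_hot(F.col("category_index")))
encoded.show(truncate=False)
```

The point of the sparse-vector UDF is that Spark only evaluates the new column per row, whereas rebuilding rows via `utils.new_dataframe_row` duplicates every existing field.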