dvgodoy / handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes
MIT License

When column is of type int, histogram acts different from pandas dataframe #22

Open itamar-otonomo opened 4 years ago

itamar-otonomo commented 4 years ago

When I call hist on a pandas column (Series) containing integers, I get a proper histogram where the x axis is divided into bins of value ranges. When I do the same using a handy DataFrame, I get a categorical histogram.
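To make the difference concrete, here is a rough sketch in plain Python of the two ways of counting. It is only an illustration, not the actual pandas or HandySpark code; the function names `binned_counts` and `categorical_counts` are made up for this example, and the default of 10 bins mirrors the pandas `hist` default.

```python
import random
from collections import Counter

def binned_counts(values, bins=10):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        # clamp so that the maximum value lands in the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

def categorical_counts(values):
    """Count every distinct value separately, one bar per value."""
    return dict(sorted(Counter(values).items()))

random.seed(0)
values = [random.randrange(0, 100) for _ in range(5000)]
print(len(binned_counts(values)))       # 10 bars, one per value range
print(len(categorical_counts(values)))  # one bar per distinct integer
```

With integers drawn from 0 to 99, the first approach gives a handful of range bars, while the second gives a separate bar for each distinct value, which is what I see from handy.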

I dug into the code, and the reason handy behaves this way is that the integer column is not included in the self._continuous group of columns.

hist uses the continuous list to decide how to plot a column: anything not in that list is drawn as categorical. This is why a hist of integers in handy is not what one would expect from a hist of integers in pandas.

A workaround is to cast the integer column to float. I think this is a bug (I couldn't find anything about it in the docs).
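If there are many integer columns, the cast can be applied to all of them at once. The following is a minimal sketch, assuming a regular Spark DataFrame whose `dtypes` reports integer columns as `tinyint`, `smallint`, `int` or `bigint`; `integer_columns` and `cast_integers_to_double` are hypothetical helper names, not part of HandySpark or PySpark.

```python
# Spark SQL type names that DataFrame.dtypes reports for integer columns
INTEGER_TYPES = {"tinyint", "smallint", "int", "bigint"}

def integer_columns(dtypes):
    """Return the names of integer columns from (name, type) pairs such as df.dtypes."""
    return [name for name, dtype in dtypes if dtype in INTEGER_TYPES]

def cast_integers_to_double(df):
    """Cast every integer column of a Spark DataFrame to double (hypothetical helper)."""
    for name in integer_columns(df.dtypes):
        df = df.withColumn(name, df[name].cast("double"))
    return df

# dtypes like the ones Spark would report for a pandas int64 column plus a float cast
print(integer_columns([("bobo", "bigint"), ("float_bobo", "float")]))  # ['bobo']
```

It would be used as `hdf = cast_integers_to_double(df).toHandy()`. This only sidesteps the behaviour described above; it does not change how handy classifies integer columns.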

Here's a quick repro:

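import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from handyspark import *  # adds toHandy() to Spark DataFrames

# assumes an active SparkSession named `spark`
pdf = pd.DataFrame({'bobo': np.random.randint(0, 100, 5000)})
df = spark.createDataFrame(pdf).withColumn('float_bobo', F.col('bobo').astype('float'))
hdf = df.toHandy()
pdf.bobo.hist()                # pandas: binned histogram
hdf.cols['bobo'].hist()        # handy: categorical histogram
hdf.cols['float_bobo'].hist()  # handy: binned histogram (the workaround)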

I forgot to congratulate you on this great lib. It really is cool!

Itamar