dvgodoy / handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes
MIT License

how to get bins and counts instead of a plot #13

Closed maresk closed 5 years ago

maresk commented 5 years ago

I am getting this error when trying to get back bins and counts instead of a histogram plot: module 'handyspark.plot' has no attribute 'stratified_histogram'

But other functions in the module exist:

help(handy.plot.histogram)
histogram(sdf, colname, bins=10, categorical=False, ax=None)

The doc pages list this method in the plot module, however. Is there a recommended way of getting the bins and counts instead of a plot?

dvgodoy commented 5 years ago

Hi,

The stratified_histogram function is meant to be called from inside the stratified dataframe, to generate data to build the plot. But you still can get access to the bins and counts with a simple workaround:

from handyspark.plot import strat_histogram
bins, counts = strat_histogram(hdf.stratify('Pclass')._df, colname='Age')

Since strat_histogram needs a stratified dataframe, it is possible to stratify it first and then access the reference to the internal dataframe with _df.

In a future version, I should make it easier to get the bins and counts. Thanks for pointing this out!

maresk commented 5 years ago

Tried your suggestion, but I am getting a single array of bins and a single set of counts in a dataframe. The dataset has a column Properties that has about 15 categorical values. After stratifying on Properties, I am looking to get the histogram bins and counts for each of the properties on a column yield_at_starttime, which is a continuous float variable.

I am expecting one array of bins (all categories are binned against the same bin start values) and 15 arrays for the counts, one for each category, but instead I am getting the following:


bins, counts = strat_histogram(hdf.stratify('Properties')._df, colname='yield_at_starttime')

bins
array([0.        , 0.39997528, 0.79995056, 1.19992584, 1.59990112,
       1.9998764 , 2.39985168, 2.79982696, 3.19980224, 3.59977752,
       3.9997528 ])

counts
Out[7]: 
   __yield_at_starttime_bucket   count
0                            0  838371
1                            1   32142
2                            2   11496
3                            3    6112
4                            4    3640
5                            5    2672
6                            6    1839
7                            7    1316
8                            8    1039
9                            9     861

Since the strat_histogram code is being used internally to generate the histogram plot, I am wondering whether I am missing something about its correct usage.

dvgodoy commented 5 years ago

You're absolutely right... it turns out, the workaround needs to be a bit hackier... unfortunately, one of the properties is only set when the method (in this case, hist()) is called... so we need to set it manually first and clear it afterwards:

strata = hdf.stratify('Pclass')
strata._handy._strata = ['Pclass']
bins, counts = strat_histogram(strata._df, colname='Age')
strata._handy._clear_stratification()
bins, counts
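Because the snippet above sets a private attribute and then has to clear it, one option is to wrap the two steps so the cleanup always runs, even if the histogram call raises. This is only a minimal sketch: `temporary_strata` is a hypothetical helper, not part of HandySpark, and the stand-in object below just mimics the `_handy._strata` attribute and `_clear_stratification()` method used in the workaround (what the real method resets is an assumption here).

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_strata(strata, columns):
    """Hypothetical helper: set the private strata list, always clear it on exit."""
    strata._handy._strata = list(columns)
    try:
        yield strata
    finally:
        strata._handy._clear_stratification()


# Stand-in object so the sketch runs without Spark. With HandySpark this
# would be the result of hdf.stratify('Pclass'). The fake
# _clear_stratification simply resets _strata to None (an assumption).
handy = SimpleNamespace(_strata=None)
handy._clear_stratification = lambda: setattr(handy, "_strata", None)
fake_strata = SimpleNamespace(_handy=handy, _df="<spark dataframe>")

with temporary_strata(fake_strata, ["Pclass"]) as s:
    # With the real objects, this is where one would call:
    #   bins, counts = strat_histogram(s._df, colname='Age')
    inside = s._handy._strata

after = fake_strata._handy._strata  # cleared again once the block exits
```

The design choice is just `try`/`finally`: the manual "set, call, clear" sequence stays the same, but the clear step cannot be skipped by an exception.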

For the Titanic example, the workaround returns something like this: the bins and a dataframe with all the counts, including the stratified column:

(array([ 0.42 ,  8.378, 16.336, 24.294, 32.252, 40.21 , 48.168, 56.126,
        64.084, 72.042, 80.   ]),
     Pclass  __Age_bucket  count
 0        1             0    3.0
 1        2             0   17.0
 2        3             0   34.0
 3        1             1    6.0
 4        2             1    4.0
 5        3             1   36.0
 6        1             2   30.0
 7        2             2   37.0
 8        3             2  110.0
 9        1             3   29.0
 10       2             3   47.0
 11       3             3   93.0
 12       1             4   42.0
 13       2             4   34.0
 14       3             4   42.0
 15       1             5   27.0
 16       2             5   15.0
 17       3             5   28.0
 18       1             6   27.0
 19       2             6   12.0
 20       3             6    6.0
 21       1             7   16.0
 22       2             7    5.0
 23       3             7    3.0
 24       1             8    5.0
 25       2             8    2.0
 26       3             8    2.0
 27       1             9    1.0
 28       2             9    0.0
 29       3             9    1.0)
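
If one array of counts per category is needed (all sharing the same bins), the long-form dataframe can be pivoted with plain pandas. The following is a small sketch on toy data that copies the column layout of the output above. It assumes `counts` is a pandas dataframe and that bucket `i` corresponds to the interval between `bins[i]` and `bins[i + 1]`; the variable names `wide` and `counts_by_class` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `counts` dataframe returned by the workaround
# (same columns as the Titanic output above, first two buckets only).
counts = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 2, 3],
    "__Age_bucket": [0, 0, 0, 1, 1, 1],
    "count": [3.0, 17.0, 34.0, 6.0, 4.0, 36.0],
})
bins = np.array([0.42, 8.378, 16.336])  # 3 edges, so 2 buckets

# One row per bucket, one column per category. Reindexing over all
# buckets and filling with 0 covers buckets that have no rows at all.
wide = (
    counts.pivot(index="__Age_bucket", columns="Pclass", values="count")
    .reindex(range(len(bins) - 1))
    .fillna(0)
)

# One array of counts per category, all aligned to the same bins.
counts_by_class = {cls: wide[cls].to_numpy() for cls in wide.columns}
```

With the real output, the same pivot would use the stratified column (for example Properties) as `columns` and the matching `__<colname>_bucket` column as `index`.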

maresk commented 5 years ago

Great, confirming that worked!

My two cents: many modules like this try to plot the data directly, but given the large ecosystem of plotting packages in Python, that may actually be counterproductive. For me, the matplotlib output was too packed (15 category histograms), and the figure wasn't actually showing the histograms because it was too large. Since this is a module folks would use to process large datasets, the most generally usable output would probably be the processed data. The ideal case would be to have access to both the data and the plot, to use at the user's discretion.

Thanks again!

dvgodoy commented 5 years ago

Thanks for the feedback. You do have a point, I will make it easier to get only the data without the plot. And thanks for using HandySpark :-)