dvgodoy / handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes
MIT License
185 stars 23 forks source link

boxplot is failing #26

Closed DrYSG closed 2 years ago

DrYSG commented 3 years ago

I am using PySpark version 3.01 on DataBricks 7.4.

I am getting this error when trying to do a boxplot (histograms work fine). I have tried manually casting DISTANCE as both as integer and a double, but both fail:

AnalysisException: cannot resolve 'approx_percentile(`DISTANCE`, CAST(0.25BD AS DOUBLE), 100.0BD)' due to data type mismatch: argument 3 requires integral type, however, '100.0BD' is of decimal(4,1) type.; line 1 pos 0;
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<command-2120656041886569> in <module>
      5 hdf.cols["ORIGIN_AIRPORT"].hist(ax=axs[1,0])
      6 hdf.cols["DESTINATION_AIRPORT"].hist(ax=axs[1,1])
----> 7 hdf.cols["DISTANCE"].boxplot(ax=axs[2,0])
      8 hdf.cols["plannedDepartTime"].boxplot(ax=axs[2,1])
root
 |-- dayOfWeek: string (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- FLIGHT_NUMBER: integer (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- DISTANCE: double (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- plannedDepartTime: integer (nullable = true)
 |-- label: integer (nullable = true)
dvgodoy commented 3 years ago

Hi,

Thanks for reporting this, but unfortunately Spark 3.0 is not supported by HandySpark, as I replied in the other issue (#25).

Best, Daniel