dvgodoy / handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes
MIT License

Access to filter method in handyspark API #12

Closed by maresk 5 years ago

maresk commented 5 years ago

Thanks for the great work! I'm not sure why every other module of this kind converts to pandas by default; that defeats the purpose, in my opinion. I'm trying HandySpark out now.

I have a question about filtering DataFrame rows in HandySpark: is it currently possible to filter rows based on column values directly, instead of computing a column and assigning it back to the DataFrame as a new column? It would be great to have direct filtering in the API, or a workaround that doesn't require low-level Spark calls.

dvgodoy commented 5 years ago

Hi,

Thanks! :-) Regarding filtering, you can use Spark's own expr function to build the filter condition. For instance, to get only the rows where Age is greater than 25:

from pyspark.sql import functions as F
sdf.filter(F.expr('Age > 25'))

There is no need to create a column; Spark evaluates the expression and filters the rows accordingly.
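As a side note, Spark's DataFrame.filter also accepts the condition as a plain SQL string, so the F.expr wrapper is optional. Here is a minimal sketch of that variant; the helper name filter_min_age is made up for illustration, and sdf is assumed to be a Spark DataFrame with a numeric Age column:

```python
def filter_min_age(sdf, min_age):
    """Return the rows of `sdf` whose Age column is greater than `min_age`.

    Assumption: `sdf` is a Spark DataFrame with a numeric Age column.
    DataFrame.filter accepts a SQL condition string as well as a Column,
    so no F.expr wrapper (and no extra import) is needed here.
    """
    # int() keeps arbitrary text from being interpolated into the SQL string
    return sdf.filter(f"Age > {int(min_age)}")
```

Calling filter_min_age(sdf, 25) passes the same condition, 'Age > 25', that the F.expr example above builds.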

But if you want to do something fancier, you can leverage the pandas object of HandySpark as well. Let's say you want to keep only the rows whose Age matches one of a list of specific values:

hdf.filter(hdf.pandas['Age'].isin(values=[25, 26]))

HandySpark implements several column methods available in Pandas, like isin.
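For comparison, the same list-membership filter can be expressed in plain Spark as a SQL IN condition. The sketch below only builds the condition string; in_list_condition is a hypothetical helper (not part of HandySpark or PySpark), and it assumes a simple column name and numeric values:

```python
def in_list_condition(column, values):
    """Build a Spark SQL condition such as 'Age IN (25, 26)'.

    Hypothetical helper for illustration: `column` is assumed to be a
    simple identifier and `values` a non-empty list of numbers.
    """
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    for v in values:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"only numeric values are supported, got {v!r}")
    return f"{column} IN ({', '.join(str(v) for v in values)})"


# Usage, assuming `sdf` is a Spark DataFrame with a numeric Age column:
# sdf.filter(in_list_condition("Age", [25, 26]))
```

The resulting string can be passed to filter directly or wrapped in F.expr. The HandySpark isin call above is the shorter option when you are already working with a HandyFrame.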

maresk commented 5 years ago

Great, thanks, that worked. Good to know I can filter within HandySpark.