DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License

Feat/polars with columns, async with_columns pandas #1234

Open jernejfrank opened 1 week ago

jernejfrank commented 1 week ago

Please let me know if the scope creep is too big and I can cut some things out into a new PR.

Changes

How I tested this

Notes

The one thing I didn't touch is the spark extension (zero experience with pyspark): the implementation there is different so far, and I'm not sure we can use extract_columns as straightforwardly as with pandas/polars, but I'm happy to tackle that as well in case it's possible.

Checklist

jernejfrank commented 6 days ago

I left with_columns in h_pandas for backwards compatibility and have something similar in h_polars to be consistent, but IMO we should deprecate h_pandas (a super short-lived plugin, lol), remove it from h_polars, and keep it central in recursive.py.

jernejfrank commented 3 days ago

Ok, so here is my line of thinking with regards to the changes:

The with_columns decorator consists of three parts:

  1. We need the input node. This can be the full dataframe if pass_dataframe_as is used, or we need to extract columns into nodes if columns_to_pass is used (see the usage sketch after this list). Given that some dataframe types are supported by Hamilton's extract_columns and some are not, this should be implemented on a per-library basis.
  2. We need the subdag nodes. Again, we can re-use Hamilton's subdag functionality, but some libraries will need more (see h_spark), so this is again to be implemented on a per-library basis.
  3. Last is combining everything into a single dataframe, again to be implemented on a per-library basis.
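
To make the two input modes concrete, here is a minimal usage sketch of the pandas flavour. Function and column names like col_a, col_b, initial_df are made up for illustration, the keyword arguments follow the existing h_pandas with_columns API as I understand it, and this isn't a full driver example:

```python
import pandas as pd

from hamilton.plugins import h_pandas


# Column-level functions; these become the subdag nodes.
def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
    return col_a + col_b


def col_d(col_c: pd.Series) -> pd.Series:
    return col_c * 2


# Input mode 1: columns_to_pass -- col_a/col_b are extracted from the upstream
# dataframe into individual nodes before the subdag runs.
@h_pandas.with_columns(
    col_c,
    col_d,
    columns_to_pass=["col_a", "col_b"],
    select=["col_c", "col_d"],
)
def final_df(initial_df: pd.DataFrame) -> pd.DataFrame:
    return initial_df


# Input mode 2: pass_dataframe_as -- the whole upstream dataframe is handed to
# the subdag under the given name instead, e.g.
# @h_pandas.with_columns(col_c, col_d, pass_dataframe_as="my_df", select=[...]),
# and the subdag functions then pull columns out of my_df themselves.
```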

So what I decided is to leave three abstract methods:

  1. get_initial_nodes
  2. get_subdag_nodes
  3. create_merge_node

These should give enough flexibility to implement any dataframe library, while being concrete enough to wire everything together in inject_nodes from NodeInjector (roughly sketched below).
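
For orientation, here is a simplified sketch of that shape; the class name, signatures, and wiring are assumed/illustrative rather than the exact code in this PR:

```python
import abc
from typing import Any, Callable, Dict, List, Type

from hamilton import node


class with_columns_base(abc.ABC):
    """Shared factory: library plugins subclass this and fill in the three hooks."""

    def __init__(
        self,
        *load_from,
        columns_to_pass: List[str] = None,
        pass_dataframe_as: str = None,
        select: List[str] = None,
        dataframe_type: Type = None,
    ):
        self.load_from = load_from
        self.columns_to_pass = columns_to_pass
        self.pass_dataframe_as = pass_dataframe_as
        self.select = select
        self.dataframe_type = dataframe_type  # e.g. pd.DataFrame, pl.DataFrame

    @abc.abstractmethod
    def get_initial_nodes(self, fn: Callable, params: Dict[str, Type]) -> List[node.Node]:
        """Input node(s): the full dataframe (pass_dataframe_as) or one node per extracted column (columns_to_pass)."""

    @abc.abstractmethod
    def get_subdag_nodes(self, config: Dict[str, Any]) -> List[node.Node]:
        """Nodes built from the functions/modules passed to the decorator."""

    @abc.abstractmethod
    def create_merge_node(self, upstream_name: str, node_name: str) -> node.Node:
        """Append the selected outputs back onto the dataframe."""

    def inject_nodes(self, params: Dict[str, Type], config: Dict[str, Any], fn: Callable):
        # Shared wiring (simplified): the library-specific hooks plug in here.
        initial_nodes = self.get_initial_nodes(fn, params)
        subdag_nodes = self.get_subdag_nodes(config)
        merge_node = self.create_merge_node(fn.__name__, node_name="_append")
        return initial_nodes + subdag_nodes + [merge_node]
```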

Now, every plugin library (h_pandas, h_polars, and h_polars_lazyframe) inherits from this class and in its initialisation calls out to the parent factory init, but passes in the required dataframe type (e.g. pd.DataFrame, pl.DataFrame, or pl.LazyFrame), which is in turn derived from the extension modules. So in effect we use the registry approach without hard-binding ourselves to implementing any functionality in there.
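
For example, the eager-polars plugin would look roughly like this (illustrative only: the import path and signatures are assumed to mirror the sketch above, not the exact PR code):

```python
import polars as pl

from hamilton.function_modifiers.recursive import with_columns_base  # assumed location


class with_columns(with_columns_base):
    """Eager-polars flavour; h_polars_lazyframe would pass pl.LazyFrame instead."""

    def __init__(self, *load_from, columns_to_pass=None, pass_dataframe_as=None, select=None):
        super().__init__(
            *load_from,
            columns_to_pass=columns_to_pass,
            pass_dataframe_as=pass_dataframe_as,
            select=select,
            dataframe_type=pl.DataFrame,
        )

    def get_initial_nodes(self, fn, params):
        ...  # extract columns into nodes, or pass the frame through unchanged

    def get_subdag_nodes(self, config):
        ...  # re-use Hamilton's subdag machinery

    def create_merge_node(self, upstream_name, node_name):
        ...  # pl.DataFrame.with_columns(...) to append the selected outputs
```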

Since that part of the API is private, should we want to switch to a registry approach later, the refactoring is straightforward and shouldn't get us into trouble down the road.

elijahbenizzy commented 2 days ago


Nice, I think this is a good overview. Note there might still be shared stuff between the implementations, in which case you have two options to reduce duplicated code (should you want):

  1. Joint/helper functions
  2. Additional subclasses, e.g. for the column-based ones (polars/pandas); a sketch follows below

But I think these will be options for later, and quite possibly over-engineered.
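
If option 2 were ever wanted, a purely hypothetical sketch (continuing the names from the base-class sketch above) could look like:

```python
# Hypothetical: an intermediate subclass sharing column-extraction logic
# between pandas and polars, leaving the merge step and the dataframe type
# to the concrete plugins.
class ColumnarWithColumnsBase(with_columns_base):
    def get_initial_nodes(self, fn, params):
        # Shared behaviour for frames with named columns: extract_columns when
        # columns_to_pass is given, otherwise inject the full frame under
        # pass_dataframe_as.
        ...

    def get_subdag_nodes(self, config):
        # Shared: delegate to Hamilton's subdag machinery.
        ...
```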

To be clear, this doesn't look like it works with spark, yet? Do you think that's a possibility?