Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

[EXPRESSIONS] `.clip` #1907

Open colin-ho opened 2 months ago

colin-ho commented 2 months ago

clip(lower=None, upper=None)

Replace values outside of lower and upper bounds with these bounds. NULL values are preserved and are not replaced with bounds.
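To make the intended semantics concrete, here is a minimal plain-Python sketch. It is not Daft code: `clip_values` is a hypothetical helper, and treating a bound of `None` as "no bound on that side" is an assumption read off the default arguments in the signature above.

```python
from typing import List, Optional, Sequence, Union

Number = Union[int, float]


def clip_values(
    values: Sequence[Optional[Number]],
    lower: Optional[Number] = None,
    upper: Optional[Number] = None,
) -> List[Optional[Number]]:
    """Clamp each value into [lower, upper]; None (NULL) entries pass through."""
    out: List[Optional[Number]] = []
    for v in values:
        if v is None:
            # NULLs are preserved, never replaced by a bound.
            out.append(None)
            continue
        if lower is not None and v < lower:
            v = lower
        if upper is not None and v > upper:
            v = upper
        out.append(v)
    return out


print(clip_values([None, 1, 2, 3, 4, 5], 2, 4))  # [None, 2, 2, 3, 4, 4]
print(clip_values([1, 5], lower=2))              # [2, 5]
```

The first call mirrors the input and bounds used in the example in this comment.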

Example:

import daft

df = daft.from_pydict({"a": [None, 1, 2, 3, 4, 5]})
df = df.select(df["a"].clip(2, 4))
# result = [None, 2, 2, 3, 4, 4]

dmaymay commented 1 month ago

Can I contribute to this issue?

samster25 commented 1 month ago

@dmaymay of course! You may want to reference https://github.com/Eventual-Inc/Daft/pull/2041

colin-ho commented 3 weeks ago

Hey @dmaymay, any updates on this issue? Let us know if you need anything!

dmaymay commented 2 weeks ago

Hi @colin-ho, I was away for a bit. I have a solution that works, but it only accepts f64 bounds. I'm working on a version that accepts int bounds as well, and I'll share what I have so far soon in case it's needed.

Also, this implementation does not change the type of the DataArray, so, when an int is clipped by a float, the float would be converted into an int first. I'm not sure what the desired behaviour is here.

colin-ho commented 2 weeks ago

Yes, the type of the DataArray should not change. For the special case of an int clipped by a float, the int should be converted to float to avoid loss of precision. You can check out https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/datatypes/binary_ops.rs#L225 to see the appropriate types to cast to for comparisons!
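A small plain-Python sketch of why the cast direction matters for an int column clipped by a float bound. Both helper functions are hypothetical illustrations, not Daft code; the values and bounds are borrowed from the `y` column case discussed in this thread.

```python
def clip_with_bounds_cast_to_int(values, lower, upper):
    """Cast the bounds down to int first: the fractional part of a bound is lost."""
    lo, hi = int(lower), int(upper)
    return [None if v is None else min(max(v, lo), hi) for v in values]


def clip_with_values_cast_to_float(values, lower, upper):
    """Cast the values up to float instead: fractional bounds are kept."""
    lo, hi = float(lower), float(upper)
    return [None if v is None else min(max(float(v), lo), hi) for v in values]


ys = [10, 20, 30, 40, None]
print(clip_with_bounds_cast_to_int(ys, 15, 30.3))    # [15, 20, 30, 30, None]
print(clip_with_values_cast_to_float(ys, 15, 30.3))  # [15.0, 20.0, 30.0, 30.3, None]
```

With the bound truncated to 30, the value 40 is clipped to 30 rather than 30.3, which is the loss of precision in question.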

dmaymay commented 1 week ago

Hey @colin-ho, ngl this is kicking my ass. After writing a probably very convoluted solution for returning different types conditionally, I still ran into a mismatched types error when calling .clip() from Python:

import daft
from daft import DataType

data_dict = {
    "x": [10.0, 20.25, 30.7, 42.322, float("nan")],
    "y": [10, 20, 30, 40, None],
}
df = daft.from_pydict(data_dict)
# Cast "y" down to int16 so it can be clipped with a float upper bound.
new_df = df.with_column("y", df["y"].cast(DataType.int16())).collect()
print(new_df.select(new_df["x"].clip(19.3, 22.9)).collect())
print(new_df.select(new_df["y"].clip(15, 30.3)).collect())

Gives me

thread '<unnamed>' panicked at 'Mismatch of expected expression data type and data type from computed series, Int16 vs Float32', src/daft-table/src/lib.rs:403:13....

PanicException: Mismatch of expected expression data type and data type from computed series, Int16 vs Float32

I'm at a loss here, I'm not sure where the expected type is failing, or how to proceed. Any pointers you can give me? Any similar examples of returning different DataArray types conditionally?
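One way to read the panic message is that the data type declared for the expression up front (Int16) disagrees with the data type of the series that was actually computed (Float32). The toy model below illustrates that reading only. Every name in it is hypothetical, the promotion rule is a stand-in, and it is not a diagnosis of the actual code.

```python
def promote(column_dtype: str, bound_is_float: bool) -> str:
    """Stand-in promotion rule: an integer column with a float bound becomes Float32."""
    if column_dtype.startswith("Int") and bound_is_float:
        return "Float32"
    return column_dtype


def declared_dtype_ignoring_bounds(column_dtype: str, bound_is_float: bool) -> str:
    # Declares the input column's dtype unchanged, without looking at the bounds.
    return column_dtype


def computed_dtype(column_dtype: str, bound_is_float: bool) -> str:
    # What the (toy) kernel returns after promoting to fit the bounds.
    return promote(column_dtype, bound_is_float)


def check(declared: str, computed: str) -> str:
    """Raise if the declared and computed dtypes disagree, like the panic above."""
    if declared != computed:
        raise TypeError(
            "Mismatch of expected expression data type and data type "
            f"from computed series, {declared} vs {computed}"
        )
    return computed


try:
    check(declared_dtype_ignoring_bounds("Int16", True), computed_dtype("Int16", True))
except TypeError as err:
    print(err)

# Using the same rule for the declaration and the computation keeps them in sync.
print(check(promote("Int16", True), computed_dtype("Int16", True)))  # Float32
```

Under this reading, the declaration step and the computation step would both need to apply the same type rule.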

colin-ho commented 1 week ago

Hey @dmaymay ! Yeah don't sweat it, the types can get pretty gnarly. Here are some pointers that may help:

Lastly, feel free to open a draft pull request, and don't worry if it doesn't work! That way we can give feedback directly on your code.