Open colin-ho opened 2 months ago
Can I contribute to this issue?
@dmaymay Of course! You may want to reference https://github.com/Eventual-Inc/Daft/pull/2041
Hey @dmaymay, any updates on this issue? Let us know if you need anything!
Hi @colin-ho, I was away for a bit. I have a working solution, but it only accepts f64 bounds. I'm working on a version that will accept int types as well, and I'll share what I have so far soon in case it's needed.
Also, this implementation does not change the type of the DataArray, so when an int is clipped by a float, the float is converted into an int first. I'm not sure what the desired behaviour is here.
Yes, the type of the DataArray should not change. For the special case of an int clipped by a float, the int should be converted to float to avoid loss of precision. You can check out https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/datatypes/binary_ops.rs#L225 to see the appropriate types to cast to for comparisons!
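To make the precision point concrete, here is a small pure-Python sketch. It is not Daft code, and `clip_values` is a hypothetical helper written only for this illustration. It compares casting a float bound down to int against promoting the int values to float.

```python
def clip_values(values, lower, upper):
    """Clip each non-null value to [lower, upper]; None stays None."""
    return [None if v is None else min(max(v, lower), upper) for v in values]


ints = [10, 20, 30, 40, None]

# Option A: cast the float bound down to int first (30.3 becomes 30),
# so the upper bound silently loses its fractional part.
cast_bound = clip_values(ints, 15, int(30.3))

# Option B: promote the int values to float and keep the bound exact.
as_floats = [None if v is None else float(v) for v in ints]
promoted = clip_values(as_floats, 15.0, 30.3)

print(cast_bound)  # [15, 20, 30, 30, None]
print(promoted)    # [15.0, 20.0, 30.0, 30.3, None]
```

The value 40 ends up as 30 under option A but as 30.3 under option B, which is the loss of precision the comment above refers to.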
Hey @colin-ho, ngl this is kicking my ass. After writing a probably very convoluted solution for returning different types conditionally, I still ran into a mismatched types error when calling .clip() from Python:
import daft
from daft import DataType

data_dict = {
    'x': [10.0, 20.25, 30.7, 42.322, float("nan")],
    'y': [10, 20, 30, 40, None],
}
df = daft.from_pydict(data_dict)
new_df = df.with_column('y', df['y'].cast(DataType.int16())).collect()
print(new_df.select(new_df["x"].clip(19.3, 22.9)).collect())
print(new_df.select(new_df["y"].clip(15, 30.3)).collect())
This gives me:
thread '<unnamed>' panicked at 'Mismatch of expected expression data type and data type from computed series, Int16 vs Float32', src/daft-table/src/lib.rs:403:13
....
PanicException: Mismatch of expected expression data type and data type from computed series, Int16 vs Float32
I'm at a loss here. I'm not sure where the expected type check is failing, or how to proceed. Any pointers you can give me? Any similar examples of returning different DataArray types conditionally?
Hey @dmaymay! Yeah, don't sweat it, the types can get pretty gnarly. Here are some pointers that may help:
Take the min and max parameters as expressions, which will be evaluated to series. This way you don't have to deal with primitive types until you get to the array level. (To keep it simple, you can just assume that the input column, min, and max are all the same type; we can deal with different types later.)
Lastly, feel free to open a draft pull request, and don't worry if it doesn't work! That way we can give feedback directly on your code.
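The idea of treating min and max as series rather than scalars can be sketched in plain Python. This is only an illustration, not Daft's implementation: `clip_series` is a hypothetical name, lists stand in for series, and it assumes all three inputs share one type and that the bounds contain no nulls.

```python
def clip_series(values, lower, upper):
    """Element-wise clip where the bounds are sequences as well.

    A length-1 bound is broadcast to the length of `values`.
    Null (None) values are passed through unchanged.
    Assumption: the bounds themselves contain no None.
    """
    n = len(values)

    def broadcast(bound):
        if len(bound) == 1:
            return list(bound) * n
        if len(bound) != n:
            raise ValueError("bound length must be 1 or match values")
        return list(bound)

    out = []
    for v, lo, hi in zip(values, broadcast(lower), broadcast(upper)):
        out.append(None if v is None else min(max(v, lo), hi))
    return out


# Scalar-like bounds are just length-1 series.
scalar_bounds = clip_series([1, 5, 9, None], [2], [6])
# Per-row bounds work through the same code path.
row_bounds = clip_series([1, 5, 9], [0, 6, 0], [3, 8, 4])

print(scalar_bounds)  # [2, 5, 6, None]
print(row_bounds)     # [1, 6, 4]
```

Because a literal bound becomes a length-1 series, the kernel only ever sees one shape of input, and type handling can be done once before it runs.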
clip(lower=None, upper=None)
Replace values outside of lower and upper bounds with these bounds. NULL values are preserved and are not replaced with bounds.
Example:
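The original example is not present in this text. As a stand-in, here is a minimal pure-Python sketch of the semantics described above. It is not the library's own example: the `clip` function below is a hypothetical list-based helper, and treating an omitted bound as "no limit on that side" is an assumption based on the `None` defaults in the signature.

```python
def clip(values, lower=None, upper=None):
    """Replace values outside [lower, upper] with the nearest bound.

    None (NULL) values are preserved. A bound left as None is
    assumed to mean that side is unbounded.
    """
    out = []
    for v in values:
        if v is None:
            out.append(None)  # NULLs are never replaced by a bound
            continue
        if lower is not None and v < lower:
            v = lower
        if upper is not None and v > upper:
            v = upper
        out.append(v)
    return out


both = clip([1, 5, None, 9], lower=2, upper=6)
upper_only = clip([1, 5, None, 9], upper=6)

print(both)        # [2, 5, None, 6]
print(upper_only)  # [1, 5, None, 6]
```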