Decide Function implementations

TomBurdge commented 4 months ago

This lists every function and whether it should be implemented. I'll mark each one DONE in this issue as it's finished.

~~append_if_schema_identical - Not Planned. Vertical concat (pl.concat) already exists in polars.~~
dataframe_helpers
~~column_to_list - DONE. python. (this is just polars select, collect, to_list, much more convenient than spark).~~
two_columns_to_dictionary - TODO. python
to_list_of_dictionaries - in polars. (check)
show_output_to_df - TODO. rename to something like dataframe_output_to_df python/rust.
create_df - in polars native.

single_space - TODO. rust.
remove_all_whitespace - TODO. rust.
anti_trim - TODO. rust.
remove_non_word_characters - TODO. rust.
exists - Propose not to implement, because I don't want to allow an arbitrary python function as the callable. That requires an inefficient map, which will tank performance.
forall - see exists.
multi_equals - TODO. rust.
week_start_date - might be covered by polars_xdt.
week_end_date - see week_start_date.
approx_equal - TODO. rust.
array_choice - TODO. rust. (interesting one, will be some way to do seed with a crate.)
business_days_between - covered by workday_count in polars_xdt.
uuid5 - can cover in faux_lars.
is_falsy - TODO. rust.
is_truthy - TODO. rust.
is_false - TODO. rust.
is_true - TODO. rust.
is_null_or_blank - TODO. rust.
is_not_in - might be covered by native polars.
null_between - TODO. rust.

Propose skipping. Undocumented. Wouldn't even make a very good log parser.

rand_laplace - generate random numbers with Laplace(mu, beta) - put into faux_lars if is worthwhile.
div_or_else - TODO. rust. Question - what about the other IEEE-754 edge cases, e.g. dividing by a negative number, or a numerator of 0? We can just decide, or let the caller pass an arg.

split_col - the quinn function expects only one delimiter, which may not be true... I prefer how polars does split, where it returns a list column. Propose not to implement, or add it but with an index position argument.

with_columns_renamed - covered by polars.DataFrame|LazyFrame.rename in polars.
with_some_columns_renamed - see with_columns_renamed.
snake_case_col_names - TODO. python. Needs more robust tests than quinn.
sort_columns - I think this is covered by polars sort?
flatten_struct - TODO. python.
flatten_map - There is no map type in polars.
flatten_dataframe - Could possibly not implement, I don't think this handles things like deeply nested structs; less robust than it sounds. Nested structs exist in pyspark but I'm not sure in polars...

LyndonFan commented 4 months ago

Some ideas / comments:

dataframe_helpers

to_list_of_dictionaries: this is just polars to_dicts
exists: either not implement it as suggested, or show a warning similar to polars map_groups?
is_not_in: Python -- maybe make use of polars.any_horizontal?

schema_helpers

schema_from_csv: the function looks like creating the schema from a csv, likely for processing another large data file.
- so our signature would be (filepath: str | Path) -> OrderedDict
- and result would be like OrderedDict([("name", pl.Utf8), ("age", pl.Int8)])

transformations

flatten_dataframe: Polars does support nested structs. Paraphrasing from Stack Overflow, this is doable with a combination of list.to_struct and unnest. However, doing this in Python (with polars) might have performance issues if the data is deeply nested and/or one of the elements has a long list.

TomBurdge commented 4 months ago

Thanks @LyndonFan !

to_list_of_dictionaries: thanks for checking this, I propose not to implement.
exists: I dislike pyspark udfs for most uses, and really don't like pandas apply. I would prefer not to implement; kind of undermines the idea of polars plugins for relatively convenient and fast extensibility.
is_not_in: sounds good! Yeah, I think it's fine to include even if it's possible in native polars. It's more about whether, as a developer productivity module, it would help. Since I didn't know how to do it trivially, the answer is yes, and therefore we should implement it.

schema_helpers

schema_from_csv: that makes a lot of sense, I hadn't looked closely at the function. Propose to implement.
flatten_dataframe: hmm, let me have a think; need to start work but anything nested does my head in a bit! So I'm glad polars doesn't implement. I suppose if there is another use with arrays and is actually useful and performant, then should implement.

I've been working on main, but will do proper PRs from now!
