Closed TomBurdge closed 4 months ago
Some ideas / comments:
dataframe_helpers
to_list_of_dictionaries: this is just polars to_dicts.
exists: either do not implement it as suggested, or show a warning similar to the one polars gives for map_groups?
is_not_in: Python. Maybe make use of polars.any_horizontal?
schema_helpers
schema_from_csv: the function looks like it creates the schema from a CSV, likely for processing another large data file.
(filepath: str | Path) -> OrderedDict
OrderedDict([("name", pl.Utf8), ("age", pl.Int8)])
transformations
flatten_dataframe: Polars does support nested structs. Paraphrasing from Stack Overflow, this is doable with a combination of list.to_struct and unnest. However, doing this in Python (with polars) might have performance issues if the data is deeply nested and/or one of the elements has a long list.
Thanks @LyndonFan!
to_list_of_dictionaries: thanks for checking this, I propose not to implement.
exists: I dislike pyspark udfs for most uses, and really don't like pandas apply. I would prefer not to implement; it kind of undermines the idea of polars plugins for relatively convenient and fast extensibility.
is_not_in: sounds good! Yeah, I think it's fine if it's possible in polars native. It's more about whether, as a developer productivity module, it would/could help. Since I didn't know how to do it trivially, I think the answer is yes and it should therefore be implemented.
schema_helpers
schema_from_csv: that makes a lot of sense, I hadn't looked closely at the function. Propose to implement.
flatten_dataframe: hmm, let me have a think; need to start work but anything nested does my head in a bit! So I'm glad polars doesn't implement. I suppose if there is another use with arrays, and it is actually useful and performant, then we should implement.
I've been working on main, but will do proper PRs from now!
All functions, whether they should be implemented. Can update this issue as DONE when each is done.
append if schema identical
append_if_schema_identical - Not Planned. Vertical concat (pl.concat) exists in the polars standard library.
dataframe_helpers
column_to_list - DONE. python. (This is just polars select, collect, to_list; much more convenient than spark.)
two_columns_to_dictionary - TODO. python.
to_list_of_dictionaries - in polars. (check)
show_output_to_df - TODO. Rename to something like dataframe_output_to_df. python/rust.
create_df - in polars native.
dataframe validator
validate_presence_of_columns - DONE. python.
validate_schema - DONE. python.
validate_absence_of_columns - TODO. python.
Functions
single_space - TODO. rust.
remove_all_whitespace - TODO. rust.
anti_trim - TODO. rust.
remove_non_word_characters - TODO. rust.
exists - Propose not to implement, because I don't want to allow an arbitrary python function as the callable. This requires an inefficient map, which will tank performance.
forall - see exists.
multi_equals - TODO. rust.
week_start_date - might be covered by polars_xdt.
week_end_date - see week_start_date.
approx_equal - TODO. rust.
array_choice - TODO. rust. (Interesting one; there will be some way to do the seed with a crate.)
business_days_between - covered by workday_count in polars_xdt.
uuid5 - can cover in faux_lars.
is_falsy - TODO. rust.
is_truthy - TODO. rust.
is_false - TODO. rust.
is_true - TODO. rust.
is_null_or_blank - TODO. rust.
is_not_in - might be covered by native polars.
null_between - TODO. rust.
Keyword Finder
Propose skipping. Undocumented. Wouldn't even make a very good log parser.
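Going back to the four whitespace helpers under Functions (single_space, remove_all_whitespace, anti_trim, remove_non_word_characters): below is a rough pure-Python reference of the behaviour I would expect, which could double as test cases for the rust versions. The regex patterns are my assumption of what quinn does, not checked against its source, and the `_ref` names are made up for this sketch.

```python
import re


def single_space_ref(s: str) -> str:
    # Collapse runs of spaces into one space, then trim both ends.
    return re.sub(r" +", " ", s).strip()


def remove_all_whitespace_ref(s: str) -> str:
    # Drop every whitespace character, including tabs and newlines.
    return re.sub(r"\s+", "", s)


def anti_trim_ref(s: str) -> str:
    # Remove whitespace between words but keep leading/trailing whitespace.
    return re.sub(r"\b\s+\b", "", s)


def remove_non_word_characters_ref(s: str) -> str:
    # Remove anything that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]+", "", s)
```

The same patterns should carry over to the rust regex crate, but that would need confirming when the plugins are written.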
math(s)
rand_laplace - generate random numbers with Laplace(mu, beta) - put into faux_lars if it is worthwhile.
div_or_else - TODO. rust. Question: what about the other IEEE-754 edge cases, e.g. dividing by a negative number, or a numerator of 0? Can just decide, or pass an arg.
Schema Helpers
print_schema_as_code - TODO. python.
schema_from_csv - TODO. python. (Is this just scan_csv(path).schema?)
complex_fields - TODO. python.
Split columns
split_col - the quinn function expects only one delimiter, which may not be true... I prefer how polars does split, where it returns a list column. Propose not to implement, or add but with an index position argument.
Transformations
with_columns_renamed - covered by polars.DataFrame.rename / polars.LazyFrame.rename.
with_some_columns_renamed - see with_columns_renamed.
snake_case_col_names - TODO. python. Needs more robust tests than quinn.
sort_columns - I think this is covered by polars sort?
flatten_struct - TODO. python.
flatten_map - There is no map type in polars.
flatten_dataframe - Could possibly not implement. I don't think this handles things like deeply nested structs; less robust than it sounds. Nested structs exist in pyspark but I'm not sure in polars...