abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License

String Cleaning #152

Closed CangyuanLi closed 1 month ago

CangyuanLi commented 1 month ago

Partially addresses #151. Adds:

  1. remove_diacritics - strip diacritics (e.g. è -> e)
  2. normalize_string - apply Unicode normalization
  3. map_words - replace whole words with mapped values. This is faster than matching on regex word boundaries (\b)
  4. normalize_whitespace - collapse each run of whitespace into a single space (e.g. "a   b" -> "a b")
  5. replace_digits - replace digits with a specified value
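
For readers unfamiliar with these operations, here is a rough pure-Python sketch of what each one does to a single string. It uses only the standard library and is not this PR's implementation: the signatures, the default normalization form, and the choice to split on single spaces in map_words are all assumptions made for illustration.

```python
import re
import unicodedata


def remove_diacritics(s: str) -> str:
    # Decompose each character (NFD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_string(s: str, form: str = "NFKC") -> str:
    # `form` is one of "NFC", "NFD", "NFKC", "NFKD"; the default is an assumption.
    return unicodedata.normalize(form, s)


def map_words(s: str, mapping: dict[str, str]) -> str:
    # Assumption: words are separated by single spaces. A dict lookup per
    # token avoids building a regex with word boundaries for every key.
    return " ".join(mapping.get(word, word) for word in s.split(" "))


def normalize_whitespace(s: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    return re.sub(r"\s+", " ", s)


def replace_digits(s: str, replacement: str) -> str:
    # Replace each ASCII digit with `replacement`.
    return re.sub(r"[0-9]", replacement, s)


if __name__ == "__main__":
    print(remove_diacritics("crème brûlée"))
    print(map_words("main st and stone ave", {"st": "street", "ave": "avenue"}))
    print(normalize_whitespace("a   b\t c"))
    print(replace_digits("room 101", "#"))
```

Note that map_words leaves "stone" untouched because only whole tokens are looked up, which is the behavior a \b-anchored regex would otherwise provide.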