abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License
261 stars 17 forks

Sanitize text expression #151

Closed by julio-34727 1 month ago

julio-34727 commented 1 month ago

Thank you for your excellent plugin.

Is it possible to add an expression in the str namespace to clean up text? I already do this in Python by combining Polars expressions, DuckDB (for accents) and PyArrow (for normalization), but it would be nice to have it in Rust without going through several different libraries.

The idea is to remove, for example, emojis (if emoji=True) and accents (if accent=True), to apply fill_na (replacing empty strings and strings matching r"\s+"), and so on.
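To make the accent and emoji options concrete, here is a minimal per-string sketch using only the Python standard library. It is a reference for the intended behaviour, not the plugin's implementation: the helper names strip_accents and strip_emoji are hypothetical, and it assumes that "accents" means combining marks left after NFD decomposition and that "emoji" means characters in Unicode category So (the class matched by \p{So}).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def strip_emoji(text: str) -> str:
    """Drop every character in Unicode category So (Symbol, other)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")


if __name__ == "__main__":
    print(strip_accents("Café déjà vu"))
    print(strip_emoji("hello 🙂 world"))
```

Category So is broader than emoji (it also contains symbols such as ©), and it does not include the zero-width joiner used in composite emoji, so a production version would probably need a more precise character set.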

import polars as pl
from collections.abc import Sequence
from typing import Final, Literal

def sanitize_text(
    s: pl.Expr,
    *,
    norm: Literal["NFD", "NFKD", "NFC", "NFKC"] | None = None,
    case: Literal["lower", "upper", "capitalize", "title", "snake"] | None = None,
    emoji: bool = False,
    elision: bool = False,
    url: bool = False,
    accent: bool = False,
    fill_na: str | None = "",
    mappings: Sequence[tuple[str, str]] | None = None,
) -> pl.Expr: ...

case="snake" is not essential (it would be a bonus).

Example of mappings: mappings = [(r"[\x00\u200d]+", ""), (r"[\xa0\x0b\u200e\n\r\t\f]+", " "), (r"\s\s+", " ")]
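A small sketch of how such mappings might be applied, assuming each pair is (regex pattern, replacement) and that the pairs run in order as replace-all operations. The helper name apply_mappings is hypothetical, and Python's re module stands in for whatever regex engine the plugin would use.

```python
import re
from collections.abc import Sequence

# Example mappings from the post: (regex pattern, replacement) pairs.
MAPPINGS: list[tuple[str, str]] = [
    (r"[\x00\u200d]+", ""),                  # drop NUL and zero-width joiner
    (r"[\xa0\x0b\u200e\n\r\t\f]+", " "),     # odd whitespace/control chars -> space
    (r"\s\s+", " "),                         # collapse runs of whitespace
]


def apply_mappings(text: str, mappings: Sequence[tuple[str, str]]) -> str:
    """Apply each (pattern, replacement) pair in order, replacing all matches."""
    for pattern, replacement in mappings:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    print(repr(apply_mappings("a\x00b\xa0\xa0c\t\nd   e", MAPPINGS)))
```

Order matters here: the last mapping collapses any double spaces that the earlier replacements leave behind.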

EMOJI_REGEX: Final = r"[\p{So}]"
ELISION_REGEX: Final = r"""(?x)(?i)
(?:
    (?:l')|
    (?:m')|
    (?:t')|
    (?:qu')|
    (?:n')|
    (?:s')|
    (?:j')|
    (?:d')|
    (?:c')|
    (?:jusqu')|
    (?:quoiqu')|
    (?:lorsqu')|
    (?:puisqu')
)
(?=[aeiouy])
"""
URL_REGEX: Final = r"""(?x)(?i)
\b
(
    (?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)
    (?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
    (?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”``])
)
"""
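
As a quick check of what the elision and URL patterns match, here is a sketch that runs them through Python's re module. The helper names remove_elisions and remove_urls are hypothetical, and replacing matches with an empty string is an assumption about the intended behaviour. Python's re accepts the lookahead in ELISION_REGEX; a regex engine without look-around support would need that pattern rewritten.

```python
import re
from typing import Final

# Patterns copied from the post above.
ELISION_REGEX: Final = r"""(?x)(?i)
(?:
    (?:l')|
    (?:m')|
    (?:t')|
    (?:qu')|
    (?:n')|
    (?:s')|
    (?:j')|
    (?:d')|
    (?:c')|
    (?:jusqu')|
    (?:quoiqu')|
    (?:lorsqu')|
    (?:puisqu')
)
(?=[aeiouy])
"""
URL_REGEX: Final = r"""(?x)(?i)
\b
(
    (?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)
    (?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
    (?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”``])
)
"""


def remove_elisions(text: str) -> str:
    """Drop an elided prefix (l', qu', ...) when a vowel in [aeiouy] follows."""
    return re.sub(ELISION_REGEX, "", text)


def remove_urls(text: str) -> str:
    """Drop URL-like substrings; surrounding whitespace is left untouched."""
    return re.sub(URL_REGEX, "", text)


if __name__ == "__main__":
    print(remove_elisions("l'arbre qu'il voit jusqu'au soir"))
    print(repr(remove_urls("visit https://example.com/page today")))
```

Two things to keep in mind: the lookahead only lists unaccented vowels, so a word such as "l'école" is left unchanged, and removing a URL leaves the spaces around it, which a whitespace-collapsing mapping can clean up afterwards.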
abstractqqq commented 1 month ago

Thank you for the request!

@CangyuanLi is working on something like this for this package, and the changes will be merged in gradually. This is the first PR in this direction: https://github.com/abstractqqq/polars_ds_extension/pull/150

For snake case, polars_ds already has a function called to_snake_case.

import polars_ds as pds

df.select(pds.to_snake_case("column_name"))

should work.

For URL-related work, I have a second project called polars_istr, for identification string parsing, which aims to help with parsing strings in common standard formats. Take a look here.

abstractqqq commented 1 month ago

v0.4.5 should include the changes @CangyuanLi added, which partially address this issue.