Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

`url.parse` function #2951

Open universalmind303 opened 2 months ago

universalmind303 commented 2 months ago

Is your feature request related to a problem? Please describe.

For a column containing URLs, I'd like to parse them and extract the relevant components.

Describe the solution you'd like

import daft
from daft import col

urls = [
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00004-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00005-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00006-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.arc/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ary/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00013-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00014-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00015-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00016-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00017-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00018-of-00020.parquet"
]

df = daft.from_pydict({ 'urls': urls })

df.select(col('urls').url.parse()).select(col('url.*')).collect()

╭──────────┬────────────────┬──────────┬────────────┬───────┬────────┬──────────╮
│ fragment ┆ host           ┆ password ┆      …     ┆ query ┆ scheme ┆ username │
│ ---      ┆ ---            ┆ ---      ┆            ┆ ---   ┆ ---    ┆ ---      │
│ Utf8     ┆ Utf8           ┆ Null     ┆ (2 hidden) ┆ Utf8  ┆ Utf8   ┆ Null     │
╞══════════╪════════════════╪══════════╪════════════╪═══════╪════════╪══════════╡
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
╰──────────┴────────────────┴──────────┴────────────┴───────┴────────┴──────────╯
(Showing first 8 of 11 rows)

Describe alternatives you've considered

UDFs
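For reference, the per-URL logic such a UDF would need can be sketched with Python's standard `urllib.parse`. This is only an illustration of the workaround, not the proposed `url.parse` implementation. The helper names `parse_url` and `parse_urls` are hypothetical, and the `path` and `port` keys are assumptions, since two columns are hidden in the output above.

```python
from urllib.parse import urlsplit


def parse_url(url):
    """Split one URL into a dict of components; return None for a null input."""
    if url is None:
        return None
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,  # None when the URL has no userinfo
        "password": parts.password,  # None when the URL has no userinfo
        "host": parts.hostname,
        "port": parts.port,          # assumed field name (hidden in the table above)
        "path": parts.path,          # assumed field name (hidden in the table above)
        "query": parts.query,        # empty string when absent
        "fragment": parts.fragment,  # empty string when absent
    }


def parse_urls(urls):
    """Apply parse_url to a batch of values, as the body of a batch UDF would."""
    return [parse_url(u) for u in urls]


example = parse_url(
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00013-of-00020.parquet"
)
print(example["scheme"], example["host"], example["path"])
```

Wrapping `parse_urls` in a Daft UDF with a struct return type would be one way to apply it to the `urls` column. That wiring is left out here because it depends on the UDF API of the Daft version in use. A native `url.parse` would avoid the per-row Python overhead of this approach.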


ThanhChinhBK commented 3 weeks ago

Hello, could I take this issue?

samster25 commented 3 weeks ago

@ThanhChinhBK assigning this issue to you!