Open universalmind303 opened 2 months ago
Is your feature request related to a problem? Please describe.
For a column containing URLs, I'd like to parse them and extract the relevant components.
Describe the solution you'd like
import daft
from daft import col

urls = [
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00004-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00005-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00006-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.arc/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ary/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00013-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00014-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00015-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00016-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00017-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00018-of-00020.parquet",
]

df = daft.from_pydict({ 'urls': urls })

# Proposed API: parse each URL into a struct, then expand its fields into columns.
df.select(col('urls').url.parse()).select(col('url.*')).collect()

╭──────────┬────────────────┬──────────┬────────────┬───────┬────────┬──────────╮
│ fragment ┆ host           ┆ password ┆ …          ┆ query ┆ scheme ┆ username │
│ ---      ┆ ---            ┆ ---      ┆            ┆ ---   ┆ ---    ┆ ---      │
│ Utf8     ┆ Utf8           ┆ Null     ┆ (2 hidden) ┆ Utf8  ┆ Utf8   ┆ Null     │
╞══════════╪════════════════╪══════════╪════════════╪═══════╪════════╪══════════╡
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│          ┆ huggingface.co ┆ None     ┆ …          ┆       ┆ https  ┆ None     │
╰──────────┴────────────────┴──────────┴────────────┴───────┴────────┴──────────╯
(Showing first 8 of 11 rows)
Describe alternatives you've considered
UDFs (user-defined functions)
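For reference, the UDF route would probably wrap something like the helper below. This is only a rough sketch using Python's standard `urllib.parse`, not Daft code: the function name `parse_url_components` is made up for illustration, and `path` and `port` are my guess at the two hidden columns in the table above.

```python
from urllib.parse import urlsplit


def parse_url_components(url):
    """Split one URL string into a dict of components (hypothetical helper)."""
    if url is None:
        return None
    parts = urlsplit(url)
    return {
        "fragment": parts.fragment,  # "" when the URL has no #fragment
        "host": parts.hostname,      # lowercased host, None if absent
        "password": parts.password,  # None when no credentials are given
        "path": parts.path,          # assumption: one of the hidden columns
        "port": parts.port,          # assumption: the other hidden column
        "query": parts.query,        # "" when the URL has no ?query
        "scheme": parts.scheme,
        "username": parts.username,  # None when no credentials are given
    }


example = parse_url_components(
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00004-of-00007.parquet"
)
```

Running a per-row Python function like this over a column works, but it goes through Python for every value, which is part of why a built-in expression would be nicer.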
hello, could I take this issue?
@ThanhChinhBK assigning this issue to you!