Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

how to flatten/unnest a struct? #2950

Open universalmind303 opened 1 month ago

universalmind303 commented 1 month ago

Is your feature request related to a problem? Please describe.

I want to flatten all fields of a struct into top-level columns, but it seems like I need to manually select every key to do that.

Describe the solution you'd like


from urllib.parse import urlparse

import daft
from daft import col

urls = [
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00004-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00005-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00006-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.arc/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ary/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00013-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00014-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00015-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00016-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00017-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00018-of-00020.parquet"
]

parsed_urls = [{
    'scheme': urlparse(url).scheme,
    'host': urlparse(url).hostname,
    'path': urlparse(url).path,
    'query': urlparse(url).query,
    'fragment': urlparse(url).fragment,
    'username': urlparse(url).username,
    'password': urlparse(url).password,
    'port': urlparse(url).port
} for url in urls]

df = daft.from_pydict({ "parsed_urls": parsed_urls })

I first tried this:

df.select(col('parsed_urls').struct.get("*"))

but wildcarding does not appear to be supported there.

I also tried .explode

df.explode(col('parsed_urls'))

but that seems to work only on list and fixed-size list (FSL) columns.
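To make the desired result concrete, here is a minimal plain-Python sketch of what flattening the struct should produce: each struct field becomes its own top-level column. `flatten_struct_column` is a hypothetical helper for illustration only, not part of Daft, and the example uses just three of the fields and two of the URLs from the snippet above.

```python
from urllib.parse import urlparse


def flatten_struct_column(rows):
    """Turn a list of dicts (one struct per row) into a dict of columns.

    Hypothetical helper for illustration; not part of Daft.
    """
    # Collect field names in first-seen order so the column order is stable.
    field_names = []
    for row in rows:
        for name in row:
            if name not in field_names:
                field_names.append(name)
    # A field missing from a row becomes None in that column.
    return {name: [row.get(name) for row in rows] for name in field_names}


urls = [
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.arc/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ary/train-00000-of-00001.parquet",
]
parsed = [urlparse(u) for u in urls]
parsed_urls = [{"scheme": p.scheme, "host": p.hostname, "path": p.path} for p in parsed]

# `columns` maps each struct field to a top-level list of values.
columns = flatten_struct_column(parsed_urls)
```

A dict of this shape is what `daft.from_pydict` accepts, so flattening before building the DataFrame is one way around the problem when the data starts out in Python. It does not help once the struct already lives in a DataFrame column, which is the case this request is about.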

universalmind303 commented 1 month ago

On further experimentation, it looks like this works:

df.select(col('parsed_urls.*'))

But I think it would be more intuitive if .struct.get('*') also worked, and a .struct.explode() would be nice too.

The main reason one of these would be nice is that if you have a function that returns a struct, you need multiple .select calls to flatten it.

For example:


@daft.udf(return_dtype=daft.DataType.struct({
    'scheme': daft.DataType.string(),
    'host': daft.DataType.string(),
    'path': daft.DataType.string(),
}))
def parse_url(url: daft.Series):
    parsed_urls = []
    for u in url.to_pylist():
        parsed = urlparse(u)
        parsed_urls.append({
            'scheme': parsed.scheme,
            'host': parsed.netloc,
            'path': parsed.path,
        })
    return daft.Series.from_pylist(parsed_urls)

# Assumes df has a string column named 'urls'.
df.select(parse_url(col('urls'))).select(col('urls.*'))

It would be nicer to chain it and do everything in a single .select, with either of these:

df.select(parse_url(col('urls')).struct.get('*'))
df.select(parse_url(col('urls')).struct.unnest())
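
In the meantime, if the struct's field names are known up front, the per-key selection can at least be generated rather than written by hand. The sketch below is only an illustration: `unnest_fields` is a hypothetical helper, not part of Daft, and it relies solely on `.struct.get(name)`, which the attempts above already use. Whether the engine evaluates the UDF once or once per field is not something this sketch assumes.

```python
def unnest_fields(expr, field_names):
    """Build one `.struct.get(name)` expression per known field name.

    Hypothetical helper, not part of Daft. `expr` is any expression that
    evaluates to a struct, e.g. col('parsed_urls') or parse_url(col('urls')).
    """
    return [expr.struct.get(name) for name in field_names]


# Intended usage with the UDF above (assumes df has a 'urls' string column):
#   df.select(*unnest_fields(parse_url(col('urls')), ['scheme', 'host', 'path']))
```

This keeps everything in one .select, but the field names still have to be listed explicitly, which is the gap a wildcard in .struct.get or a dedicated unnest method would close.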