Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.34k stars 164 forks

Repeatable sampling by max rows #3332

Open ukclivecox opened 2 days ago

ukclivecox commented 2 days ago

Is your feature request related to a problem?

Generally, I know the max rows I would like to retrieve and need a sample of a given dataset for this rather than some percentage.

Describe the solution you'd like

Examples:

DuckDB allows specifying a max number of rows; ClickHouse has a fuzzy max rows ("at least, but not much more").

Describe alternatives you've considered

do something like

max_rows = 120000
rows = df.count_rows()
if max_rows < rows:
    # Convert the desired row cap into a sampling fraction, then cap the
    # (approximate) sample with limit() to guarantee at most max_rows.
    sample_fraction = max_rows / float(rows)
    df = df.sample(sample_fraction).limit(max_rows)

This may be less efficient than a sampling method that returns the given number of rows directly.
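The reason the workaround needs both a fraction and a limit can be sketched in plain Python (this is an illustration of the general technique, not Daft's API): fraction-based sampling only yields an approximate row count, so a trailing cap is needed to guarantee the bound.

```python
import random

# Sketch (plain Python, not Daft): fraction-based sampling returns an
# approximate number of rows, which is why the snippet above still
# needs .limit() to enforce the max-rows bound.
rows = 10_000
max_rows = 1_000
fraction = max_rows / rows

rng = random.Random(42)  # a seed makes the sample repeatable
sampled = [i for i in range(rows) if rng.random() < fraction]

# len(sampled) is only approximately max_rows; capping guarantees it.
capped = sampled[:max_rows]
assert len(capped) <= max_rows
```

A native "sample N rows" operation could skip the separate count and cap entirely.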

Additional Context

Ideally, this would be repeatable, i.e. allow one to set a seed. That would allow sampling one table, joining it, and then taking the matching rows from the other tables as needed, since the same seed reproduces the sampled rows.

Would you like to implement a fix?

No

desmondcheongzx commented 2 days ago

Sampling a number of rows makes sense to me.

Ideally, this would be repeatable, i.e. allow one to set a seed. That would allow sampling one table, joining it, and then taking the matching rows from the other tables as needed, since the same seed reproduces the sampled rows.

It's worth noting that we can already do this today via the seed parameter: https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/dataframe_methods/daft.DataFrame.sample.html
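What the seed buys you can be illustrated with plain Python's random module (Daft's DataFrame.sample(fraction, seed=...) from the linked docs is analogous; the helper below is hypothetical, for illustration only):

```python
import random

# Sketch: a seeded sample is deterministic, so re-running the same
# sample against joined tables selects the same rows each time.
def sample_ids(n, fraction, seed):
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() < fraction]

a = sample_ids(1_000, 0.1, seed=7)
b = sample_ids(1_000, 0.1, seed=7)
assert a == b  # same seed -> identical sample
```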