Open ukclivecox opened 2 days ago
Sampling number of rows makes sense to me.
Ideally, this would be repeatable, i.e. allow one to set a seed. That would make it possible to sample one table and then pull the matching rows from the joined tables as needed, so the sample stays consistent across joins.
It's worth noting that we can already do this today via the seed parameter: https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/dataframe_methods/daft.DataFrame.sample.html
Is your feature request related to a problem?
Generally, I know the maximum number of rows I want to retrieve, so I need to sample a given dataset by row count rather than by percentage.
Describe the solution you'd like
Examples:
DuckDB allows specifying a maximum number of rows.
ClickHouse has a fuzzy maximum number of rows: "at least but not much more".
Describe alternatives you've considered
do something like
Maybe this is inefficient compared with a sampler that returns a given number of rows as part of sampling?
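To illustrate what "returning a given number of rows as part of sampling" could look like, here is a minimal sketch of single-pass reservoir sampling in plain Python. It is not Daft's implementation and does not reproduce the workaround snippet that is missing above; `sample_rows`, `max_rows`, and `seed` are hypothetical names chosen for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_rows(rows: Iterable[T], max_rows: int, seed: Optional[int] = None) -> List[T]:
    """Return at most `max_rows` rows, chosen uniformly at random in one pass.

    Hypothetical helper (reservoir sampling); the total row count does not
    need to be known up front, and a fixed `seed` makes the result repeatable.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < max_rows:
            # Fill the reservoir with the first `max_rows` rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability max_rows / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < max_rows:
                reservoir[j] = row
    return reservoir
```

Because the input is read once and only `max_rows` rows are held in memory, this avoids sampling a percentage first and trimming afterwards. Whether an engine could push the same idea down to its scan operators is a separate question.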
Additional Context
Ideally, this would be repeatable, i.e. allow one to set a seed. That would make it possible to sample one table and then pull the matching rows from the joined tables as needed, so the sample stays consistent across joins.
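As a rough sketch of that workflow, the plain-Python example below draws a seeded, fixed-size sample from one table and then keeps only the rows of a second table whose join key appears in that sample. The function `sample_with_related`, the list-of-dicts table layout, and the column name passed as `key` are all assumptions made for illustration; none of them are part of Daft's API.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, object]


def sample_with_related(
    parent: List[Row],
    child: List[Row],
    key: str,
    max_rows: int,
    seed: int,
) -> Tuple[List[Row], List[Row]]:
    """Sample up to `max_rows` parent rows, then keep the matching child rows.

    Hypothetical helper: the same `seed` always selects the same parent rows,
    so the child rows pulled in afterwards are repeatable as well.
    """
    rng = random.Random(seed)
    k = min(max_rows, len(parent))
    sampled_parent = rng.sample(parent, k)
    # Collect the join keys of the sampled rows and filter the other table.
    keys = {row[key] for row in sampled_parent}
    sampled_child = [row for row in child if row[key] in keys]
    return sampled_parent, sampled_child


# Example tables (made-up data for the sketch).
customers = [{"customer_id": i, "name": f"c{i}"} for i in range(100)]
orders = [{"order_id": i, "customer_id": i % 100} for i in range(500)]

sampled_customers, sampled_orders = sample_with_related(
    customers, orders, key="customer_id", max_rows=5, seed=42
)
```

Running the call again with the same seed yields the same customers and therefore the same orders, which is the repeatability the request describes.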
Would you like to implement a fix?
No