Replace infer_schema_length by infer_schema

josevalim commented 2 months ago

Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.

I propose:

infer_schema: true | false | non_neg_integer()

Where true enables, false disables, and the integer configures the length. The default can be the same as today.

cigrainger commented 2 months ago

I like this, but what would we use for all rows? IIUC true -> default (1000 rows).

josevalim commented 2 months ago

true means all rows.
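
To make the semantics settled above concrete, here is a minimal sketch of how the proposed option could translate back to the current convention (nil for all rows, 0 for disabled). It is written in Rust only because the native side of the thread is Rust; the `InferSchema` enum and the `to_legacy_length` function are hypothetical names for illustration, not part of Explorer or Polars.

```rust
/// Hypothetical mirror of the proposed
/// `infer_schema: true | false | non_neg_integer()` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InferSchema {
    /// `true`: infer the schema from every row.
    AllRows,
    /// `false`: do not infer the schema at all.
    Disabled,
    /// An integer: infer the schema from at most this many rows.
    Rows(usize),
}

/// Translate the proposed option into the legacy `infer_schema_length`
/// convention described in this thread: `None` (nil) means all rows
/// and `Some(0)` means inference is disabled.
pub fn to_legacy_length(opt: InferSchema) -> Option<usize> {
    match opt {
        InferSchema::AllRows => None,
        InferSchema::Disabled => Some(0),
        InferSchema::Rows(n) => Some(n),
    }
}
```

In this sketch `Rows(0)` and `Disabled` collapse to the same legacy value, which is one reason a real implementation might want to decide explicitly what an integer 0 should mean.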

lei0zhou commented 2 months ago

Thanks for improving this! Just sharing how DuckDB does it. It has two parameters:

auto_detect: true | false
sample_size: BIGINT (-1 means all rows; default 20480)

ref: CSV Import – DuckDB CSV Auto Detection – DuckDB

I am more than happy to take a stab at this

ceyhunkerti commented 1 month ago

> Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.
>
> I propose:
> infer_schema: true | false | non_neg_integer()
> Where true enables, false disables, and the integer configures the length. The default can be the same as today.

Is this only for CSV, or should we also change it for load_ndjson?
Also, one strange thing I didn't understand: the Polars side doesn't seem to have an option to disable schema inference for NDJSON.

👉🏼 Given that it takes Option<NonZeroUsize> to infer the schema, what I understand is:

if it's None, it will use the entire file
otherwise, it will use the given number of rows
it will fail at compile time if you give 0

    /// Set the JSON reader to infer the schema of the file. Currently, this is only used when reading from
    /// [`JsonFormat::JsonLines`], as [`JsonFormat::Json`] reads in the entire array anyway.
    ///
    /// When using [`JsonFormat::JsonLines`], `max_records = None` will read the entire buffer in order to infer the
    /// schema, `Some(1)` would look only at the first record, `Some(2)` the first two records, etc.
    ///
    /// It is an error to pass `max_records = Some(0)`, as a schema cannot be inferred from 0 records when deserializing
    /// from JSON (unlike CSVs, there is no header row to inspect for column names).
    pub fn infer_schema_len(mut self, max_records: Option<NonZeroUsize>) -> Self {
        self.infer_schema_len = max_records;
        self
    }
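
A small sketch of what a caller might do with a plain row count before handing it to a setter like the one above. The helper name `to_max_records` and its error message are hypothetical. One caveat it illustrates: `NonZeroUsize::new(0)` does not fail to compile, it returns `None` at runtime, so a naive conversion of 0 would end up meaning "read the entire buffer" rather than "disabled".

```rust
use std::num::NonZeroUsize;

/// Hypothetical helper: convert an optional row count into the
/// `Option<NonZeroUsize>` shape taken by `infer_schema_len`.
///
/// * `None`    -> `Ok(None)`: infer from the entire buffer.
/// * `Some(n)` -> `Ok(Some(n))` when `n > 0`.
/// * `Some(0)` -> `Err(..)`: rejected explicitly, because
///   `NonZeroUsize::new(0)` is `None` and would otherwise be
///   silently treated as "entire buffer".
pub fn to_max_records(rows: Option<usize>) -> Result<Option<NonZeroUsize>, String> {
    match rows {
        None => Ok(None),
        Some(n) => match NonZeroUsize::new(n) {
            Some(nz) => Ok(Some(nz)),
            None => Err("cannot infer a JSON schema from 0 records".to_string()),
        },
    }
}
```

Returning an error for 0 is only one possible choice; the point is that the "disabled" case has to be handled before the value reaches the reader, since the type itself cannot express it.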

elixir-explorer / explorer

Replace infer_schema_length by infer_schema #972