pdet closed this pull request 1 week ago
Please take into account this issue https://github.com/duckdb/duckdb/issues/6669 as that is what inspired the FindFirstNonNull method
This should still work, no?
Can you explain how? This PR pretty much entirely undoes https://github.com/duckdb/duckdb/pull/9811, which is the PR that was made to address this issue.
If the analyzer can only find nulls at the given offset, the resulting type is NULL. That reintroduces the problem of #6669, does it not?
Not really: if the values sniffed are all null and we did not sniff the full dataframe, it now defaults to VARCHAR.
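For illustration, a minimal snippet showing the behavior being described (not from the PR; the column name and sizes are made up):

```python
import duckdb
import pandas as pd

# Illustrative only: an object column whose sampled prefix is entirely null.
df = pd.DataFrame({"mostly_null": pd.Series([None] * 10_000, dtype="object")})

# With the change described above, the analyzer inspects only the sample,
# finds nothing but nulls, and falls back to VARCHAR rather than scanning
# the full column.
print(duckdb.sql("DESCRIBE SELECT * FROM df"))
```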
That just means that this does not break the referenced issue because the columns happen to be VARCHAR; that's not equivalent.
I'm fine merging this, and I agree that this regression is a problem, but we do have to be aware of the behavior we're throwing away here and the problems that it will inevitably cause.
Maybe there's a faster way of checking for NULL values instead of calling get_item for every row (which is very slow)? E.g. maybe we could use https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html - although ideally we wouldn't do that for the entire DataFrame range.
How about we do that as a fallback - if the result type is SQLNULL (i.e. we only found null values), we call isnull or similar to find any non-NULL values in the column (if there are any) and take their type instead.
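A minimal sketch of that fallback in Python (a hypothetical helper, not the actual analyzer code, which lives in C++):

```python
import pandas as pd

def sniff_object_column(col: pd.Series, sample_size: int) -> str:
    # Hypothetical sketch of the fallback described above.
    sample = col.iloc[:sample_size]
    if sample.isnull().all():
        # The sample was all-NULL: look for any non-null value in the
        # column and take its type instead of returning SQLNULL.
        non_null = col[col.notnull()]
        if non_null.empty:
            return "SQLNULL"  # the column really is entirely NULL
        return type(non_null.iloc[0]).__name__
    # Normal path: derive the type from the sampled non-null values.
    return type(sample.dropna().iloc[0]).__name__
```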
Sure, I'm happy to try/benchmark that :-)
@Mytherin I've modified the code to essentially use __getitem__(first_valid_index()). The time on the benchmark when executing a query on top of the NYC Taxi dataframe is now 0.22, so one order of magnitude faster than main, but one order of magnitude slower than just defaulting to varchar.
One thing to notice is that the query that is running does not require the columns that are object types. Maybe one thing we should consider doing is performing projection pushdown on the dataframe.
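Roughly, that lookup corresponds to the following Python; the real change lives in the C++ analyzer, so this is only a sketch:

```python
import pandas as pd

def first_non_null_value(col: pd.Series):
    # Sketch of the __getitem__(first_valid_index()) lookup described above.
    idx = col.first_valid_index()  # index label of the first non-null entry
    if idx is None:
        return None                # the column is entirely null
    return col.__getitem__(idx)    # equivalent to col[idx]
```

This replaces the per-row get_item loop with a single lookup, which is where the speedup over main comes from.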
The problem is that what we really need in that case is some callback for when we need the type of the column; then we could lazily figure out the type only when required. That's not really supported infrastructure-wise in the system right now - but it would definitely be interesting to add.
@Mytherin The CI here is failing because the previous implementation has a bug on the benchmark I've added
Ah yeah, good point. LGTM in that case
The Pandas Analyzer code would ignore the sample size limit for null values when sniffing data types from object type columns. In the case where we have an object column where most (or all) values are null, the whole column would be sniffed. Now the null values are not skipped, and if we sampled the dataframe, we upgrade the type from NULL to VARCHAR if only nulls were found.
This improves the scan of the NYC Taxi dataset from a dataframe by 2 orders of magnitude.
Old Time: 1.72
New Time: 0.025545666925609112
Sample benchmark:
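A sketch of such a benchmark (the parquet file name and the added all-null object column are assumptions, not the original script):

```python
import time

import duckdb
import pandas as pd

# Sketch of the benchmark described above; the parquet file name and the
# extra all-null object column are assumptions, not the original script.
df = pd.read_parquet("yellow_tripdata_2016-01.parquet")
df["pickup_notes"] = pd.Series([None] * len(df), dtype="object")

start = time.time()
duckdb.sql("SELECT count(*) FROM df").fetchall()
print(f"Time: {time.time() - start:.4f}s")
```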