duckdb / duckdb

DuckDB is an in-process SQL OLAP Database Management System
http://www.duckdb.org
MIT License

Modify the pandas analyzer code to always respect the sample size #12097

Closed: pdet closed this 1 week ago

pdet commented 2 weeks ago

The Pandas Analyzer code would ignore the sample size limit for null values when sniffing data types from object-type columns. For an object column where most (or all) values are null, the whole column would be sniffed.

Now null values are no longer skipped when counting toward the sample size, and if we sampled only part of the dataframe and found nothing but nulls, we upgrade the type from NULL to VARCHAR.
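
A minimal sketch of the new rule (illustrative Python only; the actual analyzer is C++ inside the client, and all names here are made up):

```python
import pandas as pd

def sniff_object_column(col: pd.Series, sample_size: int) -> str:
    """Illustrative sketch: nulls now count toward the sample, so at most
    sample_size rows are inspected no matter how many of them are null."""
    sampled_whole_column = sample_size >= len(col)
    sniffed_type = None
    for value in col.head(sample_size):
        if pd.isna(value):
            continue  # a null row no longer extends the scan
        sniffed_type = type(value).__name__  # stand-in for real type sniffing
        break
    if sniffed_type is None:
        # Only nulls were seen. If the sample did not cover the whole
        # column, upgrade NULL to VARCHAR rather than returning NULL.
        return "NULL" if sampled_whole_column else "VARCHAR"
    return sniffed_type
```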

This improves the scan of the NYC Taxi dataset from a dataframe by two orders of magnitude.

Old time: 1.72
New time: 0.025545666925609112

Sample benchmark:

```shell
wget "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2016-01.parquet" -O "tripdata.parquet"
```

```python
import pandas
import duckdb

# Load the NYC Taxi data and query it through DuckDB's dataframe replacement scan.
df = pandas.read_parquet("tripdata.parquet")
con = duckdb.connect()
sql = """
    select passenger_count, avg(tip_amount) as tip_amount
    from df
    where trip_distance < 5
    group by passenger_count
    order by passenger_count
"""
result = con.execute(sql).df()
```
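
A minimal way to reproduce the timings quoted above (the timing harness is assumed; it is not part of the original snippet):

```python
import time

start = time.perf_counter()
result = con.execute(sql).df()
print(f"query time: {time.perf_counter() - start:.4f} s")
```
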
Tishj commented 2 weeks ago

Please take into account this issue https://github.com/duckdb/duckdb/issues/6669, as that is what inspired the FindFirstNonNull method.

pdet commented 2 weeks ago

> Please take into account this issue #6669, as that is what inspired the FindFirstNonNull method.

This should still work, no?

Tishj commented 2 weeks ago

> > Please take into account this issue #6669, as that is what inspired the FindFirstNonNull method.
>
> This should still work, no?

Can you explain how? This PR pretty much entirely undoes https://github.com/duckdb/duckdb/pull/9811, which is the PR that was made to address that issue.

If the analyzer can only find nulls at the given offset, the resulting type is NULL. That reintroduces the problem of #6669, does it not?

pdet commented 2 weeks ago

> > > Please take into account this issue #6669, as that is what inspired the FindFirstNonNull method.
> >
> > This should still work, no?
>
> Can you explain how? This PR pretty much entirely undoes #9811, which is the PR that was made to address that issue.
>
> If the analyzer can only find nulls at the given offset, the resulting type is NULL. That reintroduces the problem of #6669, does it not?

Not really: if all the sniffed values are null and we did not sniff the full dataframe, it now defaults to VARCHAR.

Tishj commented 2 weeks ago

That just means this does not break the referenced issue because those columns happen to be VARCHAR; that's not equivalent.

I'm fine with merging this, and I agree that this regression is a problem, but we do have to be aware of the behavior we're throwing away here and the problems that this will inevitably cause.

Mytherin commented 2 weeks ago

Maybe there's a faster way of checking for NULL values instead of calling get_item for every row (which is very slow)?

Mytherin commented 2 weeks ago

e.g. maybe we could use https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html, although ideally we wouldn't do that for the entire DataFrame range.
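
For illustration, a hedged sketch of what such a vectorized check could look like (the column data and the `head(1000)` slice are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical object column that is mostly null.
col = pd.Series([None, None, 3.5, None], dtype="object")
# One vectorized pass instead of a Python-level get_item call per row;
# head() keeps the check away from the entire DataFrame range.
mask = pd.isnull(col.head(1000))
non_null_positions = np.flatnonzero(~mask.to_numpy())
first_non_null = col.iloc[non_null_positions[0]] if len(non_null_positions) else None
```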

Mytherin commented 2 weeks ago

How about we do that as a fallback: if the result type is SQLNULL (i.e. we only found null values), we call isnull or similar to find any non-NULL values in the column (if there are any) and take their type instead.

pdet commented 2 weeks ago

> How about we do that as a fallback: if the result type is SQLNULL (i.e. we only found null values), we call isnull or similar to find any non-NULL values in the column (if there are any) and take their type instead.

Sure, I'm happy to try/benchmark that :-)

pdet commented 2 weeks ago

@Mytherin I've modified the code to essentially use `__getitem__(first_valid_index())`. The time on the benchmark when executing a query on top of the NYC Taxi dataframe is now 0.22, so one order of magnitude faster than main, but one order of magnitude slower than just defaulting to VARCHAR.
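
For reference, the pandas calls this maps to look roughly like the following (a sketch; the sample column is an assumption):

```python
import pandas as pd

col = pd.Series([None, None, 3.5, None], dtype="object")
idx = col.first_valid_index()  # index label of the first non-null value, or None
if idx is not None:
    first_value = col[idx]  # the __getitem__(first_valid_index()) lookup
    # first_value's Python type then drives the sniffed DuckDB type
```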

One thing to notice is that the query being run does not require the object-typed columns. Maybe one thing we should consider doing is performing projection pushdown on the dataframe.

Mytherin commented 2 weeks ago

> One thing to notice is that the query being run does not require the object-typed columns. Maybe one thing we should consider doing is performing projection pushdown on the dataframe.

The problem is that what we really need in that case is some callback for when we need the type of the column; then we could lazily figure out the type only when required. That's not really supported infrastructure-wise in the system right now, but it would definitely be interesting to add.

pdet commented 1 week ago

@Mytherin The CI here is failing because the previous implementation has a bug that the benchmark I've added exposes.

Mytherin commented 1 week ago

Ah yeah, good point. LGTM in that case.