delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

Feature Request - Auto Analyze Table #581

Open devlucasc opened 3 years ago

devlucasc commented 3 years ago

Is it possible to enable an option to auto-analyze a Delta table?

For example: ANALYZE TABLE x COMPUTE STATISTICS FOR ALL COLUMNS
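(For illustration, a minimal sketch of what is being asked for, run by hand today; the table name "events" and the Spark session setup are assumptions, not part of the original request, and ANALYZE on a Delta table assumes a version where it is supported, which the rest of the thread discusses.)

```scala
// Minimal sketch, assuming a Spark + Delta setup where ANALYZE TABLE works on the table;
// the table name "events" is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("manual-analyze").getOrCreate()

// Append a batch of data to a Delta table...
spark.range(0, 1000).toDF("id")
  .write.format("delta").mode("append").saveAsTable("events")

// ...then run, manually, what this issue asks Delta to trigger automatically.
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR ALL COLUMNS")
```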

rubenssoto commented 3 years ago

It would be very good if Delta Lake OSS had column statistics for Parquet file pruning.

scottsand-db commented 2 years ago

We are currently reviewing this issue and will follow up shortly.

felipepessoto commented 2 years ago

@scottsand-db do you have any updates on this? Is it expected for the next release? Thanks

scottsand-db commented 2 years ago

Hi @felipepessoto, thanks for following up. Delta Lake 1.1 included per-column file stats collection + data skipping. Does this meet your needs?
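(A minimal sketch of the per-file stats + data skipping flow mentioned above; the table path and column are assumptions for illustration, not from the thread.)

```scala
// Sketch of Delta's per-file column stats + data skipping; the path is an assumption.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("delta-data-skipping").getOrCreate()
val path = "/tmp/delta/skipping_demo"

// Each file written here gets min/max/nullCount stats recorded in the Delta log as part
// of the commit. Range partitioning keeps per-file id ranges disjoint, so there is
// actually something to skip.
spark.range(0, 10000000).toDF("id")
  .repartitionByRange(100, col("id"))
  .write.format("delta").mode("overwrite").save(path)

// A selective filter lets the reader drop files whose [min, max] range for id cannot
// contain matching rows, before any Parquet footer is opened.
spark.read.format("delta").load(path)
  .where("id BETWEEN 42 AND 84")
  .count()
```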

felipepessoto commented 2 years ago

I need to test it. In my experiments with Parquet and Delta, ANALYZE TABLE made queries ~40% faster than both Parquet (without ANALYZE TABLE) and Delta.

felipepessoto commented 2 years ago

BTW, do you mean Delta 1.2? I don't see these changes in the 1.1 changelog.

scottsand-db commented 2 years ago

Yup, my bad. I meant 1.2.

felipepessoto commented 2 years ago

@scottsand-db in my test with 1.2 it didn't improve performance. Looking at the query plans, they are the same as in 1.1, except for PreparedDeltaFileIndex instead of TahoeLogFileIndex.

Are stats expected to improve performance for queries like this? https://github.com/Agirish/tpcds/blob/master/query93.sql (with inner queries, joins)

zsxwing commented 2 years ago

@felipepessoto did you re-generate the data for your tests? Stats are only written by Delta 1.2, so you would need to re-generate the data in order to leverage the data skipping improvements added in Delta 1.2.

felipepessoto commented 2 years ago

UPDATE: I found the stats (min, max, null count) in the Delta log, but I'm not sure why they are not being used during the query.

Yes, I regenerated it. Do you know how I can check the stats?

DESCRIBE EXTENDED doesn't show any:

[screenshot: DESCRIBE EXTENDED output for the Delta table, with no statistics shown]

If I do the same using Parquet, after running ANALYZE TABLE:

[screenshot: DESCRIBE EXTENDED output for the Parquet table after ANALYZE TABLE, with statistics shown]

BTW, do you know the differences between the two approaches, Delta stats and ANALYZE TABLE?

Thanks
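(For reference, one way to see the per-file stats mentioned above is to read the Delta log JSON directly; the table path below is an illustrative assumption.)

```scala
// Sketch: inspect the per-file stats Delta writes into the transaction log.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("inspect-delta-stats").getOrCreate()
val tablePath = "/tmp/delta/skipping_demo"  // hypothetical table location

// Each JSON commit contains `add` actions whose `stats` field is a JSON string holding
// numRecords, minValues, maxValues and nullCount for the indexed columns.
spark.read.json(s"$tablePath/_delta_log/*.json")
  .where("add IS NOT NULL")
  .selectExpr("add.path", "add.stats")
  .show(truncate = false)
```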

scottsand-db commented 2 years ago

Hi @felipepessoto. The main advantage of ANALYZE is that it computes different stats than "column stats + data skipping". It computes stats on unique/distinct values in a table. These stats are useful in improving the performance of JOINs.

Column stats on the other hand are used to help skip files during scans. So, if you perform a filter and then a JOIN, that is when column stats will help.

The reason Delta Lake doesn't currently support ANALYZE is that ANALYZE looks directly at the Parquet files. We would instead need to read the transaction log and use that metadata to tell us which Parquet files to read.

If I'm missing any details I'm sure that @zsxwing can fill them in.
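(To make the contrast concrete, a small hedged sketch follows; the tables sales and customers are hypothetical, and ANALYZE against a Delta table assumes a version where it is supported, which the rest of the thread discusses.)

```scala
// Sketch contrasting the two kinds of statistics discussed above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stats-contrast").getOrCreate()

// Table-level stats (row counts, distinct values per column) feed Spark's cost-based
// optimizer, e.g. when it picks join strategies or reorders joins.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

// Per-file min/max stats in the Delta log act at scan time: the date filter below can
// skip whole files before the join is executed.
spark.sql("""
  SELECT c.name, SUM(s.amount) AS total
  FROM sales s
  JOIN customers c ON s.customer_id = c.id
  WHERE s.sale_date = DATE'2022-05-01'
  GROUP BY c.name
""").show()
```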

felipepessoto commented 2 years ago

Some more questions, please:

  1. Would it be correct to say that Delta stats are file-wise while ANALYZE stats are table-wise?
  2. Is ANALYZE a Spark feature, while Delta stats are part of the Delta spec?
  3. Is there any work in progress to support ANALYZE for Delta?

I'm wondering how much effort it would be to change ANALYZE to work with Delta, reading from the transaction log as you said, or alternatively to change the Delta spec to add the other stats (if that makes sense).

zsxwing commented 2 years ago

  1. Would it be correct to say that Delta stats are file-wise while ANALYZE stats are table-wise?

Yep.

  2. Is ANALYZE a Spark feature, while Delta stats are part of the Delta spec?

Yep. ANALYZE is a Spark feature, and technically it only supports built-in file formats; today Spark doesn't provide such an interface for non-built-in data sources. In addition, it requires users to run ANALYZE by themselves, and it's easy to return incorrect answers when users forget to run ANALYZE.

  3. Is there any work in progress to support ANALYZE for Delta?

IIRC, you are basically asking whether Delta can support table-wise stats. If so, this is not on our roadmap, and it won't be an easy project. Unlike Parquet, Delta Lake needs to provide ACID, and updating the table-wise stats on every write with ACID guarantees is challenging.

An alternative, though not ideal, solution is to compute the table-wise stats from the file-wise stats when reading a Delta table. It would take a bit of time to compute the table-wise stats, so it won't be as fast as Parquet, but it would provide ACID, which is critical for Delta Lake users. This is just a brainstorm. Feel free to post your suggestions!
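(A rough sketch of what that brainstormed alternative could look like, derived from the per-file stats already in the log. The path is an assumption, only numRecords is parsed to keep it short, and a real implementation would read the current snapshot and honor remove actions rather than scanning raw commit files.)

```scala
// Rough sketch: derive table-wise stats from Delta's file-wise stats at read time.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("derived-table-stats").getOrCreate()
val tablePath = "/tmp/delta/skipping_demo"  // hypothetical table location

// Only the row count is parsed out of the stats JSON to keep the sketch short.
val statsSchema = new StructType().add("numRecords", LongType)

val addActions = spark.read.json(s"$tablePath/_delta_log/*.json")
  .where("add IS NOT NULL")
  .select(
    from_json(col("add.stats"), statsSchema).getField("numRecords").as("numRecords"),
    col("add.size").as("fileSize"))

addActions.agg(
  sum("numRecords").as("approxRowCount"),
  sum("fileSize").as("totalBytes"),
  count(lit(1)).as("numFiles")
).show()
```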

felipepessoto commented 1 year ago

@zsxwing, I was looking at this PR: https://github.com/delta-io/delta/pull/840, which falls back to v1 to implement ANALYZE TABLE.

It seems the only problem with analyzeTable is calculating the total size, because the calculateTotalSize method relies on catalogTable.storage.locationUri and scans everything inside it (or does something similar when the data is partitioned).

For count it seems fine to me:

sparkSession.table(tableIdentWithDB).count()

And computeColumnStats also looks good.

Do you think the same?
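(For what it's worth, the total-size part can also come from the transaction log itself rather than from listing catalogTable.storage.locationUri; a hedged sketch using Delta's DESCRIBE DETAIL, with a hypothetical table name.)

```scala
// Sketch: get file count and byte size from Delta metadata instead of listing the
// storage location; the table name "events" is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-table-size").getOrCreate()

// DESCRIBE DETAIL returns a single row whose columns include numFiles and sizeInBytes,
// computed from the transaction log.
spark.sql("DESCRIBE DETAIL events")
  .select("numFiles", "sizeInBytes")
  .show()
```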

zsxwing commented 1 year ago

@felipepessoto yep, you are right. Do you know how Spark uses these stats? Only for optimization, or does it also use the stats to return answers?

felipepessoto commented 1 year ago

From my experiment, it seems they are used only for optimization. I executed a SELECT COUNT(*) FROM TABLE query and Spark still scanned the Parquet files instead of using the data available in the stats to return the result. The same goes for MAX/MIN.
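(One way to double-check this observation is EXPLAIN COST, which prints the statistics attached to the optimized logical plan, so it is visible whether the ANALYZE results are picked up even though the physical plan still scans files; the table name is hypothetical.)

```scala
// Sketch: EXPLAIN COST shows the row-count/size statistics the optimizer sees per plan
// node; the table name "events" is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("explain-cost").getOrCreate()

spark.sql("EXPLAIN COST SELECT COUNT(*) FROM events")
  .show(truncate = false)
```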

felipepessoto commented 1 year ago

https://github.com/apache/spark/pull/41111