databrickslabs / dbignite


Bump delta-spark from 2.1.1 to 2.2.0 #18

Closed by dependabot[bot] 1 year ago

dependabot[bot] commented 1 year ago

Bumps delta-spark from 2.1.1 to 2.2.0.
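For projects that pin the dependency directly, the bump is a one-line change to the pin (the file name below is illustrative):

```
# requirements.txt (illustrative): pin the upgraded Delta Lake Python package
delta-spark==2.2.0
```

Note that delta-spark 2.2.0 targets Apache Spark 3.3; when submitting to a standalone cluster, matching JVM artifacts are published for both Scala 2.12 and Scala 2.13, as the release notes state.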

Release notes

Sourced from delta-spark's releases.

Delta Lake 2.2.0

We are excited to announce the release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features in this release are as follows:

  • LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT clauses by pushing down the LIMIT into the Delta scan during query planning. The Delta scan uses the LIMIT and file-level row counts to reduce the number of files scanned, which lets queries read far fewer files and can make LIMIT queries 10-100x faster depending on the table size.

  • Aggregate pushdown into Delta scan for SELECT COUNT(*). Aggregation queries such as SELECT COUNT(*) on Delta tables are satisfied using file-level row counts in Delta table metadata rather than counting rows in the underlying data files. This significantly reduces the query time as the query just needs to read the table metadata and could make full table count queries faster by 10-100x.

  • Support for collecting file-level statistics as part of the CONVERT TO DELTA command. These statistics can help speed up queries on the converted Delta table. Statistics are now collected by default as part of the CONVERT TO DELTA command; to disable statistics collection, specify the NO STATISTICS clause. Example: CONVERT TO DELTA table_name NO STATISTICS

  • Improve performance of the DELETE command by pruning the columns to read when searching for files to rewrite.

  • Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp, which was used by DynamoDB’s TTL feature to clean up expired items. This timestamp value has been fixed, and the table attribute has been renamed from commitTime to expireTime. If you already have TTL enabled, please follow the migration steps here.

  • Fix non-deterministic behavior during MERGE when working with sources that are non-deterministic.

  • Remove the restrictions on using Delta tables with column mapping in certain Streaming + CDF cases. Previously, Streaming + CDF was blocked if a Delta table had column mapping enabled, even when the table did not contain any renamed or dropped columns.

  • Other notable changes

    • Improve the monitoring of the Delta state construction queries (additional queries run as part of planning) by making them visible in the Spark UI.
    • Support for multiple where() calls in the Optimize Scala/Python API.
    • Support for passing Hadoop configurations via the DeltaTable API.
    • Support partition column names starting with . or _ in the CONVERT TO DELTA command.
    • Improvements to metrics in table history.
    • Fix for accidental protocol downgrades with RESTORE command. Until now, RESTORE TABLE may downgrade the protocol version of the table, which could have resulted in inconsistent reads with time travel. With this fix, the protocol version is never downgraded from the current one.
    • Fix a bug in MERGE INTO when there are multiple UPDATE clauses and one of them involves schema evolution.
    • Fix a bug where the active SparkSession object was sometimes not found when using Delta APIs.
    • Fix an issue where the partition schema couldn’t be set during the initial commit.
    • Catch exceptions when writing the last_checkpoint file fails.
    • Fix an issue when restarting a streaming query with the AvailableNow trigger on a Delta table.
    • Fix an issue with CDF and Streaming where the offset was not correctly updated when there are no data changes.
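The scan-pushdown and conversion features above can be sketched in SQL; the table and path names here are hypothetical, and the actual speedup depends on table size and available statistics:

```sql
-- LIMIT pushdown: file-level row counts prune the files scanned during planning
SELECT * FROM events LIMIT 10;

-- Aggregate pushdown: SELECT COUNT(*) is answered from Delta table metadata
-- instead of counting rows in the underlying data files
SELECT COUNT(*) FROM events;

-- CONVERT TO DELTA now collects file-level statistics by default;
-- the NO STATISTICS clause opts out of collection
CONVERT TO DELTA parquet.`/tmp/events_parquet`;
CONVERT TO DELTA parquet.`/tmp/events_parquet_fast` NO STATISTICS;
```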

Credits: Abhishek Somani, Adam Binford, Allison Portis, Amir Mor, Andreas Chatzistergiou, Anish Shrigondekar, Carl Fu, Carlos Peña, Chen Shuai, Christos Stavrakakis, Eric Maynard, Fabian Paul, Felipe Pessoto, Fredrik Klauss, Ganesh Chand, Hedi Bejaoui, Helge Brügner, Hussein Nagree, Ionut Boicu, Jackie Zhang, Jiaheng Tang, Jintao Shen, Jintian Liang, Joe Harris, Johan Lasperas, Jonas Irgens Kylling, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kam Cheung Ting, Karthik Subramanian, Kevin Neville, Lars Kroll, Lin Ma, Linhong Liu, Lukas Rupprecht, Max Gekk, Ming Dai, Mingliang Zhu, Nick Karpov, Ole Sasse, Paddy Xu, Patrick Marx, Prakhar Jain, Pranav, Rajesh Parangi, Ronald Zhang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Supun Nakandala, Thang Long Vu, Tom van Bussel, Tyson Condie, Venki Korukanti, Vitalii Li, Weitao Wen, Wenchen Fan, Xinyi, Yuming Wang, Zach Schuermann, Zainab Lawal, sherlockbeard (github id)

Delta Lake 2.2.0

We are excited to announce the preview release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

... (truncated)

Commits
  • 9bd46b4 Setting version to 2.2.0
  • 01a9739 Setting version to 2.2.0rc1
  • 35ff18e More changes to not publish the delta-iceberg module
  • fb037ec Upgrade version in integration tests
  • a87097c disable publishing the delta-iceberg artifact
  • a5fcec4 1385 - Collect statistics by default in ConvertToDelta & Update SQL API
  • 406e225 Update Delta version to 2.1
  • add6896 Minor formatting change
  • 80b1224 Minor refactoring to RoaringBitmapArraySuite.
  • b3ff96c Minor refactoring
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
  • `@dependabot rebase` will rebase this PR
  • `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
  • `@dependabot merge` will merge this PR after your CI passes on it
  • `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
  • `@dependabot cancel merge` will cancel a previously requested merge and block automerging
  • `@dependabot reopen` will reopen this PR if it is closed
  • `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
codecov[bot] commented 1 year ago

Codecov Report

Merging #18 (5de2081) into main (c0c6541) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##             main      #18   +/-   ##
=======================================
  Coverage   91.78%   91.78%           
=======================================
  Files           4        4           
  Lines         219      219           
=======================================
  Hits          201      201           
  Misses         18       18           
