delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.98k stars 365 forks source link

feat: add stats to convert-to-delta operation #2491

Closed gruuya closed 1 month ago

gruuya commented 1 month ago

Description

Collect stats during conversion of a parquet dir to a Delta table and add to the actions.

Related Issue(s)

Closes #2490

Documentation

github-actions[bot] commented 1 month ago

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

gruuya commented 1 month ago

I'm not sure why the parse decimal overflow appears in the Python tests.

It's indicative that I see the decimal column min/max ends up as Value(Number(2560.0))/Value(Number(3584.0)) even though the pyarrrow seems to be properly persisting the stats to the file

% parquet meta python/tests/part-0.parquet

File path:  python/tests/part-0.parquet
Created by: parquet-cpp-arrow version 15.0.2
Properties:
  ARROW:schema: /////5ADAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAA4AAAAwAwAA5AIAALACAAB8AgAASAIAABACAADgAQAAtAEAAIgBAABMAQAAHAEAAOgAAABkAAAABAAAABj9//8AAAEMFAAAABwAAAAEAAAAAQAAABQAAAAEAAAAbGlzdAAAAAAM/f//RP3//wAAAQIQAAAAGAAAAAQAAAAAAAAABAAAAGl0ZW0AAAAAfP3//wAAAAFAAAAAdP3//wAAAQ0YAAAAIAAAAAQAAAACAAAAPAAAABQAAAAGAAAAc3RydWN0AABs/f//pP3//wAAAQUQAAAAFAAAAAQAAAAAAAAAAQAAAHkAAACQ/f//yP3//wAAAQIQAAAAFAAAAAQAAAAAAAAAAQAAAHgAAAD8/f//AAAAAUAAAAD0/f//AAABChAAAAAcAAAABAAAAAAAAAAJAAAAdGltZXN0YW1wAAAA8v7//wAAAgAk/v//AAABCBAAAAAYAAAABAAAAAAAAAAGAAAAZGF0ZTMyAAAe////AAAAAFD+//8AAAEHEAAAACAAAAAEAAAAAAAAAAcAAABkZWNpbWFsAAgADAAEAAgACAAAAAUAAAADAAAAiP7//wAAAQQQAAAAGAAAAAQAAAAAAAAABgAAAGJpbmFyeQAAeP7//7D+//8AAAEGEAAAABgAAAAEAAAAAAAAAAQAAABib29sAAAAAKD+///Y/v//AAABAxAAAAAYAAAABAAAAAAAAAAHAAAAZmxvYXQ2NADS////AAACAAT///8AAAEDEAAAACAAAAAEAAAAAAAAAAcAAABmbG9hdDMyAAAABgAIAAYABgAAAAAAAQA4////AAABAhAAAAAYAAAABAAAAAAAAAAEAAAAaW50OAAAAABw////AAAAAQgAAABo////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MTYAAACg////AAAAARAAAACY////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MzIAAADQ////AAAAASAAAADI////AAABAhAAAAAgAAAABAAAAAAAAAAFAAAAaW50NjQAAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAEAAAAdXRmOAAAAAAEAAQABAAAAAAAAAA=
Schema:
message schema {
  optional binary utf8 (STRING);
...
  optional fixed_len_byte_array(3) decimal (DECIMAL(5,3));
...

Row group 0:  count: 5  288.20 B records  start: 4  total(compressed): 1.407 kB total(uncompressed):1.408 kB
--------------------------------------------------------------------------------
                   type      encodings count     avg size   nulls   min / max
utf8               BINARY    S _ R     5         16.00 B    0       "0" / "4"
...
decimal            FIXED[3] S _ R     5         17.00 B  0       "10.000" / "14.000"
...

EDIT: I think this was caused by an unidentified bug in the statistics parser that is resolved by 5972aab07723fe11243c017f1938b96b70d45810

gruuya commented 1 month ago

Hi @roeap, @ion-elgreco, I think this is ready for review. There's both a feat added (convert stats) as well as a bugfix (decimal stats parsing from byte arrays) now.

Can you provide some feedback on it? Thanks!

ion-elgreco commented 1 month ago

Thankss

ion-elgreco commented 1 month ago

@gruuya can you rebase so that we can merge

gruuya commented 1 month ago

@ion-elgreco weirdly a couple of tests seem to be stuck, and have been running close to 2h hours now (test_concurrency_local, test_integration_local). I presume this is a CI problem of some sort (the previous runs were ok), can you restart the job please?

ion-elgreco commented 1 month ago

@gruuya yeah it's a flaky test, re-executed it