datafusion-contrib / datafusion-orc

Implementation of the Apache ORC file format using the Apache Arrow in-memory format
Apache License 2.0

refactor: refactor byte/boolean iter #29

Closed WenyXu closed 7 months ago

WenyXu commented 7 months ago

I'm trying to implement the list datatype. However, I found that our present iterator relies on the stripe's `number_of_rows`, which forces nested lists to rely on it as well.

codecov[bot] commented 7 months ago

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (feb932c) 76.32% compared to head (d438c1d) 76.02%. Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
- Coverage   76.32%   76.02%   -0.31%
==========================================
  Files          32       32
  Lines        3215     3124      -91
==========================================
- Hits         2454     2375      -79
+ Misses        761      749      -12
```

| [Flag](https://app.codecov.io/gh/datafusion-contrib/datafusion-orc/pull/29/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datafusion-contrib) | Coverage Δ | |
|---|---|---|
| [rust](https://app.codecov.io/gh/datafusion-contrib/datafusion-orc/pull/29/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datafusion-contrib) | `76.02% <97.05%> (-0.31%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datafusion-contrib#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/datafusion-contrib/datafusion-orc/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datafusion-contrib) | Coverage Δ | |
|---|---|---|
| [src/arrow\_reader/column/boolean.rs](https://app.codecov.io/gh/datafusion-contrib/datafusion-orc/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datafusion-contrib#diff-c3JjL2Fycm93X3JlYWRlci9jb2x1bW4vYm9vbGVhbi5ycw==) | `100.00% <100.00%> (ø)` | |
| [src/arrow\_reader/column/present.rs](https://app.codecov.io/gh/datafusion-contrib/datafusion-orc/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datafusion-contrib#diff-c3JjL2Fycm93X3JlYWRlci9jb2x1bW4vcHJlc2VudC5ycw==) | `100.00% <100.00%> (ø)` | |
| [src/arrow\_reader/column/tinyint.rs](https://app.codecov.io/gh/datafusion-contrib/datafusion-orc/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datafusion-contrib#diff-c3JjL2Fycm93X3JlYWRlci9jb2x1bW4vdGlueWludC5ycw==) | `100.00% <100.00%> (ø)` | |
| [src/reader/decode/byte\_rle.rs](https://app.codecov.io/gh/datafusion-contrib/datafusion-orc/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datafusion-contrib#diff-c3JjL3JlYWRlci9kZWNvZGUvYnl0ZV9ybGUucnM=) | `95.45% <100.00%> (+0.64%)` | :arrow_up: |
| [src/reader/decode/boolean\_rle.rs](https://app.codecov.io/gh/datafusion-contrib/datafusion-orc/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datafusion-contrib#diff-c3JjL3JlYWRlci9kZWNvZGUvYm9vbGVhbl9ybGUucnM=) | `93.93% <96.15%> (+4.65%)` | :arrow_up: |


WenyXu commented 7 months ago

> How does boolean iter work here if the no. of elements encoded within isn't a multiple of 8? Is there a check somewhere else?

It reads the next byte until it reaches the end of the buffer.

Jefffrey commented 7 months ago

> How does boolean iter work here if the no. of elements encoded within isn't a multiple of 8? Is there a check somewhere else?
>
> It reads the next byte until it reaches the end of the buffer.

So it'll emit extra false values for the padding bits that the caller will have to deal with, e.g. via take()?

WenyXu commented 7 months ago

> How does boolean iter work here if the no. of elements encoded within isn't a multiple of 8? Is there a check somewhere else?
>
> It reads the next byte until it reaches the end of the buffer.
>
> So it'll emit extra false values for the padding bits that the caller will have to deal with, e.g. via take()?

Yes. The caller only calls `next` exactly as many times as there are rows.
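
For readers following the thread, the behaviour discussed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `BitIter` and `decode_rows` are hypothetical names, the byte RLE layer that sits underneath the real boolean decoder is skipped, and the bits are assumed to be unpacked most-significant-bit first.

```rust
/// Hypothetical stand-in for the boolean iterator discussed in this thread.
/// It unpacks every byte into 8 bools and only stops when the bytes run out,
/// so it knows nothing about the number of rows.
struct BitIter<I: Iterator<Item = u8>> {
    bytes: I,
    current: u8,
    bits_left: u8,
}

impl<I: Iterator<Item = u8>> BitIter<I> {
    fn new(bytes: I) -> Self {
        Self { bytes, current: 0, bits_left: 0 }
    }
}

impl<I: Iterator<Item = u8>> Iterator for BitIter<I> {
    type Item = bool;

    fn next(&mut self) -> Option<bool> {
        if self.bits_left == 0 {
            // Fetch the next byte; returns None once the buffer is exhausted.
            self.current = self.bytes.next()?;
            self.bits_left = 8;
        }
        self.bits_left -= 1;
        // Assumption: most significant bit first.
        Some((self.current >> self.bits_left) & 1 == 1)
    }
}

/// The caller bounds the iterator with the row count, dropping padding bits.
fn decode_rows(bytes: &[u8], number_of_rows: usize) -> Vec<bool> {
    BitIter::new(bytes.iter().copied())
        .take(number_of_rows)
        .collect()
}

fn main() {
    // 10 rows packed into two bytes; the low 6 bits of the second byte are padding.
    let bytes = [0b1010_0001u8, 0b1100_0000];
    let unbounded = BitIter::new(bytes.iter().copied()).count();
    let rows = decode_rows(&bytes, 10);
    println!("{} bits without take, {} rows with take", unbounded, rows.len());
}
```

Keeping the row count out of the iterator is what lets a nested column reuse it: the caller of a list's child column can pass the child element count to `take` instead of the stripe's row count.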