apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0
477 stars 97 forks source link

feat: Project transform #309

Closed marvinlanhenke closed 3 months ago

marvinlanhenke commented 3 months ago

Which issue does this PR close?

Closes #264

Rationale for this change

The ability to project row_filter to Transform partition. This will unblock the Manifest & PartitionEvaluator, which will enable the pruning of manifest files in fn plan_files.

What changes are included in this PR?

Are these changes tested?

Yes. Unit tests are included.

marvinlanhenke commented 3 months ago

@liurenjie1024 @ZENOTME @sdd PTAL

marvinlanhenke commented 3 months ago

LGTM, with the caveat that I'm not an expert on this part of the spec, but the PR seems well-structured.

I wouldn't mind seeing more comments in the tests though to explain the "why" aspect of each test.

Thanks for your feedback. If the others approve "the correctness" of the PR - I'll add those comments.

marvinlanhenke commented 3 months ago

Thanks for picking this up @marvinlanhenke

The most important part of adding projection is making that they are absolutely correct. If rust would generate something different than Java of Python, leads to data incorrectness since it would not correctly evaluate the partition predicates.

@Fokko Thanks for the extensive review. Not only did I miss some test but the complete implementation for Dates : https://github.com/apache/iceberg/blob/d350c9b8c995a2953aa8b80a0a1fc7cadc4dd16a/api/src/main/java/org/apache/iceberg/transforms/Dates.java#L123 // I was just looking for the transforms for Years, Months, etc. So thank you for hinting me at this - I'll implement this as well - and try to port the Java Testsuite.

marvinlanhenke commented 3 months ago

@Fokko Is it correct to assume, as far as I understood the Java implementation, that we support dates projection only for year, month, and day? Also adjusting the boundary only works with integer days? I'm asking because in my current implementation I only adjust the boundaries for PrimitiveLiteral::Date(i32) and have no support for PrimitiveLiteral::Time(i64 etc.

@liurenjie1024 I converted this to a draft - since I had to change the design. I'll push this draft, so you can verify. Basically, I got rid of the trait and implemented boundary on Datum itself - which makes more sense to me, now that I have a better overall picture with the dates transformations in mind.

Once, we agree on the overall implementation - I'll start porting the testsuite.

liurenjie1024 commented 3 months ago

Basically, I got rid of the trait and implemented boundary on Datum itself - which makes more sense to me, now that I have a better overall picture with the dates transformations in mind.

@marvinlanhenke Thanks for this, I also feel that an boundary trait is a little over design, and implementing them directly on Datum seems better to me.

marvinlanhenke commented 3 months ago

@liurenjie1024 I think I covered most of your suggestions. PTAL if the overall design and implementation in general is fine?

Unresolved Issues / Todos:

(*) I'd be interested in the approach here? simply put a helper fn into the corresponding module e.g. fn project_bucket_transform into bucket.rs / or as an associated function?


*due to easter-holidays (and my two children) - I can probably do only some minimal work on this over the weekend

liurenjie1024 commented 3 months ago

Hi, @marvinlanhenke Thanks for your contribution, the overall design looks good to me! Please take your time and enjoy time with family😁

marvinlanhenke commented 3 months ago

@Fokko @liurenjie1024 I ported most of the test (only timestamp_hours are missing) and the projection seems to be handled correctly. However, in some cases for the timestamp transforms I had trouble verifying the results.

So, I want through the (painful :P) process of setting up the Java repo and debugged some of the deviating tests. I'm pretty sure now, that we handle the timestamp conversion into days differently than the java implementation, which leads to different results.

For example:

Long date = (long) Literal.of("1969-01-01T00:00:00.00000").to(TYPE).value();

evaluates to -31_536_000_000_000.

With PredicateOperator GT the boundary is adjusted by +1L which results in -31_535_999_999_999.

Up until this point everything is fine in Java as well as in our Rust implementation. Then the Transform::Day is applied. Java results in -365, whereas Rust outputs -364 days. From what I've seen the issue is here, when we downcast the result from f64 to i32 and we're off by one.

    let before:i64 = -31_535_999_999_999;
    let after = (before as f64) / 1000.0 / 1000.0 * DAY_PER_SECOND;
    // -364.9999999999884
    // -364 as i32

I guess, this is what I have to understand now? (...and probably create another issue)

marvinlanhenke commented 3 months ago

@Fokko @liurenjie1024 PTAL

I ported all the tests - and fixed #311. Now, all tests are passing and align with the Java implementation. I think we only can optimize the design/ structure by perhaps moving the test-suite to the respective transforms (e.g. bucket.rs, etc)? Other than that I think we are good to go?

liurenjie1024 commented 3 months ago

@Fokko @liurenjie1024 PTAL

I ported all the tests - and fixed #311. Now, all tests are passing and align with the Java implementation. I think we only can optimize the design/ structure by perhaps moving the test-suite to the respective transforms (e.g. bucket.rs, etc)? Other than that I think we are good to go?

Hi, @marvinlanhenke Thanks, I'll take a careful look of this pr. About moving the specific logic into respective transforms, do you plan to do it in this pr or in following prs?

marvinlanhenke commented 3 months ago

@Fokko @liurenjie1024 PTAL I ported all the tests - and fixed #311. Now, all tests are passing and align with the Java implementation. I think we only can optimize the design/ structure by perhaps moving the test-suite to the respective transforms (e.g. bucket.rs, etc)? Other than that I think we are good to go?

Hi, @marvinlanhenke Thanks, I'll take a careful look of this pr. About moving the specific logic into respective transforms, do you plan to do it in this pr or in following prs?

... I haven't made up my mind yet. I think we can merge this if its okay, in order to unblock #253?

Also I'm not so sure anymore we have to move the logic at all? Since most of the helper functions handle logic that is not dependent of the type of transform? Perhaps you can outline the refactor high-level what you have in mind?

liurenjie1024 commented 3 months ago

Hi, @marvinlanhenke I skimmed through the code and it seems that we only need to split tests into separate modules to make it easier to maintain and read. But I agree that we can merge it first to unblock following features, so it's up to you.

marvinlanhenke commented 3 months ago

Hi, @marvinlanhenke I skimmed through the code and it seems that we only need to split tests into separate modules to make it easier to maintain and read. But I agree that we can merge it first to unblock following features, so it's up to you.

Take your time while reviewing in the meantime - I'll open another draft with a possible refactor and we can compare whats best? I 'm thinking of splitting the code into Identity, Bucket, Truncate and Temporal logic - including the tests. This however might introduce some minor code duplication (I'll have to try and see for myself though, now that we have all the tests - it should be much easier to verify)

EDIT: @liurenjie1024 I've taken a look again and I don't think a refactor makes sense here (except moving the tests). Since so many transforms are handled the same way, implementing project on each transform would lead to code duplication, and a bigger maintenance burden (also I'd have to create a single match arm for each enum variant). So I think the approach with the helper functions on Transform itself is fine, as it allows for code locallity and an overall more generic approach (combining the same logic).

marvinlanhenke commented 3 months ago

Hi, @marvinlanhenke Thanks for pr, it looks great! I have some small suggestion to restructure the code to make it easier for review. Really greatful for these tests!

Thanks for the review, I'll get to your suggestions - those should be easy to fix.

liurenjie1024 commented 3 months ago

cc @Fokko Do you have other comments?

marvinlanhenke commented 3 months ago

Hi, @marvinlanhenke Thanks for pr, it looks great! I have some small suggestion to restructure the code to make it easier for review. Really greatful for these tests!

@liurenjie1024 @Fokko Thanks for the review - I did some minor fixes according to your suggestions. If no other comments, I think we're good to go.