apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Bloom filters for i8 and i16 always return false negatives #5550

Open · progval opened this issue 7 months ago

progval commented 7 months ago

Describe the bug

It is unclear to me whether the issue is in building or in checking the Bloom filter; either way, a Bloom filter built from i8 or i16 values (as opposed to i32 or i64) always returns false when checked against those same values.

To Reproduce

diff --git a/parquet/src/arrow/arrow_writer/mod.rs b/parquet/src/arrow/arrow_writer/mod.rs
index 18c8617e07..d6b14e2899 100644
--- a/parquet/src/arrow/arrow_writer/mod.rs
+++ b/parquet/src/arrow/arrow_writer/mod.rs
@@ -2039,6 +2039,36 @@ mod tests {
         values_required::<BinaryArray, _>(many_vecs_iter);
     }

+    #[test]
+    fn i8_column_bloom_filter() {
+        let array = Arc::new(Int8Array::from_iter(0..SMALL_SIZE as i8));
+        let mut options = RoundTripOptions::new(array, false);
+        options.bloom_filter = true;
+
+        let files = one_column_roundtrip_with_options(options);
+        check_bloom_filter(
+            files,
+            "col".to_string(),
+            (0..SMALL_SIZE as i8).collect(),
+            (SMALL_SIZE as i8 + 1..SMALL_SIZE as i8 + 10).collect(),
+        );
+    }
+
+    #[test]
+    fn i16_column_bloom_filter() {
+        let array = Arc::new(Int16Array::from_iter(0..SMALL_SIZE as i16));
+        let mut options = RoundTripOptions::new(array, false);
+        options.bloom_filter = true;
+
+        let files = one_column_roundtrip_with_options(options);
+        check_bloom_filter(
+            files,
+            "col".to_string(),
+            (0..SMALL_SIZE as i16).collect(),
+            (SMALL_SIZE as i16 + 1..SMALL_SIZE as i16 + 10).collect(),
+        );
+    }
+
     #[test]
     fn i32_column_bloom_filter() {
         let array = Arc::new(Int32Array::from_iter(0..SMALL_SIZE as i32));
@@ -2054,6 +2084,21 @@ mod tests {
         );
     }

+    #[test]
+    fn i64_column_bloom_filter() {
+        let array = Arc::new(Int64Array::from_iter(0..SMALL_SIZE as i64));
+        let mut options = RoundTripOptions::new(array, false);
+        options.bloom_filter = true;
+
+        let files = one_column_roundtrip_with_options(options);
+        check_bloom_filter(
+            files,
+            "col".to_string(),
+            (0..SMALL_SIZE as i64).collect(),
+            (SMALL_SIZE as i64 + 1..SMALL_SIZE as i64 + 10).collect(),
+        );
+    }
+
     #[test]
     fn binary_column_bloom_filter() {
         let one_vec: Vec<u8> = (0..SMALL_SIZE as u8).collect();

Running these tests returns:

failures:

---- arrow::arrow_writer::tests::i16_column_bloom_filter stdout ----
thread 'arrow::arrow_writer::tests::i16_column_bloom_filter' panicked at parquet/src/arrow/arrow_writer/mod.rs:1792:17:
Value [0, 0] should be in bloom filter

---- arrow::arrow_writer::tests::i8_column_bloom_filter stdout ----
thread 'arrow::arrow_writer::tests::i8_column_bloom_filter' panicked at parquet/src/arrow/arrow_writer/mod.rs:1792:17:
Value [0] should be in bloom filter

failures:
    arrow::arrow_writer::tests::i16_column_bloom_filter
    arrow::arrow_writer::tests::i8_column_bloom_filter

Expected behavior

These tests should pass.

Additional context

I found this via DataFusion: https://github.com/apache/arrow-datafusion/issues/9779

tustvold commented 7 months ago

I suspect this is an issue with sign extension: Parquet doesn't natively support these types, so they must be widened to Int32.

mr-brobot commented 7 months ago

I started looking into this. Hoping to have a PR open in the next few days. :)

mr-brobot commented 7 months ago

As @tustvold mentioned, this appears to be caused by widening Int8 and Int16 columns to Int32 when writing, then later checking the bloom filter with the unwidened type, resulting in different hashes.
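
A minimal, std-only sketch of the byte-level mismatch (this illustrates why the hashes diverge, assuming, as discussed in this thread, that the filter hashes a value's raw byte representation; it is not the crate's actual hashing code):

fn main() {
    let narrow: i16 = -1;
    let widened: i32 = narrow as i32; // sign-extending widen, as the writer does

    // The bloom filter hashes raw bytes, so these two representations hash
    // differently even though they denote the same number.
    assert_eq!(narrow.to_le_bytes(), [0xFF, 0xFF]);
    assert_eq!(widened.to_le_bytes(), [0xFF, 0xFF, 0xFF, 0xFF]);
}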

I think a general solution would be to apply the same Arrow-to-Parquet type coercion that occurs on write prior to checking the bloom filter.

mr-brobot commented 7 months ago

Overview

The general issue is that Parquet types are a subset of Arrow types, so the Arrow writer must coerce Arrow values to Parquet types. In some cases, this changes the physical representation, so bloom filter checks can produce false negatives when passed the original Arrow representation. Correctness is only guaranteed when checking with the coerced Parquet value.

For the issue mentioned above, Parquet only supports Int32 and Int64 integer types:

> 16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding

Int8 and Int16 Arrow types are widened to Int32 when writing. For correct bloom filter behavior, they must be widened prior to calling Sbbf::check. Widening before checking causes the tests mentioned above to pass.
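
From the caller's side, the fix looks something like the sketch below; check_i16 is a hypothetical helper, and it assumes the public Sbbf::check API, which accepts any value implementing the crate's AsBytes trait:

use parquet::bloom_filter::Sbbf;

/// Hypothetical helper: probe a bloom filter written by the ArrowWriter
/// for an Int16 column, which was widened to Int32 on write.
fn check_i16(sbbf: &Sbbf, value: i16) -> bool {
    // Widen with sign extension to match the physical value that was hashed.
    sbbf.check(&(value as i32))
}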

However, this issue is not limited to Int8 and Int16. For example, when writing Date64, values are coerced to Date32 and then to Int32. This changes the physical representation from Int64 "milliseconds since epoch" to Int32 "days since epoch".

#[test]
fn date64_column_bloom_filter() {
    use chrono::NaiveDate;

    // construct input data
    let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
    let dates = vec![
        NaiveDate::from_ymd_opt(1915, 11, 25).unwrap(),
        NaiveDate::from_ymd_opt(1964, 4, 14).unwrap(),
        NaiveDate::from_ymd_opt(1992, 9, 2).unwrap(),
        NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
    ];
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_milliseconds())
        .collect();

    // write (asserts written values match input)
    let array = Arc::new(Date64Array::from(values.clone()));
    let mut options = RoundTripOptions::new(array, false);
    options.bloom_filter = true;
    let files = one_column_roundtrip_with_options(options);

    // checks bloom filter with original input
    // fails with message: Value [0, 100, 252, 121, 114, 254, 255, 255] should be in bloom filter        
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i64).collect(),
    );

    // convert to "days since epoch"
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_days() as i32)
        .collect();

    // passes
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i64).collect(),
    );
}

I think the same would apply to decimal and interval types.

Solution

This only affects scenarios where Parquet was written via ArrowWriter. The solution should not impact scenarios where the Parquet reader/writer APIs are used directly.

Since the SBBF is checked by the crate user, part of the solution involves documenting how the Arrow-Parquet type coercion affects the physical representation and, therefore, the bloom filter. However, I think we can make some changes to improve the safety/convenience of working with bloom filters. Some options:

  1. Introduce an arrow_value_to_parquet_value function for callers to use prior to calling Sbbf::check.
  2. Introduce a new function, perhaps named arrow_sbbf_check, that accepts an Arrow array and a bloom filter. This function would map Arrow values to the corresponding Parquet physical type before checking the bloom filter, returning a BooleanArray (see the sketch below).
  3. Optionally use the bloom filter in the Arrow reader, performing the same type coercion that occurs in the Arrow writer. This way, Arrow users never need to use the Sbbf API directly. There is a similar conversation happening regarding the C++ implementation.

The first two options are simple but still push some complexity to Arrow users. I think the last option is ideal but definitely a much larger effort. But, these options aren't mutually exclusive. Would love to hear other thoughts on this. :)
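
For concreteness, here is a rough sketch of what option 2 might look like, handling only the integer coercions discussed above; the dispatch on DataType and the set of handled types are illustrative, and a real implementation would cover the writer's full coercion table:

use arrow_array::cast::AsArray;
use arrow_array::types::{Int16Type, Int32Type, Int8Type};
use arrow_array::{Array, BooleanArray};
use arrow_schema::DataType;
use parquet::bloom_filter::Sbbf;

/// Illustrative sketch of option 2: coerce each Arrow value to its Parquet
/// physical representation before probing the filter.
fn arrow_sbbf_check(array: &dyn Array, sbbf: &Sbbf) -> BooleanArray {
    match array.data_type() {
        // Int8/Int16 are widened to Int32 on write, so widen before checking.
        DataType::Int8 => array
            .as_primitive::<Int8Type>()
            .iter()
            .map(|v| v.map(|v| sbbf.check(&(v as i32))))
            .collect(),
        DataType::Int16 => array
            .as_primitive::<Int16Type>()
            .iter()
            .map(|v| v.map(|v| sbbf.check(&(v as i32))))
            .collect(),
        // Int32 is stored as-is, so no coercion is needed.
        DataType::Int32 => array
            .as_primitive::<Int32Type>()
            .iter()
            .map(|v| v.map(|v| sbbf.check(&v)))
            .collect(),
        dt => unimplemented!("coercion for {dt} not covered in this sketch"),
    }
}

Date64, decimal, and interval columns would add arms applying the corresponding physical conversions before probing.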

alamb commented 7 months ago

FWIW it appears that the same issue applies to unsigned u8 and u16; see https://github.com/apache/arrow-datafusion/pull/9770

alamb commented 7 months ago

Thank you for the analysis @mr-brobot. I think your suggestions 1 or 2 in https://github.com/apache/arrow-rs/issues/5550#issuecomment-2028482716 sound reasonable to me (as that function would be needed even if we were to implement 3).

cc @Ted-Jiang

mr-brobot commented 7 months ago

Ok, I'll move forward with options 1 and 2. Will let folks decide if we keep both or just one in the PR discussion. 🙂

I'll also open a separate issue for option 3. I think that topic deserves a dedicated conversation thread.

I'm traveling this week but should have time to make progress this weekend. 🙂