progval opened this issue 7 months ago
I suspect this is probably an issue with sign extension, as parquet doesn't natively support these types and they must be widened to int32
I started looking into this. Hoping to have a PR open in the next few days. :)
As @tustvold mentioned, this appears to be caused by widening `Int8` and `Int16` columns to `Int32` when writing, then later checking the bloom filter with the unwidened type, resulting in different hashes.
I think a general solution would be to apply, prior to checking the bloom filter, the same Arrow-to-Parquet type coercion that occurs on write.
The general issue is that Parquet types are a subset of Arrow types, so the Arrow writer must coerce to Parquet types. In some cases, this changes the physical representation. Thus, bloom filter checks might produce false negatives if passing the original Arrow representation. Correctness is only guaranteed when checking with the coerced Parquet value.
For the issue mentioned above, Parquet only supports `Int32` and `Int64` integer types:

> 16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding
`Int8` and `Int16` Arrow types are widened to `Int32` when writing. For correct bloom filter behavior, they must be widened prior to calling `Sbbf::check`. Widening before checking causes the tests mentioned above to pass.
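A minimal sketch of the widening fix, assuming `sbbf` is the `Sbbf` read back from a file that `ArrowWriter` produced from an `Int8` column containing the value 42:

```rust
let value: i8 = 42;

// Checking with the original i8 hashes its 1-byte representation, which does
// not match the Int32 bytes hashed on write: a false negative (barring a
// coincidental hash collision).
let _found = sbbf.check(&value);

// Widening to the Int32 physical representation matches what was written.
assert!(sbbf.check(&(value as i32)));
```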
However, this issue is not limited to `Int8` and `Int16`. For example, when writing `Date64`, values are coerced to `Date32` and then to `Int32`. This changes the physical representation from `Int64` "milliseconds since epoch" to `Int32` "days since epoch".
```rust
#[test]
fn date64_column_bloom_filter() {
    use chrono::NaiveDate;

    // construct input data
    let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
    let dates = vec![
        NaiveDate::from_ymd_opt(1915, 11, 25).unwrap(),
        NaiveDate::from_ymd_opt(1964, 4, 14).unwrap(),
        NaiveDate::from_ymd_opt(1992, 9, 2).unwrap(),
        NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
    ];
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_milliseconds())
        .collect();

    // write (asserts written values match input)
    let array = Arc::new(Date64Array::from(values.clone()));
    let mut options = RoundTripOptions::new(array, false);
    options.bloom_filter = true;
    let files = one_column_roundtrip_with_options(options);

    // checks bloom filter with original "milliseconds since epoch" input
    // fails with message: Value [0, 100, 252, 121, 114, 254, 255, 255] should be in bloom filter
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i64).collect(),
    );

    // convert to "days since epoch"
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_days() as i32)
        .collect();

    // passes
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i32).collect(),
    );
}
```
I think the same would apply to decimal and interval types.
This only affects scenarios where Parquet was written via `ArrowWriter`. The solution should not impact scenarios where the Parquet reader/writer APIs are used directly.
Since the SBBF is checked by the crate user, part of the solution involves documenting how the Arrow-Parquet type coercion affects the physical representation and, therefore, the bloom filter. However, I think we can make some changes to improve the safety/convenience of working with bloom filters. Some options:

1. An `arrow_value_to_parquet_value` function for callers to use prior to calling `Sbbf::check`.
2. A new method, e.g. `arrow_sbbf_check`, that accepts an Arrow array and bloom filter. This method would map the Arrow type to the correct Parquet type before checking the bloom filter, returning a `BooleanArray` (see the sketch after this list).
3. Supporting Arrow types in the `Sbbf` API directly. There is a similar conversation happening regarding the C++ implementation.

The first two options are simple but still push some complexity to Arrow users. I think the last option is ideal but definitely a much larger effort. But, these options aren't mutually exclusive. Would love to hear other thoughts on this. :)
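A rough sketch of what option 2 could look like, covering only the two coercions discussed above; `arrow_sbbf_check` and its signature are hypothetical, not an existing parquet API:

```rust
use arrow_array::cast::AsArray;
use arrow_array::types::{Date64Type, Int8Type};
use arrow_array::{Array, BooleanArray};
use arrow_schema::DataType;
use parquet::bloom_filter::Sbbf;

/// Hypothetical helper: coerce each Arrow value to its Parquet physical
/// representation before checking the SBBF. Only Int8 and Date64 are
/// sketched here; a real version would cover every coerced type.
fn arrow_sbbf_check(array: &dyn Array, sbbf: &Sbbf) -> BooleanArray {
    match array.data_type() {
        // Int8 is widened to the Int32 physical type on write
        DataType::Int8 => array
            .as_primitive::<Int8Type>()
            .iter()
            .map(|v| v.map(|v| sbbf.check(&(v as i32))))
            .collect(),
        // Date64 (ms since epoch) is stored as Date32 (days since epoch);
        // truncating division mirrors num_days() for whole-day values
        DataType::Date64 => array
            .as_primitive::<Date64Type>()
            .iter()
            .map(|v| v.map(|ms| sbbf.check(&((ms / 86_400_000) as i32))))
            .collect(),
        other => unimplemented!("coercion for {other:?} omitted from this sketch"),
    }
}
```

Nulls map to nulls in the returned `BooleanArray`, so the result keeps the shape of the input array.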
FWIW it appears that the same issue applies to unsigned `u8` and `u16`; see https://github.com/apache/arrow-datafusion/pull/9770.
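The unsigned case would follow the same pattern; a brief sketch under the same assumptions as the earlier `sbbf` example (note that `u8 as i32` zero-extends where `i8 as i32` sign-extends, in both cases matching the value the writer stores):

```rust
let unsigned: u8 = 200;

// UInt8 is annotated on the Int32 physical type, so the stored value is the
// zero-extended 200, not a reinterpretation of the raw byte.
assert!(sbbf.check(&(unsigned as i32)));
```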
Thank you for the analysis @mr-brobot. I think your suggestions 1 or 2 in https://github.com/apache/arrow-rs/issues/5550#issuecomment-2028482716 sound reasonable to me (as that function would be needed even if we were to implement 3).
cc @Ted-Jiang
Ok, I'll move forward with options 1 and 2. Will let folks decide if we keep both or just one in the PR discussion. 🙂
I'll also open a separate issue for option 3. I think that topic deserves a dedicated conversation thread.
I'm traveling this week but should have time to make progress this weekend. 🙂
**Describe the bug**

It is unclear to me if this is an issue when building or checking the Bloom filter; but either way, a Bloom filter built with `i8` or `i16` values (as opposed to `i32` or `i64`) always returns `false` when checked.

**To Reproduce**

returns:

**Expected behavior**

These tests should pass.

**Additional context**

I found this from Datafusion: https://github.com/apache/arrow-datafusion/issues/9779