datafusion-contrib / datafusion-orc

Implementation of the Apache ORC file format using the Apache Arrow in-memory format
Apache License 2.0

Avoid adding a NullBuffer when decoding timestamp offsets #90

Closed progval closed 1 month ago

progval commented 1 month ago

Since 60288cdee57f72dec84c3fd9f6085561568aad49, we have applied a unary_opt kernel to decode timezones. This kernel always returns Some unless the date falls outside the years 1677-2262.

Unfortunately, even though the kernel is unlikely to return None, applying it always gives the resulting array a NullBuffer, even if the source array did not have one.

To avoid adding an unnecessary NullBuffer, this commit first tries to apply a non-nullable kernel, and falls back to unary_opt only in the rare case that it fails.
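The idea can be sketched without Arrow. The following is a simplified model, not the code in this commit: `Column`, `convert`, and `decode` are hypothetical names, a `Vec<bool>` stands in for the NullBuffer, and a seconds-to-nanoseconds multiplication stands in for the real timezone conversion.

```rust
/// Simplified stand-in for an Arrow primitive array (hypothetical type).
/// `validity: None` models an array without a NullBuffer.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    values: Vec<i64>,
    validity: Option<Vec<bool>>,
}

/// Returns true when slot `i` is not null.
fn is_valid(col: &Column, i: usize) -> bool {
    col.validity.as_ref().map_or(true, |mask| mask[i])
}

/// Hypothetical fallible conversion: seconds to nanoseconds.
/// Returns None on overflow, which is the rare failure case.
fn convert(seconds: i64) -> Option<i64> {
    seconds.checked_mul(1_000_000_000)
}

fn decode(input: &Column) -> Column {
    // Fast path: collecting into Option<Vec<_>> stops at the first None.
    // Null slots are not converted; they keep a placeholder value.
    let fast: Option<Vec<i64>> = input
        .values
        .iter()
        .enumerate()
        .map(|(i, &v)| if is_valid(input, i) { convert(v) } else { Some(0) })
        .collect();
    if let Some(values) = fast {
        // Every conversion succeeded, so the validity mask is reused as-is
        // (and stays absent when the input had none).
        return Column { values, validity: input.validity.clone() };
    }

    // Slow path: at least one conversion failed, so a mask is required.
    let mut values = Vec::with_capacity(input.values.len());
    let mut validity = Vec::with_capacity(input.values.len());
    for (i, &v) in input.values.iter().enumerate() {
        let converted = if is_valid(input, i) { convert(v) } else { None };
        validity.push(converted.is_some());
        values.push(converted.unwrap_or(0));
    }
    Column { values, validity: Some(validity) }
}
```

In this model, an input without a mask comes back without one unless some value overflows; the cost is that the values are scanned a second time in the failure case.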

An alternative implementation that does not risk running the kernel twice would be to check the NullBuffer's null_count after running the kernel and strip the buffer if the count is zero, but this requires allocating a NullBuffer that is usually unnecessary.
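For comparison, the stripping step of that alternative might look like the sketch below. It is again a model rather than Arrow code: `strip_if_no_nulls` is a hypothetical helper, and a `Vec<bool>` stands in for the NullBuffer.

```rust
/// Drops a validity mask that marks every slot as valid.
/// The mask has already been allocated by the time this runs,
/// which is the cost this alternative cannot avoid.
fn strip_if_no_nulls(validity: Option<Vec<bool>>) -> Option<Vec<bool>> {
    match validity {
        Some(mask) if mask.iter().all(|&valid| valid) => None,
        other => other,
    }
}
```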