apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.33k stars 684 forks source link

Decimal enhancements in arrow-cast #5068

Open ryanaston opened 7 months ago

ryanaston commented 7 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. There are currently two different functions in arrow-cast which convert strings to decimals. parse_string_to_decimal_native is used by arrow-cast when casting an entire array from a string type to a decimal type. parse_decimal is used by arrow-csv and arrow-json. The inputs differ a bit, but the return types are identical. The latter accepts an additional precision argument, however when the former is used it is followed up by a call to validate_decimal_precision for the desired type. The former rounds values when input exceeds scale while the latter truncates. Neither function supports parsing scientific notation.

Describe the solution you'd like

  1. Consolidate into a single, consistent function
  2. Support scientific notation input
  3. Default to "half up" rounding
  4. Stretch goal: allow for configurable rounding (maybe as a CastOption?)

Describe alternatives you've considered N/A

Additional context I stumbled upon this when using arrow-json reader to build a record batch. I was getting errors when my decimal values contained scientific notation and noticed parse_decimal doesn't support it, despite scientific notation being valid for JSON numbers. When I tried forcing expanded notation I found the excess scale was being truncated instead of rounded. I then tried building my column using a string array and casting that to a decimal type. Instead of erroring on scientific notation the value was being treated as null. Expanded values did work and were rounded.

tustvold commented 7 months ago

Decimals are rounded when casting between scales, so one option might be to parse using 1 higher scale and then cast. This might be simpler than supporting rounding whilst parsing, which is likely to be complex to achieve without regressing performance.

Instead of erroring on scientific notation the value was being treated as null

Correct, if CastOptions::safe is set to true (the default) values that fail to parse will be treated as nulls

ryanaston commented 7 months ago

Decimals are rounded when casting between scales, so one option might be to parse using 1 higher scale and then cast. This might be simpler than supporting rounding whilst parsing, which is likely to be complex to achieve without regressing performance.

parse_decimal already looks at every character in the string, even those beyond the desired scale to make sure they are valid characters for a decimal. Could it simply examine the character at scale + 1 and do an add_wrapping (for positives) or a sub_wrapping (for negatives)? Rounding is already present in parse_string_to_decimal_native and does it this way. Why would rounding be a performance concern in one but not the other?

Instead of erroring on scientific notation the value was being treated as null

Correct, if CastOptions::safe is set to true (the default) values that fail to parse will be treated as nulls

Ah, missed that. Thank you!

tustvold commented 7 months ago

Could it simply examine the character at scale + 1 and do an add_wrapping (for positives) or a sub_wrapping (for negatives)

My memory is hazy, but I seem to remember the current logic eliminates the need for overflow/underflow detection, or precision verification. Provided we can show we aren't massively regressing performance of decimal parsing, I don't see an issue with changing this behaviour as you describe.