ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License
120 stars 6 forks source link

29.02 cell cast to 29.020000000000003 string #289

Closed severinh closed 1 month ago

severinh commented 1 month ago

Steps to reproduce

See the Excel file decimal-numbers.xlsx. It contains a column with two decimals:

Decimals
28.14
29.02

Read the Excel file using fastexcel:

read_excel("decimal-numbers.xlsx").load_sheet_by_name("Sheet1").to_polars()
shape: (2, 1)
┌──────────┐
│ Decimals │
│ ---      │
│ f64      │
╞══════════╡
│ 28.14    │
│ 29.02    │
└──────────┘

Looks fine.

Then read the Excel file while casting to strings:

read_excel("decimal-numbers.xlsx").load_sheet_by_name("Sheet1", dtypes={0: 'string'}).to_polars()
shape: (2, 1)
┌────────────────────┐
│ Decimals           │
│ ---                │
│ str                │
╞════════════════════╡
│ 28.14              │
│ 29.020000000000003 │
└────────────────────┘

The expected result was that the strings would be the same as shown to the user in the Excel file:

shape: (2, 1)
┌──────────┐
│ Decimals │
│ ---      │
│ str      │
╞══════════╡
│ 28.14    │
│ 29.02    │
└──────────┘

Wrap-up

I understand this looks like an issue due to floating point precision, and I'm not sure if this:

  1. could be fixed in fastexcel
  2. could be fixed in the underlying calamine.
  3. cannot be fixed at all, since it's a fundamental property of the Excel file format or parsing process.

What's the motivation for filing this bug: In our system, we have highly heterogeneous data, so we have to read all Excel values as strings. However, if users see 29.02 in their Excel files, but 29.020000000000003 in our system, that's highly confusing and surprising to users.

What do you think?

PrettyWood commented 1 month ago

Hello @severinh Always great to have a clear bug issue with file and clear explanation thank you! I checked quickly and indeed it feels hard to fix on fastexcel side

In src/data.rs, in create_string_array function we simply do

cell.as_string()

Even with

cell.get_float().map(|v| v.to_string())

or

cell.as_f64()

we get the same behavior with this floating precision issue and AFAIK there is no way for us to get more info on the original input. I'll try to dig on calamine side tonight

EDIT: quick check on calamine

    inner: [
        SharedString(
            "Decimals",
        ),
        Float(
            28.14,
        ),
        Float(
            29.020000000000003,
        ),
    ],

when reading the xml content we get

BytesText { content: Borrowed("29.020000000000003") }

so I guess it goes even further directly in the xml content 😞

jmcnamara commented 1 month ago

The floating point number in the Excel file is 29.020000000000003:

    <row r="2" spans="1:1" x14ac:dyDescent="0.2">
      <c r="A2" s="1">
        <v>28.14</v>
      </c>
    </row>
    <row r="3" spans="1:1" x14ac:dyDescent="0.2">
      <c r="A3" s="1">
        <v>29.020000000000003</v>
      </c>
    </row>

For the display (and storage) of floating point numbers Excel uses something very similar to the printf() format option %.16G:

$ python
>>> print("%.16G" % 29.020000000000003)
29.02

That format option isn't supported by Rust in std::fmt however.

PrettyWood commented 1 month ago

Thanks @jmcnamara indeed it's really in the xml content itself.

Maybe we could use rust_decimal crate that implements already a "smart" Display

With

[dependencies]
rust_decimal = { version = "1.36.0", default-features = false }
use rust_decimal::prelude::*;

fn main() {
    let x = 29.020000000000003_f64;
    println!("x: {}", Decimal::from_f64(x).unwrap());
}
▶ cargo r -q
x: 29.02

WDYT?

jmcnamara commented 1 month ago

For the display (and storage) of floating point numbers Excel uses something very similar to the printf() format option %.16G:

I was wrong about this for the display part. Excel displays decimal numbers with 8 digits in a standard cell width and 10 digits in a wide cell. Like this:

Format = 0.000000000 Unformatted, wide cell Unformatted, standard width
1.123456789 1.123456789 1.123457
12.123456789 12.12345679 12.12346
123.123456789 123.1234568 123.1235
1234.123456789 1234.123457 1234.123
12345.123456789 12345.12346 12345.12
123456.123456789 123456.1235 123456.1

It also trims trailing zeros and the decimal point if there are no fractional part.

PrettyWood commented 1 month ago

Again thank you for the info. Much appreciated. I reckon we should make a PR on calamine side. I can maybe do that this weekend

severinh commented 1 month ago

Huge thanks for the deep investigation into this issue, @PrettyWood and @jmcnamara! Learned a lot from reading this discussion.

I would trust you to make the right decisions if and where this should be changed. I also hope that this won't have any negative impact on other users of calamine and fastexcel (such as performance).

PrettyWood commented 1 month ago

We chose to support it directly in fastexcel for now. It should have very little impact because it's only applied for float to string conversion