ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License

29.02 cell cast to 29.020000000000003 string #289

Open severinh opened 2 days ago

severinh commented 2 days ago

Steps to reproduce

See the Excel file decimal-numbers.xlsx. It contains a column with two decimals:

Decimals
28.14
29.02

Read the Excel file using fastexcel:

from fastexcel import read_excel
read_excel("decimal-numbers.xlsx").load_sheet_by_name("Sheet1").to_polars()
shape: (2, 1)
┌──────────┐
│ Decimals │
│ ---      │
│ f64      │
╞══════════╡
│ 28.14    │
│ 29.02    │
└──────────┘

Looks fine.

Then read the Excel file while casting to strings:

read_excel("decimal-numbers.xlsx").load_sheet_by_name("Sheet1", dtypes={0: 'string'}).to_polars()
shape: (2, 1)
┌────────────────────┐
│ Decimals           │
│ ---                │
│ str                │
╞════════════════════╡
│ 28.14              │
│ 29.020000000000003 │
└────────────────────┘

The expected result was that the strings would be the same as shown to the user in the Excel file:

shape: (2, 1)
┌──────────┐
│ Decimals │
│ ---      │
│ str      │
╞══════════╡
│ 28.14    │
│ 29.02    │
└──────────┘

Wrap-up

I understand this looks like a floating-point precision issue, and I'm not sure whether it:

  1. could be fixed in fastexcel,
  2. could be fixed in the underlying calamine, or
  3. cannot be fixed at all, since it's a fundamental property of the Excel file format or parsing process.

Motivation for filing this bug: in our system, we have highly heterogeneous data, so we have to read all Excel values as strings. However, if users see 29.02 in their Excel files but 29.020000000000003 in our system, that is confusing and surprising to them.

What do you think?

PrettyWood commented 2 days ago

Hello @severinh, it's always great to get a clear bug report with a file and a clear explanation, thank you! I checked quickly, and it does indeed look hard to fix on the fastexcel side.

In src/data.rs, in the create_string_array function, we simply do

cell.as_string()

Even with

cell.get_float().map(|v| v.to_string())

or

cell.as_f64()

we get the same floating-point precision behavior, and AFAIK there is no way for us to get more information about the original input. I'll try to dig into the calamine side tonight.

EDIT: quick check on calamine

    inner: [
        SharedString(
            "Decimals",
        ),
        Float(
            28.14,
        ),
        Float(
            29.020000000000003,
        ),
    ],

and when reading the XML content we get

BytesText { content: Borrowed("29.020000000000003") }

so I guess the problem goes even deeper: the value is already stored as 29.020000000000003 in the XML content itself 😞
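To illustrate what is going on, here is a minimal Python sketch using only the standard library. It shows that the text stored in the XML parses to a different double than 29.02, and that rounding to 15 significant digits gives back the displayed value. The helper name excel_like_str is hypothetical and not part of fastexcel or calamine. The 15-digit default is an assumption based on Excel's documented display precision, so this is only a possible caller-side workaround, not a description of how either library behaves.

```python
def excel_like_str(value: float, digits: int = 15) -> str:
    """Hypothetical helper: format a float with at most `digits` significant digits.

    Assumption: Excel shows at most 15 significant digits, so rounding to 15
    digits should usually match what the user sees in a General-format cell.
    """
    return format(value, f".{digits}g")


# The exact text found in the sheet XML, parsed as a 64-bit float.
stored = float("29.020000000000003")

print(stored == 29.02)         # False: these are two distinct doubles
print(repr(stored))            # 29.020000000000003
print(excel_like_str(stored))  # 29.02
print(excel_like_str(28.14))   # 28.14
```

Note that the "g" format switches to exponent notation for very large or very small values, and it ignores any number format applied to the cell, so it will not reproduce Excel's display in every case.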