Closed severinh closed 1 month ago
Hello @severinh
Always great to have a clear bug issue with file and clear explanation thank you!
I checked quickly and indeed it feels hard to fix on fastexcel
side
In src/data.rs
, in create_string_array
function we simply do
cell.as_string()
Even with
cell.get_float().map(|v| v.to_string())
or
cell.as_f64()
we get the same behavior with this floating precision issue and AFAIK there is no way for us to get more info on the original input. I'll try to dig on calamine side tonight
EDIT: quick check on calamine
inner: [
SharedString(
"Decimals",
),
Float(
28.14,
),
Float(
29.020000000000003,
),
],
when reading the xml content we get
BytesText { content: Borrowed("29.020000000000003") }
so I guess it goes even further directly in the xml content 😞
The floating point number in the Excel file is 29.020000000000003:
<row r="2" spans="1:1" x14ac:dyDescent="0.2">
<c r="A2" s="1">
<v>28.14</v>
</c>
</row>
<row r="3" spans="1:1" x14ac:dyDescent="0.2">
<c r="A3" s="1">
<v>29.020000000000003</v>
</c>
</row>
For the display (and storage) of floating point numbers Excel uses something very similar to the printf()
format option %.16G
:
$ python
>>> print("%.16G" % 29.020000000000003)
29.02
That format option isn't supported by Rust in std::fmt however.
Thanks @jmcnamara indeed it's really in the xml content itself.
Maybe we could use rust_decimal
crate that implements already a "smart" Display
With
[dependencies]
rust_decimal = { version = "1.36.0", default-features = false }
use rust_decimal::prelude::*;
fn main() {
let x = 29.020000000000003_f64;
println!("x: {}", Decimal::from_f64(x).unwrap());
}
▶ cargo r -q
x: 29.02
WDYT?
For the display (and storage) of floating point numbers Excel uses something very similar to the
printf()
format option%.16G
:
I was wrong about this for the display part. Excel displays decimal numbers with 8 digits in a standard cell width and 10 digits in a wide cell. Like this:
Format = 0.000000000 | Unformatted, wide cell | Unformatted, standard width |
---|---|---|
1.123456789 | 1.123456789 | 1.123457 |
12.123456789 | 12.12345679 | 12.12346 |
123.123456789 | 123.1234568 | 123.1235 |
1234.123456789 | 1234.123457 | 1234.123 |
12345.123456789 | 12345.12346 | 12345.12 |
123456.123456789 | 123456.1235 | 123456.1 |
It also trims trailing zeros and the decimal point if there are no fractional part.
Again thank you for the info. Much appreciated. I reckon we should make a PR on calamine side. I can maybe do that this weekend
Huge thanks for the deep investigation into this issue, @PrettyWood and @jmcnamara! Learned a lot from reading this discussion.
I would trust you to make the right decisions if and where this should be changed. I also hope that this won't have any negative impact on other users of calamine
and fastexcel
(such as performance).
We chose to support it directly in fastexcel for now. It should have very little impact because it's only applied for float to string conversion
Steps to reproduce
See the Excel file decimal-numbers.xlsx. It contains a column with two decimals:
Read the Excel file using fastexcel:
Looks fine.
Then read the Excel file while casting to strings:
The expected result was that the strings would be the same as shown to the user in the Excel file:
Wrap-up
I understand this looks like an issue due to floating point precision, and I'm not sure if this:
What's the motivation for filing this bug: In our system, we have highly heterogeneous data, so we have to read all Excel values as strings. However, if users see 29.02 in their Excel files, but
29.020000000000003
in our system, that's highly confusing and surprising to users.What do you think?