ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License
120 stars 6 forks source link

Coercing dates to strings adds time of the form '00:00:00' #252

Open severinh opened 4 months ago

severinh commented 4 months ago

How to reproduce

Suppose you have an Excel file with the following mixed-type data:

Header
=DATE(2024, 7, 1)
=DATE(2024, 7, 2)
"some string"

Now lets read this Excel file into a Polars dataframe, which coerces the column to strings: excel_reader.load_sheet(0).to_polars()

This produces the following data frame:

Header
"2024-07-01 00:00:00"
"2024-07-02 00:00:00"
"some string"

Expected behavior

What I would have expected fastexcel to not include the time 00:00:00 for these dates. That is,

Header
"2024-07-01"
"2024-07-02"
"some string"

Test case

Excel sheet: sheet-date.xlsx

def test_date_casting_to_string() -> None:
    excel_reader = fastexcel.read_excel(path_for_fixture("sheet-date.xlsx"))
    sheet = excel_reader.load_sheet(0, column_names=["col1"])
    expected = {
        "col1": ["2024-07-01", "2024-07-02", "some string"],
    }
    pl_assert_frame_equal(sheet.to_polars(), pl.DataFrame(expected))
PrettyWood commented 4 months ago

Yup the issue is that it's detected as datetime by calamine. We can try to do a as_date() afterwards but it would be better if calamine could handle directly dates. I'll dig into into it this weekend but it won't be a quick fix like the bool to string. I'll probably need to open a PR on calamine side

severinh commented 4 months ago

Thanks @PrettyWood! I suspected that this one would not be as straight-forward.

In any case, no pressure on this. In contrast to #250, this issue here is not blocking our adoption of fastexcel. We can work around this for now by trimming 00:00:00 as a post-processing step. Something like:

sheet_df.with_columns(pl.all().str.replace_all(r"^([0-9]{4}-[0-9]{2}-[0-9]{2}) 00:00:00$", "$1"))