ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License
120 stars 6 forks source link

feat: added support for a dtypes parameter #195

Closed lukapeschke closed 8 months ago

lukapeschke commented 8 months ago

closes #173

lukapeschke commented 8 months ago

Sorry for the delay on this one, I was AFK for the last couple of days

PrettyWood commented 8 months ago

I tried to play with it. Overall looks good! the dtypes override with datetime with "duration" seems weird

ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "duration"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬──────────────┬────────┐
│ Date                ┆ Country     ┆ Confirmed ┆ Recovered    ┆ Deaths │
│ ---                 ┆ ---         ┆ ---       ┆ ---          ┆ ---    │
│ datetime[ms]        ┆ str         ┆ i64       ┆ duration[ms] ┆ i64    │
╞═════════════════════╪═════════════╪═══════════╪══════════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0         ┆ null         ┆ 0      │
│ …                   ┆ …           ┆ …         ┆ …            ┆ …      │
│ 2020-09-28 00:00:00 ┆ India       ┆ 6145291   ┆ null         ┆ 96318  │
│ 2020-09-29 00:00:00 ┆ India       ┆ 6225763   ┆ null         ┆ 97497  │
│ 2020-09-30 00:00:00 ┆ India       ┆ 6312584   ┆ null         ┆ 98678  │
│ 2020-10-01 00:00:00 ┆ India       ┆ 6394068   ┆ null         ┆ 99773  │
│ 2020-10-02 00:00:00 ┆ India       ┆ 6473544   ┆ null         ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴──────────────┴────────┘

compared to

ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "string"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬───────────┬────────┐
│ Date                ┆ Country     ┆ Confirmed ┆ Recovered ┆ Deaths │
│ ---                 ┆ ---         ┆ ---       ┆ ---       ┆ ---    │
│ datetime[ms]        ┆ str         ┆ i64       ┆ str       ┆ i64    │
╞═════════════════════╪═════════════╪═══════════╪═══════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0         ┆ 0         ┆ 0      │
│ …                   ┆ …           ┆ …         ┆ …         ┆ …      │
│ 2020-09-28 00:00:00 ┆ India       ┆ 6145291   ┆ 5101397   ┆ 96318  │
│ 2020-09-29 00:00:00 ┆ India       ┆ 6225763   ┆ 5187825   ┆ 97497  │
│ 2020-09-30 00:00:00 ┆ India       ┆ 6312584   ┆ 5273201   ┆ 98678  │
│ 2020-10-01 00:00:00 ┆ India       ┆ 6394068   ┆ 5352078   ┆ 99773  │
│ 2020-10-02 00:00:00 ┆ India       ┆ 6473544   ┆ 5427706   ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴───────────┴────────┘
ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "datetime"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬───────────────────────┬────────┐
│ Date                ┆ Country     ┆ Confirmed ┆ Recovered             ┆ Deaths │
│ ---                 ┆ ---         ┆ ---       ┆ ---                   ┆ ---    │
│ datetime[ms]        ┆ str         ┆ i64       ┆ datetime[ms]          ┆ i64    │
╞═════════════════════╪═════════════╪═══════════╪═══════════════════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31 00:00:00   ┆ 0      │
│ …                   ┆ …           ┆ …         ┆ …                     ┆ …      │
│ 2020-09-28 00:00:00 ┆ India       ┆ 6145291   ┆ +15867-02-23 00:00:00 ┆ 96318  │
│ 2020-09-29 00:00:00 ┆ India       ┆ 6225763   ┆ +16103-10-12 00:00:00 ┆ 97497  │
│ 2020-09-30 00:00:00 ┆ India       ┆ 6312584   ┆ +16337-07-13 00:00:00 ┆ 98678  │
│ 2020-10-01 00:00:00 ┆ India       ┆ 6394068   ┆ +16553-06-27 00:00:00 ┆ 99773  │
│ 2020-10-02 00:00:00 ┆ India       ┆ 6473544   ┆ +16760-07-20 00:00:00 ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴───────────────────────┴────────┘
ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "date"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬──────────────┬────────┐
│ Date                ┆ Country     ┆ Confirmed ┆ Recovered    ┆ Deaths │
│ ---                 ┆ ---         ┆ ---       ┆ ---          ┆ ---    │
│ datetime[ms]        ┆ str         ┆ i64       ┆ date         ┆ i64    │
╞═════════════════════╪═════════════╪═══════════╪══════════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0         ┆ 1899-12-31   ┆ 0      │
│ …                   ┆ …           ┆ …         ┆ …            ┆ …      │
│ 2020-09-28 00:00:00 ┆ India       ┆ 6145291   ┆ +15867-02-23 ┆ 96318  │
│ 2020-09-29 00:00:00 ┆ India       ┆ 6225763   ┆ +16103-10-12 ┆ 97497  │
│ 2020-09-30 00:00:00 ┆ India       ┆ 6312584   ┆ +16337-07-13 ┆ 98678  │
│ 2020-10-01 00:00:00 ┆ India       ┆ 6394068   ┆ +16553-06-27 ┆ 99773  │
│ 2020-10-02 00:00:00 ┆ India       ┆ 6473544   ┆ +16760-07-20 ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴──────────────┴────────┘

But maybe we can merge the PR like that and dig a bit more into "duration", "datetime" support with format, mix of strings + dates... WDTY @lukapeschke ?

lukapeschke commented 8 months ago

Hi @PrettyWood thanks for the testing!

About your remarks:

PrettyWood commented 8 months ago

But I don't know if converting a datetime to a duration really makes sense, does it ?

Can probably be tackled in another PR (if needed)

Yes, seems reasonable, I also considered it. My proposal would be to rename the property to specified_dtypes, and add a def get_dtypes(self) -> DTypeMap method, since dtype computation might be expensive, WDYT ?

Yes, perfect

Indeed, the behaviour is not clear

Not sure if I was explicit enough. If I write dtypes = {1: "boolean"} I expect the second column to be casted. It doesn't work. It works with __UNNAMED__1 though

In another PR

Completely fine with me. Was just a remark. Your proposal seems ok. We can probably write that in an issue, discuss the implementation and then open the PR

lukapeschke commented 8 months ago

@PrettyWood 94e59d4 allows for dtypes to be specified by name or index in case all columns are selected and ef55f27 renames the dtypes property to specified_dtypes.

In the end, I haven't added the dtypes method, as #198 will provide that information