Closed lukapeschke closed 8 months ago
Sorry for the delay on this one, I was AFK for the last couple of days
I tried to play with it. Overall looks good!
the dtypes
override with datetime
with "duration"
seems weird
ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "duration"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬──────────────┬────────┐
│ Date ┆ Country ┆ Confirmed ┆ Recovered ┆ Deaths │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ms] ┆ str ┆ i64 ┆ duration[ms] ┆ i64 │
╞═════════════════════╪═════════════╪═══════════╪══════════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0 ┆ null ┆ 0 │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0 ┆ null ┆ 0 │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0 ┆ null ┆ 0 │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0 ┆ null ┆ 0 │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0 ┆ null ┆ 0 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 2020-09-28 00:00:00 ┆ India ┆ 6145291 ┆ null ┆ 96318 │
│ 2020-09-29 00:00:00 ┆ India ┆ 6225763 ┆ null ┆ 97497 │
│ 2020-09-30 00:00:00 ┆ India ┆ 6312584 ┆ null ┆ 98678 │
│ 2020-10-01 00:00:00 ┆ India ┆ 6394068 ┆ null ┆ 99773 │
│ 2020-10-02 00:00:00 ┆ India ┆ 6473544 ┆ null ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴──────────────┴────────┘
compared to
ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "string"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬───────────┬────────┐
│ Date ┆ Country ┆ Confirmed ┆ Recovered ┆ Deaths │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ms] ┆ str ┆ i64 ┆ str ┆ i64 │
╞═════════════════════╪═════════════╪═══════════╪═══════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0 ┆ 0 ┆ 0 │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0 ┆ 0 ┆ 0 │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0 ┆ 0 ┆ 0 │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0 ┆ 0 ┆ 0 │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0 ┆ 0 ┆ 0 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 2020-09-28 00:00:00 ┆ India ┆ 6145291 ┆ 5101397 ┆ 96318 │
│ 2020-09-29 00:00:00 ┆ India ┆ 6225763 ┆ 5187825 ┆ 97497 │
│ 2020-09-30 00:00:00 ┆ India ┆ 6312584 ┆ 5273201 ┆ 98678 │
│ 2020-10-01 00:00:00 ┆ India ┆ 6394068 ┆ 5352078 ┆ 99773 │
│ 2020-10-02 00:00:00 ┆ India ┆ 6473544 ┆ 5427706 ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴───────────┴────────┘
ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "datetime"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬───────────────────────┬────────┐
│ Date ┆ Country ┆ Confirmed ┆ Recovered ┆ Deaths │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ms] ┆ str ┆ i64 ┆ datetime[ms] ┆ i64 │
╞═════════════════════╪═════════════╪═══════════╪═══════════════════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 00:00:00 ┆ 0 │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 00:00:00 ┆ 0 │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 00:00:00 ┆ 0 │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 00:00:00 ┆ 0 │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 00:00:00 ┆ 0 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 2020-09-28 00:00:00 ┆ India ┆ 6145291 ┆ +15867-02-23 00:00:00 ┆ 96318 │
│ 2020-09-29 00:00:00 ┆ India ┆ 6225763 ┆ +16103-10-12 00:00:00 ┆ 97497 │
│ 2020-09-30 00:00:00 ┆ India ┆ 6312584 ┆ +16337-07-13 00:00:00 ┆ 98678 │
│ 2020-10-01 00:00:00 ┆ India ┆ 6394068 ┆ +16553-06-27 00:00:00 ┆ 99773 │
│ 2020-10-02 00:00:00 ┆ India ┆ 6473544 ┆ +16760-07-20 00:00:00 ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴───────────────────────┴────────┘
ipdb> excel_file.load_sheet_by_idx(0, dtypes={"Recovered": "date"}).to_polars()
shape: (65_535, 5)
┌─────────────────────┬─────────────┬───────────┬──────────────┬────────┐
│ Date ┆ Country ┆ Confirmed ┆ Recovered ┆ Deaths │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ms] ┆ str ┆ i64 ┆ date ┆ i64 │
╞═════════════════════╪═════════════╪═══════════╪══════════════╪════════╡
│ 2020-01-22 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 ┆ 0 │
│ 2020-01-23 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 ┆ 0 │
│ 2020-01-24 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 ┆ 0 │
│ 2020-01-25 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 ┆ 0 │
│ 2020-01-26 00:00:00 ┆ Afghanistan ┆ 0 ┆ 1899-12-31 ┆ 0 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 2020-09-28 00:00:00 ┆ India ┆ 6145291 ┆ +15867-02-23 ┆ 96318 │
│ 2020-09-29 00:00:00 ┆ India ┆ 6225763 ┆ +16103-10-12 ┆ 97497 │
│ 2020-09-30 00:00:00 ┆ India ┆ 6312584 ┆ +16337-07-13 ┆ 98678 │
│ 2020-10-01 00:00:00 ┆ India ┆ 6394068 ┆ +16553-06-27 ┆ 99773 │
│ 2020-10-02 00:00:00 ┆ India ┆ 6473544 ┆ +16760-07-20 ┆ 100842 │
└─────────────────────┴─────────────┴───────────┴──────────────┴────────┘
But maybe we can merge the PR like that and dig a bit more into "duration", "datetime" support with format, mix of strings + dates... WDTY @lukapeschke ?
Hi @PrettyWood thanks for the testing!
About your remarks:
the dtypes override with datetime with "duration" seems weird
I guess it's because we're still using as_duration
, which does not seem to work with DateTimeIso
, but we'd need some debug prints for this. Or it might be an overflow. But I don't know if converting a datetime to a duration really makes sense, does it ?
could we have all the dtypes? (not only the overridden ones)?
Yes, seems reasonable, I also considered it. My proposal would be to rename the property to specified_dtypes
, and add a def get_dtypes(self) -> DTypeMap
method, since dtype computation might be expensive, WDYT ?
It does not seem to work with my tests
Indeed, the behaviour is not clear. The thing is that here, I went with "if columns are specified by name or not specified, go with name, otherwise got with index". For now, the solution would probably be a third closure in the case of SelectedColumns::All
, where we would try by name, then by index, and finally try to guess.
I reckon we could support dict[str | int, Dtype] for unnamed columns (instead of recalling to write UNNAMED1. ..)
In another PR, I'd like to adapt SelectedColumns
to allow for mixed indices and names, and implement that for dtype selection as well. The sheet's columns would probably be stored as a struct containing their name and index (and maybe have a distinction between the provided name of guessed name ? or have an enum for that, something like "provided" | "looked_up" | "generated_unnamed"
). This would allow for less computations in the end, and more information being exposed to the user. What's your opinion on this ?
In any case, I think this would rather be something for v0.11.0, since there already have been a lot of changes for v0.10.0.
But I don't know if converting a datetime to a duration really makes sense, does it ?
Can probably be tackled in another PR (if needed)
Yes, seems reasonable, I also considered it. My proposal would be to rename the property to specified_dtypes, and add a def get_dtypes(self) -> DTypeMap method, since dtype computation might be expensive, WDYT ?
Yes, perfect
Indeed, the behaviour is not clear
Not sure if I was explicit enough. If I write dtypes = {1: "boolean"}
I expect the second column to be casted. It doesn't work. It works with __UNNAMED__1
though
In another PR
Completely fine with me. Was just a remark. Your proposal seems ok. We can probably write that in an issue, discuss the implementation and then open the PR
@PrettyWood 94e59d4 allows for dtypes to be specified by name or index in case all columns are selected and ef55f27 renames the dtypes
property to specified_dtypes
.
In the end, I haven't added the dtypes
method, as #198 will provide that information
closes #173