Closed: lmmx closed this 1 month ago
Hi @lmmx! Thanks for both the comprehensive issue and solution!
I must admit I'm not entirely sure why Jakob introduced the `read_csv` functionality (I've taken co-ownership of this project). If I understand the docstring correctly, it's used for cases where the header doesn't exist.
Do you think this is a common use case? Or is it a bit niche to want validation on a CSV file that doesn't have a header? I'm wondering whether this might be best solved by removing the `read_csv` functionality entirely and relying on polars directly for reading CSVs.
I actually took a look at your issue a few days ago, and decided that it would be nice to have the `AliasGenerator` support built into the models directly, not only for the `read_csv` case. I accidentally added this to the ruff PR 😓 and didn't notice before merging. I've just added some tests that were meant to go with it in #61.
Could you take a look using the latest main branch and see if this solves parts of your problem?
Hey @thomasaarholt, nice work!
No, `pt.read_csv` exists because `pl.read_csv` infers schema dtypes, whereas with Patito we always want to proactively define the dtypes in the data model. The docstring's mention of reading CSV files without headers is in the examples section, i.e. it's not the primary purpose, just a good demonstration of what the function can do.
It makes sense to have a CSV reader method but not an Arrow/Parquet one: Parquet files embed the schema types, whereas CSV does not, so the dtypes must either be provided (here, passed from the model) or inferred (which is imperfect).
I was motivated to pick up this issue when I encountered a misinferred schema: a dataset of train stations' platform codes, with both numeric and alphanumeric values, was misinferred as `int` when it was really `str` typed.
I can reproduce that error with a toy example here, lowering the number of rows used to infer the schema (`infer_schema_length`) to recreate the scenario of a "surprise" value that clashes with the inferred schema. (This was run with Patito reinstalled from the main branch.)
>>> import polars as pl
>>> from io import StringIO
>>> inp1 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp1)
shape: (2, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═══════╪═══════╡
│ 1 ┆ 2 │
│ 10 ┆ B │
└───────┴───────┘
But when the inferred schema length is too short you get the same error I initially hit
I'll take a look at the other aspect later, feeling a bit under the weather today! 🤒 I see it was introduced in this commit:
*Edit*: If I'm reading it correctly, the code you introduced in the `_transform_df` function runs on a call to `pt.DataFrame.validate()`. This is later than the step at which `read_csv` runs (where the model is used for its dtypes and the alias generator for the field-to-column name transformations). I'll rebase and see how this works alongside the `read_csv` code; it looks fine at first glance.
Thanks for doing this! I rebased this on latest main, and it's looking good!
If the header is to be read by the Polars CSV reader, and if the Pydantic model has an alias generator, then the columns should be expected to be given by applying the aliasing function on the model fields.
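As a sketch of that expectation using pydantic v2's alias generator machinery (the model here is illustrative, not Patito's actual API):

```python
from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel


class Station(BaseModel):
    # Hypothetical model: the alias generator maps each field name
    # to the column name expected in the CSV header.
    model_config = ConfigDict(alias_generator=to_camel)

    platform_code: str


# The CSV header column corresponding to `platform_code`
print(Station.model_fields["platform_code"].alias)  # prints "platformCode"
```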
This solves the issue reported in #55.