JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License

feat: use model field alias generator appropriately for CSV column names #54

Closed · lmmx closed this 1 month ago

lmmx commented 4 months ago

If the header is to be read by the Polars CSV reader, and the Pydantic model has an alias generator, then the expected column names should be obtained by applying the aliasing function to the model fields.
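A minimal sketch of the intended mapping, using plain Pydantic v2 (the `Station` model and the `to_camel` choice are illustrative, not from this PR):

```py
from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

class Station(BaseModel):
    # Hypothetical model: any alias generator would do; to_camel is
    # just a stock generator shipped with Pydantic v2.
    model_config = ConfigDict(alias_generator=to_camel)

    station_name: str
    platform_code: str

# The CSV header is then expected to contain the aliased names,
# i.e. the alias generator applied to each model field name:
expected_columns = [
    info.alias or name for name, info in Station.model_fields.items()
]
print(expected_columns)  # ['stationName', 'platformCode']
```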

This solves the issue reported in:

thomasaarholt commented 4 months ago

Hi @lmmx! Thanks for both the comprehensive issue and solution!

I must admit to not being entirely sure why Jakob introduced the read_csv functionality (I've taken co-ownership of this project). If I understand correctly from the docstring, it's used for cases where the header doesn't exist.

Do you think this is a common use case? Or is it a bit niche for someone to want validation on a CSV file that doesn't have a header? I'm just wondering if this might be best solved by removing the read_csv functionality entirely, and just relying on polars directly for reading in CSVs.
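For reference, a sketch of what that alternative would look like: read with plain `pl.read_csv` and check the result with the model's `validate` classmethod (the `Product` model and file path here are made up for illustration):

```py
import patito as pt
import polars as pl

class Product(pt.Model):
    name: str
    price: int

# Hypothetical file; dtypes are inferred by polars, then checked by patito.
df = pl.read_csv("products.csv")
Product.validate(df)  # raises a validation error if the schema mismatches
```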

I actually took a look at your issue a few days ago, and decided that it would be nice to have the AliasGenerator support built into the models directly, not only for the read_csv case. I accidentally added this to the ruff PR 😓 and didn't notice before merging. I've just added some tests that I meant to go together with it in #61.

Could you take a look using the latest main branch and see if this solves parts of your problem?

lmmx commented 4 months ago

Hey @thomasaarholt, nice work!

No, pt.read_csv exists because pl.read_csv infers schema dtypes, whereas with Patito we always want to proactively define the dtypes in the data model. The docstring's mention of reading CSV files without headers is in its examples section, i.e. that's not the primary purpose, just a good demonstration of what it can do.

It makes sense to have a CSV reader method but not an Arrow/Parquet one, because Parquet files embed their schema types, whereas CSV does not: the dtypes must either be provided (by passing the model's dtypes) or inferred (which is imperfect).
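In plain polars terms, the value pt.read_csv adds is roughly the following (the hand-written `dtypes` mapping stands in for what patito derives from the model; column names match the toy example below):

```py
import polars as pl
from io import StringIO

csv = StringIO("col_a,col_b\n1,2\n10,B")

# Instead of letting polars sample rows to guess dtypes, supply them
# up front, as patito can derive them from the model's annotations.
df = pl.read_csv(csv, dtypes={"col_a": pl.Int64, "col_b": pl.Utf8})
```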

I was motivated to pick up this issue when I encountered a misinferred schema (a dataset of train stations' platform codes with numeric and alphanumeric values was misinferred as int when it was str typed).

I can reproduce that error with a toy example here, lowering the number of rows used to infer the schema to recreate the scenario of a "surprise" value that clashes with the inferred schema. (This was run with patito reinstalled from the main branch.)

```py
>>> import polars as pl
>>> from io import StringIO
>>> inp1 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp1)
shape: (2, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ ---   ┆ ---   │
│ i64   ┆ str   │
╞═══════╪═══════╡
│ 1     ┆ 2     │
│ 10    ┆ B     │
└───────┴───────┘
```

But when the number of rows used for schema inference is too low, you get the same error I initially hit:

<details>
<summary>Click to show error message</summary>

```py
>>> inp2 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp2, infer_schema_length=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/io/csv/functions.py", line 397, in read_csv
    df = pl.DataFrame._read_csv(
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/dataframe/frame.py", line 655, in _read_csv
    self._df = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse `B` as dtype `i64` at column 'col_b' (column number 2)

The current offset in the file is 19 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `B` to the `null_values` list.

Original error: ```remaining bytes non-empty```
```

</details>
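For completeness, supplying the dtype up front (which is what passing the model's dtypes achieves) sidesteps the misinference even with the short inference window; a sketch continuing the toy example:

```py
import polars as pl
from io import StringIO

inp3 = StringIO("col_a,col_b\n1,2\n10,B")
# No ComputeError now: col_b is read as str regardless of the sampled rows.
df = pl.read_csv(inp3, infer_schema_length=1, dtypes={"col_b": pl.Utf8})
assert df.schema["col_b"] == pl.Utf8
```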

I'll take a look at the other aspect later, feeling a bit under the weather today! 🤒 I see it was introduced in this commit:

**Edit:** If I'm reading it correctly, the code you introduced in the `_transform_df` function will run on calls to `pt.DataFrame.validate()`. That is later than the step at which `read_csv` runs (where the model is used for its dtypes, and the alias generator for the field-to-column-name transformations). I'll rebase and see how this works alongside the `read_csv` code; it looks fine at first glance.

thomasaarholt commented 1 month ago

Thanks for doing this! I rebased this onto the latest main, and it's looking good!