JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License

`read_csv` does not use model dtypes when an alias generator is used #55

Open lmmx opened 4 months ago

lmmx commented 4 months ago

I'm trying out Patito on some real data, and as far as I can tell the `read_csv` helper infers the column types rather than using the model-specified field types, even though the docs suggest the model types are applied here:

> Read CSV and apply correct column name and types from model.

I think this happens because the alias generator is not used to map the CSV columns to their respective fields at this step (which also explains why converting to the data models first and then to DataFrames does work).

It would be nicer if this were fixed so that the model dtypes are applied automatically.

In this case, on a dataset of a few thousand rows with a column containing a mix of numeric and alphanumeric values, if by chance none of the first few values are alphanumeric, the column gets inferred as numeric (here, `int`).
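The failure mode above can be sketched in a few lines of plain Python (this is illustrative, not Patito or polars internals): a dtype guessed from only a prefix of the rows is wrong whenever the first non-conforming value lies past the sample.

```python
# Illustrative sketch: why inferring a column dtype from only the first
# `sample_size` rows mis-types a mixed numeric/alphanumeric column.
def infer_dtype(values, sample_size=100):
    """Guess a column's dtype by looking only at the first `sample_size` values."""
    sample = values[:sample_size]
    return int if all(v.isdigit() for v in sample) else str


# A few thousand numeric codes followed by an alphanumeric one such as "A":
column = [str(i) for i in range(5000)] + ["A"]

infer_dtype(column, sample_size=100)    # int -> parsing "A" as i64 later fails
infer_dtype(column, sample_size=10000)  # str -> the whole column was sampled
```

This is exactly why increasing `infer_schema_length` works around the error below, while the model-declared `str` type would have avoided it entirely.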

File attached for reproducibility:

My model definition is:

```py
from __future__ import annotations
from tubeulator.utils.string_conv import to_camel_case
from pydantic import AliasGenerator, ConfigDict
from enum import Enum

from patito import Model


class LocationTypeEnum(Enum):
    Stop = "0"
    Station = "1"
    # EntranceOrExit = "2"
    GenericNode = "3"
    # BoardingArea = "4"


class Stop(Model):
    model_config = ConfigDict(
        alias_generator=AliasGenerator(validation_alias=to_camel_case),
    )

    StopId: str
    StopCode: str | None = None
    StopName: str
    # StopDesc: str = None
    StopLat: float | None = None
    StopLon: float | None = None
    LocationType: LocationTypeEnum
    ParentStation: str | None = None
    LevelId: str | None = None
    PlatformCode: str | None = None
```

The alias generator ensures that all the fields are aliased appropriately using the provided callable, `to_camel_case`.

Click to show `to_camel_case` funcdef:

```py
import re

__all__ = ["to_camel_case", "to_pascal_case"]


def replace_multi_with_single(string: str, char="_") -> str:
    """
    Replace multiple consecutive occurrences of `char` with a single one.
    """
    rep = char + char
    while rep in string:
        string = string.replace(rep, char)
    return string


def to_camel_case(string: str) -> str:
    """
    Convert a string to Camel Case.

    Examples::

        >>> to_camel_case("ModeName")
        'modeName'
        >>> to_camel_case("a_b_c")
        'aBC'
    """
    string = replace_multi_with_single(string.replace("-", "_").replace(" ", "_"))
    return string[0].lower() + re.sub(
        r"(?:_)(.)",
        lambda m: m.group(1).upper(),
        string[1:],
    )
```
```py
>>> Stop.DataFrame.read_csv("../data/stationdata/gtfs/stops.txt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/patito/polars.py", line 913, in read_csv
    df = cls.model.DataFrame._from_pydf(pl.read_csv(*args, **kwargs)._df)
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/io/csv/functions.py", line 397, in read_csv
    df = pl.DataFrame._read_csv(
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/dataframe/frame.py", line 655, in _read_csv
    self._df = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse `A` as dtype `i64` at column 'platform_code' (column number 10)

The current offset in the file is 162305 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `A` to the `null_values` list.

Original error: `remaining bytes non-empty`
```

It works when the schema inference length is increased:

```py
>>> Stop.DataFrame.read_csv("../data/stationdata/gtfs/stops.txt", infer_schema_length=10000, dtypes=Stop.dtypes)
shape: (6_384, 10)
┌───────────┬───────────────┬──────────────────────┬──────────┬───┬───────────────┬──────────┬───────────────────────────────────┬──────────┐
│ stop_code ┆ platform_code ┆ stop_name            ┆ stop_lon ┆ … ┆ location_type ┆ level_id ┆ stop_id                           ┆ stop_lat │
│ ---       ┆ ---           ┆ ---                  ┆ ---      ┆   ┆ ---           ┆ ---      ┆ ---                               ┆ ---      │
│ str       ┆ str           ┆ str                  ┆ f64      ┆   ┆ i64           ┆ str      ┆ str                               ┆ f64      │
╞═══════════╪═══════════════╪══════════════════════╪══════════╪═══╪═══════════════╪══════════╪═══════════════════════════════════╪══════════╡
│ HUBABW    ┆ null          ┆ Abbey Wood           ┆ null     ┆ … ┆ 1             ┆ null     ┆ HUBABW                            ┆ null     │
│ null      ┆ null          ┆ Outside Abbey Wood   ┆ null     ┆ … ┆ 3             ┆ null     ┆ HUBABW-Outside                    ┆ null     │
│ null      ┆ null          ┆ Bus                  ┆ 0.12128  ┆ … ┆ 3             ┆ L#1      ┆ HUBABW-1001001-Bus-5              ┆ 51.49238 │
│ …         ┆ …             ┆ …                    ┆ …        ┆ … ┆ …             ┆ …        ┆ …                                 ┆ …        │
│ null      ┆ 2             ┆ Westbound Platform 2 ┆ null     ┆ … ┆ 0             ┆ null     ┆ 910GBKRVS-Plat02-WB-london-overg… ┆ null     │
└───────────┴───────────────┴──────────────────────┴──────────┴───┴───────────────┴──────────┴───────────────────────────────────┴──────────┘
```

I tried passing the `dtypes` argument as the error message suggested, but nothing happened; at that point I realised that, of course, the column names only get transformed by the alias generator when ingesting as Pydantic models.

The columns should be read with the correct types: the `pt.DataFrame.read_csv` method should apply the alias generator (or, failing that, per-field aliases) from the associated `pt.Model` class to the `dtypes` mapping it passes through.
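A minimal sketch of this idea (the helper name `remap_dtypes` is illustrative, not the actual Patito API): before the `dtypes` mapping reaches `pl.read_csv`, its keys would be passed through the alias generator so they match the CSV header rather than the Python field names.

```python
import re


def to_camel_case(string: str) -> str:
    """Simplified copy of the issue's alias generator: 'a_b_c' -> 'aBC'."""
    string = string.replace("-", "_").replace(" ", "_")
    return string[0].lower() + re.sub(
        r"_(.)", lambda m: m.group(1).upper(), string[1:]
    )


def remap_dtypes(dtypes: dict, alias_fn) -> dict:
    """Apply the model's alias generator to each field name used as a dtypes key."""
    return {alias_fn(field): dtype for field, dtype in dtypes.items()}


# Model field names keyed dtypes (dtype values stubbed as strings here):
model_dtypes = {"StopId": "str", "StopLat": "f64"}
remap_dtypes(model_dtypes, to_camel_case)
# {'stopId': 'str', 'stopLat': 'f64'}
```

With the keys remapped like this, the `dtypes` argument would actually match the columns polars sees in the header, so no inference is needed for those columns.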

Solutions

This will only ever be a problem when `has_header` is `True` and the model config specifies an `alias_generator`.

I put together a PR to contribute this feature:

It will also need to handle per-field aliases (not implemented initially).
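For the per-field case, the resolution rule could look something like this (a hedged sketch with hypothetical names, not the PR's actual code): an explicit per-field alias, when present, should take precedence over the generator.

```python
def resolve_alias(field_name, field_aliases, alias_fn=None):
    """Resolve the CSV column name for a model field.

    `field_aliases` maps field names to explicit per-field aliases;
    `alias_fn` is the model's alias generator, if any.
    """
    if field_name in field_aliases:  # explicit per-field alias wins
        return field_aliases[field_name]
    if alias_fn is not None:         # otherwise fall back to the generator
        return alias_fn(field_name)
    return field_name                # no aliasing configured at all
```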