I'm trying out Patito on some real data and, as far as I can tell, it infers the column types (rather than using the model-specified field types) when using the `read_csv` helper, even though the docs suggest that the model types are used here: "Read CSV and apply correct column name and types from model."
I think this is happening due to the alias generator not being used to map the columns to the respective fields at this step (and this explains why converting to the data models first and then to DataFrames does work).
This would be nicer if fixed and done automatically.
In this case, on a dataset of a few thousand rows with a column that mixes numeric and alphanumeric values, the column gets inferred as numeric (here `int`) whenever the first few values happen not to be alphanumeric.
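To illustrate why the first rows matter, here is a toy, standard-library-only sketch of prefix-based type inference. It is not Polars' actual inference code: `infer_column_type` and the sample data are hypothetical and only mimic the effect of a limited `infer_schema_length`.

```python
import csv
import io

# Hypothetical CSV: the alphanumeric value appears only in the last row.
SAMPLE_CSV = "stop_id,platform_code\n1,1\n2,2\n3,3\n4,A\n"


def infer_column_type(values: list[str], sample_size: int) -> type:
    """Guess `int` if every sampled value parses as an integer, else `str`."""
    for value in values[:sample_size]:
        try:
            int(value)
        except ValueError:
            return str
    return int


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
platform_codes = [row["platform_code"] for row in rows]

# Sampling only the first three rows misses the "A" and guesses int.
short_guess = infer_column_type(platform_codes, sample_size=3)
# Sampling every row sees the "A" and falls back to str.
full_guess = infer_column_type(platform_codes, sample_size=len(platform_codes))
print(short_guess.__name__, full_guess.__name__)
```

A guess made from too short a prefix is what later fails to parse `A` as an integer, which is why declared types are safer than inferred ones.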
The alias generator ensures that all the fields are aliased appropriately using the callable provided, `to_camel_case`, defined as follows:
```py
import re

__all__ = ["to_camel_case", "to_pascal_case"]


def replace_multi_with_single(string: str, char="_") -> str:
    """
    Replace multiple consecutive occurrences of `char` with a single one.
    """
    rep = char + char
    while rep in string:
        string = string.replace(rep, char)
    return string


def to_camel_case(string: str) -> str:
    """
    Convert a string to Camel Case.

    Examples::

        >>> to_camel_case("ModeName")
        'modeName'
        >>> to_camel_case("a_b_c")
        'aBC'
    """
    string = replace_multi_with_single(string.replace("-", "_").replace(" ", "_"))
    return string[0].lower() + re.sub(
        r"(?:_)(.)",
        lambda m: m.group(1).upper(),
        string[1:],
    )
```
>>> Stop.DataFrame.read_csv("../data/stationdata/gtfs/stops.txt")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/patito/polars.py", line 913, in read_csv
df = cls.model.DataFrame._from_pydf(pl.read_csv(*args, **kwargs)._df)
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/io/csv/functions.py", line 397, in read_csv
df = pl.DataFrame._read_csv(
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/dataframe/frame.py", line 655, in _read_csv
self._df = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse `A` as dtype `i64` at column 'platform_code' (column number 10)
The current offset in the file is 162305 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `A` to the `null_values` list.
Original error: ```remaining bytes non-empty```
Reading the file does work when the schema inference length is increased. I also tried passing the `dtypes` argument, as the error message suggested, but it had no effect, at which point I realised that of course the column names get transformed by the alias generator when ingesting as Pydantic models.
The columns should be given the correct types by applying the alias generator (or otherwise the per-field aliases) to the dtypes that get passed through in the `pt.DataFrame.read_csv` method when there is an associated `pt.Model` class.
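A minimal sketch of that idea, assuming the CSV header uses the aliased names: key the declared types by alias rather than by field name. `FIELD_TYPES` and `dtypes_by_alias` are hypothetical stand-ins (not Patito API), the field names are invented for illustration, and the `to_camel_case` here is a simplified version of the one shown earlier.

```python
import re
from typing import Callable


def to_camel_case(string: str) -> str:
    """Simplified alias generator: snake_case -> camelCase."""
    return string[0].lower() + re.sub(
        r"_(.)", lambda m: m.group(1).upper(), string[1:]
    )


# Hypothetical model fields and their declared Python types.
FIELD_TYPES = {"mode_name": str, "route_count": int}


def dtypes_by_alias(
    field_types: dict[str, type], alias_generator: Callable[[str], str]
) -> dict[str, type]:
    """Key each declared type by the column name expected in the CSV header."""
    return {alias_generator(name): dtype for name, dtype in field_types.items()}


aliased_dtypes = dtypes_by_alias(FIELD_TYPES, to_camel_case)
print(aliased_dtypes)
```

A mapping keyed like this is what would then need to be forwarded as `dtypes`, so that the declared types line up with the header names actually present in the file.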
Solutions
This will only ever be a problem when `has_header` is `True` and the model config specifies an `alias_generator`.
I put together a PR to contribute this feature: #54
It will also need to handle per-field aliases (not implemented initially).
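One possible way to cover per-field aliases, sketched under the assumption that an explicit alias should win over the generator (which should match Pydantic's default precedence). `resolve_alias` and `explicit_aliases` are hypothetical names, not part of Patito or the PR, and `str.upper` is only a stand-in generator.

```python
from typing import Callable, Optional


def resolve_alias(
    field_name: str,
    explicit_aliases: dict[str, str],
    alias_generator: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the CSV column name expected for a model field."""
    # An alias declared on the field itself takes precedence.
    if field_name in explicit_aliases:
        return explicit_aliases[field_name]
    # Otherwise fall back to the model-wide generator, if any.
    if alias_generator is not None:
        return alias_generator(field_name)
    # With neither configured, the column name is the field name.
    return field_name


explicit = {"platform_code": "platform"}  # hypothetical per-field alias
columns = [
    resolve_alias(name, explicit, str.upper)
    for name in ("stop_id", "platform_code")
]
no_generator = resolve_alias("stop_id", {}, None)
print(columns, no_generator)
```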