JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License

Speed up `iter_models` (x2-x3 improvement) #117

Closed by gab23r 1 month ago

gab23r commented 1 month ago

Iterating over a polars dataframe by index is actually quite slow; it is better to use `iter_rows`.


import patito as pt
import polars as pl

class MyModel(pt.Model):
    a: int

df = MyModel.DataFrame(pl.DataFrame({"a": range(1_000_000)}))

# main branch (before this PR)
%timeit -r1 -n1 list(df.iter_models(validate_model=False)) # 7.56 s
%timeit -r1 -n1 list(df.iter_models(validate_model=True)) # 6.62 s

# this PR
%timeit -r1 -n1 list(df.iter_models(validate_model=False)) # 3.53 s
%timeit -r1 -n1 list(df.iter_models(validate_model=True)) # 2.39 s

Maybe we should expose `buffer_size` as well?

NB: Weird that it is faster to validate the model, right?

thomasaarholt commented 1 month ago

> NB: Weird that it is faster to validate the model, right?

Yes. I immediately checked the dataframe dtypes and height, but they are unchanged. I have no idea what is going on there 😅

thomasaarholt commented 1 month ago

Good work! Let's leave `buffer_size` out of the equation until someone needs it.