frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Improve pandas plugin performance for large datasets #402

Closed: roll closed this issue 4 years ago

roll commented 7 years ago

Overview

Here is some usage feedback:

"…but performance for large datasets was the main issue. Our users expect us to load a pandas data frame just as fast as they are able to with DataFrame.read_csv(). It might be a good idea to have that as your benchmark."

Plan

danfowler commented 7 years ago

This is pretty important, I would say. Pandas is something I would like to reach for when quickly demonstrating Data Packages. However, this library is currently impractical for datasets of any appreciable size. As an example of the difference: using the storage API to create a set of pandas DataFrames from a modified CMOA collection dataset (28,000 rows) takes over 1 minute, while looping through the resources and calling pandas' native read_csv() on their remote paths takes about 2 seconds.

https://notebooks.azure.com/dfowler/libraries/frictionlessdata/html/collection.ipynb
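
For a quick baseline, here is a minimal timing sketch for the fast path; the storage-based load can be timed the same way. The raw URL is an assumed direct link to the same CMOA CSV:

import time

import pandas as pd

# Assumed raw link to the CMOA collection CSV referenced above
CSV_URL = "https://raw.githubusercontent.com/cmoa/collection/master/cmoa.csv"

start = time.perf_counter()
df = pd.read_csv(CSV_URL)
print(f"read_csv: {time.perf_counter() - start:.2f}s for {len(df)} rows")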

pwalsh commented 7 years ago

I'm sure this is because of row-wise iteration. I haven't checked the pandas source code, but I bet they do not read row by row, and they probably have custom C code to handle data (de)serialisation in general. I think this is directly related to https://github.com/frictionlessdata/tableschema-py/issues/169
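
To illustrate the hypothesis, here is a self-contained sketch (synthetic data, standard-library csv plus pandas) comparing per-row parsing in Python with pandas' vectorized C parser:

import csv
import io
import time

import pandas as pd

# Build a small synthetic CSV in memory so the comparison is self-contained
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "name", "value"])
for i in range(50_000):
    writer.writerow([i, f"row-{i}", i * 1.5])
data = buf.getvalue()

# Row-wise: read each record and cast each cell in pure Python
start = time.perf_counter()
rows = []
for record in csv.DictReader(io.StringIO(data)):
    rows.append({"id": int(record["id"]),
                 "name": record["name"],
                 "value": float(record["value"])})
df_rowwise = pd.DataFrame(rows)
rowwise = time.perf_counter() - start

# Vectorized: pandas' C parser reads and casts whole columns at once
start = time.perf_counter()
df_vectorized = pd.read_csv(io.StringIO(data))
vectorized = time.perf_counter() - start

print(f"row-wise: {rowwise:.3f}s, read_csv: {vectorized:.3f}s")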

roll commented 7 years ago

We need to profile it to be sure. Also, the SQL driver, for example, writes data with buffering, so I suppose we just need to improve the pandas driver's performance. It can't be a storage design limitation on such small datasets (<100k rows).
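
A minimal profiling sketch using the standard library's cProfile (Python 3.8+), taking the pandas conversion from the later comment as the target; tmp/cmoa.csv is an assumed local download of the CMOA dataset:

import cProfile
import pstats

from frictionless import Resource

# Profile the conversion to see where the time goes (parsing, casting,
# row iteration); tmp/cmoa.csv is an assumed local copy of the dataset
with cProfile.Profile() as profiler:
    Resource(path="tmp/cmoa.csv").to_pandas()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)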

roll commented 4 years ago

With the Frictionless Framework:

https://github.com/cmoa/collection/blob/master/cmoa.csv

import pandas as pd
from frictionless import Resource

# tmp/cmoa.csv is a local download of the CSV linked above
df = Resource(path="tmp/cmoa.csv").to_pandas().dataframe  # ~5s
df = pd.read_csv("tmp/cmoa.csv")  # ~1s

This difference is what we are aiming to reduce (we can probably do a little better), given that Frictionless also validates and normalizes the data and supports unlimited row-by-row streaming.
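
For reference, a minimal sketch of that streaming mode, assuming the same local CSV and the row_stream property of the current Resource API:

from frictionless import Resource

# Each streamed row is validated and type-cast against the inferred schema;
# this per-row work is exactly what pandas' read_csv skips
with Resource(path="tmp/cmoa.csv") as resource:
    for row in resource.row_stream:
        pass  # process one validated row at a time; memory use stays flat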