This is pretty important, I would say. Pandas is something I would like to reach for when quickly demonstrating Data Packages, but this library is currently impractical for a dataset of any appreciable size. As an example of the difference: using the storage API to create a set of pandas DataFrames from a modified CMOA collection dataset (28,000 rows) takes over 1 minute, while looping through the resources and calling pandas' native .read_csv() on their remote paths takes about 2 seconds.
https://notebooks.azure.com/dfowler/libraries/frictionlessdata/html/collection.ipynb
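For context, the comparison looks roughly like the following. This is a minimal sketch rather than the notebook's exact code: it assumes datapackage-py's Package/Resource API, a hypothetical local datapackage.json descriptor, and approximates the storage API path with resource.read(), which goes through the same per-row cast/validate machinery.

import time
import pandas as pd
from datapackage import Package  # datapackage-py, the library under discussion

package = Package('datapackage.json')  # hypothetical descriptor path

# Row-wise path: every value passes through tableschema's cast/validate code
start = time.time()
frames = {resource.name: pd.DataFrame(resource.read(keyed=True))
          for resource in package.resources}
print('row-wise load: %.1fs' % (time.time() - start))

# Native path: hand the raw CSV straight to pandas' C parser
start = time.time()
frames = {resource.name: pd.read_csv(resource.source)
          for resource in package.resources}
print('pd.read_csv:   %.1fs' % (time.time() - start))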
I'm sure this is because of row-wise iteration. I haven't checked the pandas source code, but I'd bet it does not read row by row, and it probably has custom C code to handle the (de)serialisation of data in general. I think this is directly related to https://github.com/frictionlessdata/tableschema-py/issues/169
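The C-parser effect is easy to demonstrate with pandas itself: read_csv accepts an engine parameter, and the pure-Python parser is dramatically slower than the default C one, which is the same kind of gap that row-by-row Python iteration faces. A quick sketch (the file path is borrowed from the example below):

import time
import pandas as pd

# Compare pandas' C parser against its pure-Python parser on the same file
for engine in ('c', 'python'):
    start = time.time()
    pd.read_csv('tmp/cmoa.csv', engine=engine)
    print('%s engine: %.2fs' % (engine, time.time() - start))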
We need to profile it to be sure. Also, the SQL driver, for example, writes data with buffering, so I suppose we just need to improve the pandas driver's performance. It can't be a storage design limitation on such small datasets (<100k rows).
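A buffered write for the pandas driver could look something like this sketch (the helper name and signature are hypothetical): collect rows from the stream into fixed-size chunks and build one DataFrame per chunk, concatenating once at the end, instead of growing a frame one row at a time.

import itertools
import pandas as pd

BUFFER_SIZE = 1000  # rows per flush, mirroring tableschema-sql's buffered inserts

def write_rows_buffered(row_iter, columns):
    # Hypothetical helper: consume a row stream in chunks, concatenate once
    chunks = []
    while True:
        buffer = list(itertools.islice(row_iter, BUFFER_SIZE))
        if not buffer:
            break
        chunks.append(pd.DataFrame(buffer, columns=columns))
    if not chunks:
        return pd.DataFrame(columns=columns)
    return pd.concat(chunks, ignore_index=True)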
With Frictionless Framework:
import pandas as pd
from frictionless import Resource

df = Resource(path="tmp/cmoa.csv").to_pandas().dataframe  # 5s
df = pd.read_csv("tmp/cmoa.csv")  # 1s
This is the kind of difference we are about to target (we may be able to do a little better), given that Frictionless also validates/normalizes data and supports unlimited row-by-row streaming.
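The row-by-row streaming mentioned here is exposed directly by the framework; as a sketch, rows can be consumed lazily without materializing the whole table in memory:

from frictionless import Resource

# Each row is cast and validated as it is read; memory use stays constant
with Resource(path="tmp/cmoa.csv") as resource:
    for row in resource.row_stream:
        pass  # process one row at a time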
Overview
Here is the usage feedback: see the report above comparing the storage API with pandas' native .read_csv().
Plan
- Improve the pandas driver's write performance using buffering, as in the tableschema-sql implementation.