frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Add caching mechanism and rework remote loader? #438

Open vitorbaptista opened 8 years ago

vitorbaptista commented 8 years ago

Originally by @femtotrader on https://github.com/trickvi/datapackage/issues/61:

Hello,

I think datapackage should provide a cache mechanism.

For this, if the user wants the cache mechanism, two optional dependencies could be requests and requests-cache.

One possible use could be :

import datapackage
import requests_cache
import datetime
session = requests_cache.CachedSession(cache_name='cache', backend='sqlite', expire_after=datetime.timedelta(days=60))
datapkg = datapackage.DataPackage('http://data.okfn.org/data/cpi/', session=session)

The default value of the session parameter should be None. The session should be stored as a member of DataPackage.

When session is not None, the request will be performed using

self.session.get(url)

Kind regards

PS : a similar approach was used in https://github.com/femtotrader/pandas_datareaders_unofficial

edit: and is now (oct 2015) used in official "pandas-datareader" https://github.com/pydata/pandas-datareader/

see also pydata/pandas-datareader#48
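The session-passing proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions: DataPackageSketch, FakeSession and FakeResponse are hypothetical names, not part of datapackage, and the real DataPackage class does not necessarily accept a session argument. The only requirement on the session is a requests-style get(url) method, which both requests.Session and requests_cache.CachedSession provide.

```python
import json


class DataPackageSketch:
    """Hypothetical stand-in for datapackage.DataPackage (not the real class)."""

    def __init__(self, descriptor_url, session=None):
        self.descriptor_url = descriptor_url
        # Any object with a requests-style get(url) method works here,
        # e.g. requests.Session() or requests_cache.CachedSession(...).
        self.session = session

    def _get(self, url):
        if self.session is not None:
            return self.session.get(url)
        # Imported lazily so that requests is only needed on this code path.
        import requests
        return requests.get(url)

    def load_descriptor(self):
        response = self._get(self.descriptor_url)
        response.raise_for_status()
        return json.loads(response.text)


class FakeResponse:
    """Minimal response look-alike used to exercise the sketch offline."""

    def __init__(self, text):
        self.text = text

    def raise_for_status(self):
        pass


class FakeSession:
    """Hypothetical session that records URLs instead of hitting the network."""

    def __init__(self):
        self.calls = []

    def get(self, url):
        self.calls.append(url)
        return FakeResponse('{"name": "cpi"}')
```

Because the library only calls session.get(url), the caller decides whether that session caches, retries, or records traffic, and the library itself gains no caching code.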

femtotrader commented 8 years ago

This cache mechanism is very important if we want to test the datapackage library against all datapackages in the /datasets organisation. See also https://github.com/frictionlessdata/testsuite-py/issues/12

femtotrader commented 8 years ago

Any news about this? It would help a lot; see https://github.com/datasets/registry/issues/114

vitorbaptista commented 8 years ago

Just adding that, for testing purposes, instead of implementing caching in the library itself, we can apply it only in the tests, using a library like https://github.com/sigmavirus24/betamax.

femtotrader commented 8 years ago

vcr.use_cassette('user')

looks like the monkey patch approach of requests-cache http://requests-cache.readthedocs.io/en/latest/user_guide.html#installation

requests_cache.install_cache()

I'm not a big fan of this approach. I prefer passing a session object, which is much simpler than their approach using a context manager (with)... (my 2 cts)

It's not really about "implementing caching on the library itself"

It's just about changing calls like

requests.get(...)

to

session.get(...)

But anyway, whatever your technical choices are, what matters is being able to find out quickly which datapackages in https://github.com/datasets/registry are not valid. But it seems that Rufus has some ideas / plans.

vitorbaptista commented 8 years ago

Just to be clear, I'm pointing out betamax because, as far as I can see, the reason you're suggesting this task is to allow us to monitor the datasets. With it, we can solve the testing issue without having to add more code to datapackage-py.

femtotrader commented 8 years ago

I didn't know about betamax before. Both can be used with the monkey patch approach, so both would allow us to monitor the datasets.

roll commented 8 years ago

I suppose it's kinda blocked by https://github.com/frictionlessdata/specs/issues/243

pwalsh commented 8 years ago

@roll while frictionlessdata/datapackage-py#243 refers to a cache property, it has a different use and meaning to the above as far as I see.

roll commented 7 years ago

It's also related to https://github.com/frictionlessdata/goodtables-py/issues/140 - both could require providing some custom requests session to tabulator. But caching could lead to other kinds of problems, like memory usage (we do streaming for everything).

So this issue is something to investigate.
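One way the memory concern might be investigated is a cache that stays streaming-friendly, writing each chunk to disk as it passes through rather than buffering the whole body. The sketch below is an assumption for illustration only, not how tabulator or requests-cache work: tee_to_cache and read_cached are hypothetical helpers, and in practice the chunks could come from something like response.iter_content() on a requests response opened with stream=True.

```python
import os


def tee_to_cache(chunks, cache_path):
    """Yield chunks unchanged while also writing them to a cache file.

    Only one chunk is held in memory at a time. The data goes to a
    temporary ".part" file that is renamed to cache_path only after the
    stream has been fully consumed, so a partially read stream never
    produces a cache entry at cache_path.
    """
    tmp_path = cache_path + ".part"
    with open(tmp_path, "wb") as tmp:
        for chunk in chunks:
            tmp.write(chunk)
            yield chunk
    os.replace(tmp_path, cache_path)


def read_cached(cache_path, chunk_size=8192):
    """Stream a previously cached file back in fixed-size chunks."""
    with open(cache_path, "rb") as cached:
        while True:
            chunk = cached.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

With this shape, the consumer iterates over chunks exactly as before, and memory use is bounded by the chunk size rather than by the size of the remote file.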