datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

add CacheFlow class, improve cache implementation #30

Closed OriHoch closed 5 years ago

OriHoch commented 5 years ago

The CacheFlow class allows chaining cache steps:

from dataflows import CacheFlow, cache, load

CacheFlow(
    # the first two resources are cached together
    load('http://example.com/very_large_resource.csv'),
    load('http://example.com/another_very_large_resource.csv'),
    cache(cache_path='.cache/very_large_resources'),
    # the third resource gets its own cache, chained after the first
    load('http://example.com/another_resource.csv'),
    cache(cache_path='.cache/another_resource')
)

This also simplifies the implementation of the cache processor.

coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 122


Totals Coverage Status
Change from base Build 119: 0.2%
Covered Lines: 1088
Relevant Lines: 1389

💛 - Coveralls
coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 195


Totals Coverage Status
Change from base Build 191: 0.3%
Covered Lines: 1138
Relevant Lines: 1458

💛 - Coveralls
akariv commented 5 years ago

I'm pretty reluctant to introduce another 'Flow' class with some modified functionality.

What do you think of this pattern (creating a dual use for the cache processor):


Flow(
  cache(
    step1(),
    step2(),
    checkpoint(path='.cache/checkpoint_1'),
    step3(),
    step4(),
    checkpoint(path='.cache/checkpoint_2'),
  )
)

# equivalent to
Flow(
  cache(
    cache(
      step1(),
      step2(),
      path='.cache/checkpoint_1'
    ),
    step3(),
    step4(),
    path='.cache/checkpoint_2'
  )
)
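
To make the claimed equivalence concrete, here is a minimal, library-free sketch of how a flat step list with checkpoint markers could be folded into the nested form. The `Checkpoint` dataclass, the `nest_checkpoints` helper, and the `("cache", steps, path)` tuples are hypothetical stand-ins invented for illustration; none of them are part of dataflows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical marker standing in for checkpoint(path=...)."""
    path: str


def nest_checkpoints(steps):
    """Fold a flat step list into nested ("cache", inner_steps, path) tuples."""
    current = []
    for step in steps:
        if isinstance(step, Checkpoint):
            # Everything collected so far becomes the body of one cache node.
            current = [("cache", tuple(current), step.path)]
        else:
            current.append(step)
    return current


flat = [
    "step1",
    "step2",
    Checkpoint(".cache/checkpoint_1"),
    "step3",
    "step4",
    Checkpoint(".cache/checkpoint_2"),
]
nested = nest_checkpoints(flat)
```

Each marker wraps every step before it, including earlier cache nodes, so the second checkpoint ends up as the outermost cache, matching the nested example above.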
OriHoch commented 5 years ago

We could incorporate it into Flow and get rid of the cache processor altogether:

Flow(
    step1(),
    step2(),
    checkpoint(path='.cache/checkpoint_1'),
    step3(),
    step4(),
    checkpoint(path='.cache/checkpoint_2'),
)
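
As a rough illustration of the behaviour being proposed, the sketch below runs a flat list of steps and resumes from the latest checkpoint that already exists on disk. It is a toy model under stated assumptions, not the dataflows implementation: `Checkpoint`, `run`, the JSON file format, and the eager in-memory row lists are all hypothetical simplifications (a real flow would stream rows).

```python
import json
import os
import tempfile


class Checkpoint:
    """Hypothetical marker; not the dataflows class."""

    def __init__(self, path):
        self.path = path


def run(steps, source=()):
    """Run steps in order, resuming after the last checkpoint found on disk."""
    start, rows = 0, list(source)
    # Scan backwards so the latest available checkpoint wins.
    for i in range(len(steps) - 1, -1, -1):
        step = steps[i]
        if isinstance(step, Checkpoint) and os.path.exists(step.path):
            with open(step.path) as f:
                rows = json.load(f)
            start = i + 1
            break
    for step in steps[start:]:
        if isinstance(step, Checkpoint):
            # Persist the rows produced so far for later runs.
            os.makedirs(os.path.dirname(step.path) or ".", exist_ok=True)
            with open(step.path, "w") as f:
                json.dump(rows, f)
        else:
            rows = step(rows)
    return rows


calls = []  # records which steps actually executed


def step1(rows):
    calls.append("step1")
    return [1, 2, 3]


def step2(rows):
    calls.append("step2")
    return [r * 10 for r in rows]


def step3(rows):
    calls.append("step3")
    return [r + 1 for r in rows]


with tempfile.TemporaryDirectory() as tmp:
    flow = [step1, step2, Checkpoint(os.path.join(tmp, "checkpoint_1")), step3]
    first = run(flow)   # runs every step and writes the checkpoint
    second = run(flow)  # loads the checkpoint and runs only step3
```

On the second run, `step1` and `step2` are skipped because their output is read back from the checkpoint file, which is the saving the chained checkpoints are meant to provide.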
akariv commented 5 years ago

Yep, even better.

On Wed, Oct 17, 2018 at 10:36 AM Ori Hoch notifications@github.com wrote:

Could incorporate it into flow and get rid of the cache processor altogether

Flow(
    step1(),
    step2(),
    checkpoint(path='.cache/checkpoint_1'),
    step3(),
    step4(),
    checkpoint(path='.cache/checkpoint_2'),
)


OriHoch commented 5 years ago

@akariv fixed
