datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

concurrency? #81

Closed SPTKL closed 5 years ago

SPTKL commented 5 years ago

Can we please have concurrency? dataflows works great for small datasets, but it's rather time-consuming when it comes to slightly larger datasets.

akariv commented 5 years ago

This is a great suggestion, which I've been thinking about for some time now.

Currently we have the datapackage-pipelines library, which can provide concurrency and works well with dataflows, but I agree that it's best to add support for it without needing to install another library.

Since Python isn't very good with concurrency (because of the GIL), we would have to go with a multi-process solution and communicate using the stream / unstream processors (which were designed exactly for this purpose). However, I'd like to design an API for it that would hide that complexity from end users.
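
To make that concrete, here is a minimal sketch of what splitting a flow across two OS processes with stream / unstream might look like. The top-level imports, the idea that stream/unstream accept a file-like object, and the file names are assumptions for illustration; the API I have in mind would hide this plumbing.

```python
# producer.py -- first half of the flow; serializes the in-flight
# datapackage to stdout. Assumes stream() accepts a writable file-like
# object and is importable from the dataflows top level.
import sys

from dataflows import Flow, load, stream

Flow(
    load('large-input.csv'),  # placeholder input
    stream(sys.stdout),
).process()
```

```python
# consumer.py -- second half of the flow; deserializes rows from stdin
# and continues processing in its own OS process.
import sys

from dataflows import Flow, dump_to_path, unstream

Flow(
    unstream(sys.stdin),
    dump_to_path('out'),
).process()
```

Wiring the two with a shell pipe (`python producer.py | python consumer.py`) would run both halves concurrently, with the pipe providing back-pressure between them.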

If you can, please describe your use case: how would you use concurrency? What are the properties of your datasets? It would really help me make this API better and more useful to actual users :)

SPTKL commented 5 years ago

We are trying to use dataflows as our main data production tool/interface. We love how clean and procedural it is, and how it helps us with some of the data validation.

But our production pipeline is more than just Python: we are integrating a lot of dockerized micro-services built on all kinds of stacks for more complex processing scenarios; for example, we have an address-cleaning API, a geocoding API, etc. So concurrency/multiprocessing could save us a lot of time when we are running through steps where we have to make multiple API calls.
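
As a rough illustration of where this bites us today, here is a sketch of the workaround we'd have to write by hand, assuming a generator step that receives `rows` is applied to each resource; the geocoder endpoint and file names are placeholders, not real services:

```python
# Interim workaround (not a dataflows feature): overlap the per-row HTTP
# calls with a thread pool inside a single "rows" step.
from concurrent.futures import ThreadPoolExecutor

import requests
from dataflows import Flow, load, dump_to_path

GEOCODER_URL = 'http://geocoder.internal/lookup'  # placeholder endpoint


def geocode(row):
    # One HTTP call per row to the (placeholder) geocoding micro-service.
    resp = requests.get(GEOCODER_URL, params={'address': row['address']})
    row['geometry'] = resp.json().get('geometry')
    return row


def geocode_rows(rows):
    # Generator step: each resource's rows flow through here, so the
    # HTTP calls overlap instead of running strictly one by one.
    with ThreadPoolExecutor(max_workers=8) as pool:
        yield from pool.map(geocode, rows)


Flow(
    load('addresses.csv'),  # placeholder input
    geocode_rows,
    dump_to_path('geocoded'),
).process()
```

(Executor.map submits all rows up front, so for really large resources this would need chunking; built-in row-level concurrency in dataflows would avoid that kind of hand-rolled code.)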

Also, just a side idea: I think it would be useful to design the dataflows syntax to be similar to pandas or pyspark, just to make it easier for anyone to pick up. Sometimes the parameters for some functions are so complicated, they are like nested dictionaries inside a list or something.

rufuspollock commented 5 years ago

@SPTKL this is super useful feedback.

Could you give one specific example of applying pandas / pyspark style syntax to a current dataflows setup?

In terms of concurrency, can you explain more what you mean? Do you mean that each processor can run concurrently (waiting on input from the previous processor), or more like processing more than one row at once (which assumes independence of rows)?