pwalsh opened this issue 7 years ago
This is in fact describing a map/reduce framework, no? It might be better to leave orchestration of the map/reduce tasks to a dedicated framework (e.g. Hadoop). The tasks themselves, however, could of course be implemented with dpp.
Yes, it is map/reduce. We could explore how to use other frameworks like Hadoop for orchestration.
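To make the map/reduce framing concrete, here is a minimal sketch of the idea outside any framework: split a large CSV into row chunks (map) and combine the partial results (reduce). The chunk size and the `process_rows` step are illustrative assumptions, not dpp's actual API.

```python
import csv
from itertools import islice
from multiprocessing import Pool

CHUNK_SIZE = 100_000  # rows per map task; a tuning assumption

def process_rows(rows):
    # Map step: stand-in for whatever per-row work a dpp processor
    # would do; here it just counts the rows in the chunk.
    return sum(1 for _ in rows)

def read_chunks(path):
    # Lazily yield fixed-size chunks of rows so the 8GB file is
    # never held in memory all at once.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        while True:
            chunk = list(islice(reader, CHUNK_SIZE))
            if not chunk:
                break
            yield chunk

if __name__ == '__main__':
    with Pool() as pool:
        # imap streams chunks to workers instead of materializing
        # every task up front.
        partials = pool.imap(process_rows, read_chunks('big.csv'))
        total = sum(partials)  # Reduce step: combine partial results
    print(total)
```

A dedicated framework like Hadoop would take over exactly the orchestration shown here (chunking, scheduling workers, collecting partials), while each worker's body could remain a dpp task.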
Description
@danfowler has recently been using DPP for some very large source files (an 8GB CSV). With the way pipelines currently work, processing this data through a single conceptual stream is too slow.
There are various ways to deal with this:
I'd like to explore option 3. @akariv what are your thoughts?
From my very high-level view of the framework, there are no inherent design barriers to this. The one concern is that the descriptor object is mutable global state, and the caching mechanisms may be as well. This likely means that spawned processors should only work on the data sources, not on the metadata.
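A rough illustration of that constraint, using plain multiprocessing and an invented row shape (none of this is dpp's real processor API): the spawned workers transform only row data, and any descriptor update happens once in the parent after the parallel section completes.

```python
from multiprocessing import Pool

def transform(row):
    # Workers receive plain row data and never touch the descriptor
    # or the caching layer.
    return {**row, 'value': row['value'] * 2}

if __name__ == '__main__':
    descriptor = {'name': 'big-csv', 'count': 0}  # parent-only mutable state
    rows = [{'value': i} for i in range(10)]
    with Pool() as pool:
        results = pool.map(transform, rows)
    # Metadata is updated exactly once, in the parent, after all
    # workers have finished.
    descriptor['count'] = len(results)
    print(descriptor, results[:3])
```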