frictionlessdata / datapackage-pipelines

Framework for processing data packages in pipelines of modular components.
https://frictionlessdata.io/
MIT License

add split processor to standard library #109

Open OriHoch opened 6 years ago

OriHoch commented 6 years ago

As a dpp user, I want to split or shard data from one or more resources into one or more other resources, based on certain conditions.

Use cases:

See the documentation and tests for this suggested processor here.

akariv commented 6 years ago

At least one use case could be handled by adding a parameter to the filter processor that makes it create a new resource for the filtered rows instead of modifying the source resource. A second could be handled by a 'group-by' sort of processor, which takes a sorted stream and splits it into multiple streams based on a "key" (composed of the values in specific columns). The main problem with the latter is that you need to know the list of distinct values in the data in advance (so that you can modify the resource list in the datapackage), which significantly complicates the implementation.
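
To make the two ideas concrete, here is a rough sketch in plain Python. It does not use the dpp processor API; the function names (`filter_into_new_resource`, `split_sorted_by_key`) and the sample rows are made up for illustration, and rows are assumed to be plain dicts.

```python
from itertools import groupby


def filter_into_new_resource(rows, predicate):
    """Return the matching rows as a new list; the source rows are left untouched."""
    return [row for row in rows if predicate(row)]


def split_sorted_by_key(rows, key_columns):
    """Yield (key, rows) for each run of consecutive rows sharing the same key.

    Assumes `rows` is already sorted by `key_columns`; the key is a tuple
    of the values in those columns.
    """
    def key_of(row):
        return tuple(row[column] for column in key_columns)

    for key, group in groupby(rows, key=key_of):
        yield key, list(group)


# Hypothetical sample data, already sorted by "country".
rows = [
    {"country": "FR", "city": "Lyon"},
    {"country": "FR", "city": "Paris"},
    {"country": "IL", "city": "Haifa"},
]

# Idea 1: the filtered rows become a separate resource, the source stays as-is.
french = filter_into_new_resource(rows, lambda row: row["country"] == "FR")

# Idea 2: one output stream per distinct key.
by_country = dict(split_sorted_by_key(rows, ["country"]))
```

Note that the keys of `by_country` only become known once the whole stream has been read. That is the difficulty described above: the resource list in the datapackage would have to be written before the rows are processed.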