frictionlessdata / datapackage-pipelines

Framework for processing data packages in pipelines of modular components.
https://frictionlessdata.io/
MIT License
117 stars 32 forks

Run DPP through python #180

Closed cschloer closed 4 years ago

cschloer commented 4 years ago

Is there any way to expose the DPP functionality to python? I played around a bunch with importing the functions here, but in the end the only thing I was really able to do was save a pipeline-spec.yaml to the file system and open a subprocess to run DPP.
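For reference, the workaround described above might look roughly like this. This is a minimal sketch, not code from the project: the spec dict, pipeline id, and processor names are placeholders, and it leans on the fact that any JSON document is also valid YAML to avoid a PyYAML dependency. `dpp run all` is the real CLI entry point.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical pipeline spec; "my-pipeline" and the processor names
# are placeholders for illustration only.
SPEC = {
    "my-pipeline": {
        "pipeline": [
            {"run": "load_resource", "parameters": {"url": "data/source.csv"}},
            {"run": "dump_to_path", "parameters": {"out-path": "out"}},
        ]
    }
}

def write_spec(spec: dict, workdir: Path) -> Path:
    """Write the spec dict as pipeline-spec.yaml.

    Every JSON document is valid YAML, so json.dumps is enough here
    and keeps this sketch stdlib-only.
    """
    path = workdir / "pipeline-spec.yaml"
    path.write_text(json.dumps(spec, indent=2))
    return path

def run_dpp(workdir: Path) -> int:
    """Shell out to `dpp run all` in workdir; return the exit code."""
    result = subprocess.run(["dpp", "run", "all"], cwd=workdir)
    return result.returncode
```

The pain point is visible right away: the spec has to round-trip through the file system, and the results come back the same way.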

It would be great if:

  1. You could pass in a dictionary that described the pipeline-spec.yaml (including processors, etc.) rather than having to use the file system
  2. It returned some kind of async object you could join on
  3. There was an option to actually get the result of the pipeline run from joining.

To elaborate a bit on point 3 - currently I am programmatically adding a dump_to_path step at the very end, waiting for the pipeline to finish, and then reading the file that was dumped. It would be much better if the results of the pipeline (i.e. all of the rows and the datapackage) were just returned by the function call.
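To make the request concrete, here is a sketch of the shape such an API could take. Everything in it is hypothetical - `run_pipeline_async`, `execute_pipeline`, and the return shape are not part of DPP; the point is only that a `concurrent.futures.Future` would satisfy both the "async object you can join on" and the "result from joining" items.

```python
import concurrent.futures

def execute_pipeline(spec: dict) -> dict:
    """Stand-in for a real runner: instead of dumping to disk, it
    returns the datapackage plus all rows. A real implementation
    would invoke DPP's internals here."""
    rows = [{"id": 1}, {"id": 2}]  # fake output rows for the sketch
    return {"datapackage": {"name": spec.get("name", "unnamed")}, "rows": rows}

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def run_pipeline_async(spec: dict) -> "concurrent.futures.Future[dict]":
    """Take a spec dict (wish-list item 1), return a Future you can
    join on (item 2) whose result carries the datapackage and rows
    (item 3)."""
    return _pool.submit(execute_pipeline, spec)
```

Usage would then be `future = run_pipeline_async(spec)` followed by `result = future.result()`, with no pipeline-spec.yaml or dump_to_path involved.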

roll commented 4 years ago

Hi @akariv, could you please take a look?

roll commented 4 years ago

@cschloer I'm putting it on hold from my side for now because:

PS. Just curious whether there is a possibility to create a bridge between DPP and dataflows and run dataflows under the hood instead of DPP. It would feel more natural when we need to get a response (data/metadata) programmatically
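For comparison, dataflows already exposes exactly this programmatically: `Flow(...).results()` returns `(results, datapackage, stats)` directly, and plain one-argument functions act as row processors. A small sketch, with the dataflows call guarded since it needs `pip install dataflows`:

```python
def add_double(row):
    # Plain row processor: a one-argument function like this can be
    # passed directly as a dataflows step.
    row["double"] = row["value"] * 2

DATA = [{"value": 1}, {"value": 2}]

def run_with_dataflows():
    # Requires `pip install dataflows`. Flow(...).results() returns
    # (results, datapackage, stats) - data and metadata in memory,
    # no dump step needed.
    from dataflows import Flow
    results, datapackage, _stats = Flow(DATA, add_double).results()
    return results[0], datapackage
```

This is the "more natural" programmatic response roll is alluding to: the rows and the datapackage come back as return values rather than files.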

cschloer commented 4 years ago

I think it would be great to transition to only using dataflows. Most of our custom processors are still written in the old DPP structure, but it would be worth it to transition them entirely to dataflows if it meant getting access to stuff programmatically. My only concern is keeping the structure of pipeline-spec.yaml --> some open source, widely available program --> processed data. It would theoretically be easy to write something that parses a pipeline-spec.yaml and runs all of the relevant dataflows, but it's also important that it be easy to download and easy to run on the command line (which is why dpp was so great).
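The "parse a pipeline-spec and run the relevant dataflows" idea can be sketched in a few lines. This is a toy runner, not DPP or dataflows code: the step registry and step names (`load_inline`, `add_constant`) are invented for illustration, and real dataflows steps would replace the generator functions.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical step implementations, generator-style: each takes the
# upstream row iterator plus its parameters and yields rows onward.
def load_inline(rows: Iterable[dict], params: dict) -> Iterator[dict]:
    yield from params["data"]

def add_constant(rows: Iterable[dict], params: dict) -> Iterator[dict]:
    for row in rows:
        row[params["field"]] = params["value"]
        yield row

# Invented registry mapping the spec's "run" names to processors.
REGISTRY: dict[str, Callable] = {
    "load_inline": load_inline,
    "add_constant": add_constant,
}

def run_spec(spec: dict) -> list[dict]:
    """Chain each step's iterator over the previous one, like a tiny
    pipeline runner driven by a parsed pipeline-spec dict."""
    rows: Iterable[dict] = iter(())
    for step in spec["pipeline"]:
        rows = REGISTRY[step["run"]](rows, step.get("parameters", {}))
    return list(rows)
```

Wrapping something like this in a small CLI that reads pipeline-spec.yaml would preserve the easy-to-download, easy-to-run property the comment asks for.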

roll commented 4 years ago

Yeah, I think the ability to run DPP steps in dataflows / a dataflows bridge would be great