Porcupine is a tool for people who want to express general data manipulation and analysis tasks in Haskell. It specifically targets teams whose skills range from those of data scientists to those of data/software engineers.
Porcupine's development happens mainly inside NovaDiscovery's internal codebase, where a fork of porcupine resides, but we regularly synchronise this internal repo with porcupine's GitHub repo. This is why commits tend to appear in batches on porcupine's GitHub.
Lately, a lot of effort has been invested in developing Kernmantle, which should provide the new task representation (see Future plans below).
Issues and MRs are welcome :)
## Future plans

These features are being developed and should land soon:

- `porcupine-servant`: a servant app can directly serve porcupine's pipelines as routes, and expose a single configuration for the whole server
- `runPipelineTask` would remain in place, but as a tiny wrapper over a slightly lower-level API. This makes it easier to run pipelines in different contexts (like that of `porcupine-servant`)

The following are things we'd like to start working on:
- `cas-store`: porcupine's dependency on funflow is mainly for the purpose of caching. Now that `cas-store` is a separate project, porcupine can directly depend on it. This will simplify the implementation of `PTask` and make it easier to integrate `PTask`s with other libraries.
- `PTask` over a Kernmantle `Rope`: this is the main reason we started the work on Kernmantle, so it could become a uniform pipeline API, independent of the effects the pipeline performs (caching, collecting options or required resources, etc.). Both porcupine and funflow would become collections of Kernmantle effects and handlers, and would therefore be seamlessly interoperable. Developers would also be able to add their own custom effects to a pipeline. This would probably mean the death of `reader-soup`, as the `LocationAccessor`s could be embedded directly as Kernmantle effects.
- `VirtualTree` as a separate package: all the code that is not strictly speaking related to tasks would be usable separately (for instance in Kernmantle effect handlers).

## F.A.Q.

**How are Porcupine and Funflow related?**

Porcupine uses Funflow internally to provide caching. Funflow's API is centered around the `ArrowFlow` class. `PTask` (porcupine's main computation unit) implements `ArrowFlow` too, so the usual funflow operations are also usable on `PTask`s.
Aside from that, funflow and porcupine don't operate at the same level of abstraction: funflow is for software developers building applications the way they want, while porcupine is higher-level and more featureful, targeting modelers and data analysts as well as software developers. However, porcupine doesn't impose any choice of computation or visualization libraries; that part remains up to the user.
The main goal of porcupine is to be a tool to structure your app: a backbone that helps you kickstart e.g. a data pipeline/analytics application while keeping the boilerplate (config, I/O) to a minimum, and a common framework if you have code (tasks, serialization functions) to share between several applications of that type. But since the arrow and caching APIs are the same in funflow and porcupine, as a software developer you can start by using porcupine, and if you realize you don't actually need the high-level features (config, rebinding of inputs, logging, etc.), drop the dependency and transition to funflow's level.
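To make that concrete, here is a minimal sketch in the style of this repo's examples (the `User` and `Analysis` types and the analysis itself are placeholders, and `FullConfig`'s exact arguments may differ across versions): `PTask`s are built and composed with the ordinary arrow combinators, just like funflow flows.

```haskell
{-# LANGUAGE DeriveGeneric     #-}
{-# LANGUAGE OverloadedStrings #-}

import Control.Arrow (arr, (>>>))
import Data.Aeson (FromJSON, ToJSON)
import GHC.Generics (Generic)
import Porcupine

-- Placeholder types, standing in for real pipeline data:
newtype User = User { userName :: String } deriving (Generic)
instance FromJSON User

newtype Analysis = Analysis { nameLength :: Int } deriving (Generic)
instance ToJSON Analysis

-- Inputs and outputs are declared as virtual files that the end user
-- can rebind in the configuration:
userFile :: DataSource User
userFile = dataSource ["Inputs", "User"] (somePureDeserial JSONSerial)

analysisFile :: DataSink Analysis
analysisFile = dataSink ["Outputs", "Analysis"] (somePureSerial JSONSerial)

-- PTask is an Arrow (and an ArrowFlow), so tasks compose with the
-- usual combinators:
mainTask :: (LogThrow m) => PTask m () ()
mainTask =
  loadData userFile
  >>> arr (Analysis . length . userName)
  >>> writeData analysisFile

main :: IO ()
main = runPipelineTask
  (FullConfig "example" "porcupine.yaml" "." ())
  (baseContexts "")
  mainTask ()
```

Because each `PTask` composed this way carries its resource requirements, `runPipelineTask` can expose them all in the pipeline's configuration file.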
**Can porcupine run tasks in a distributed fashion?**

Funflow provides a worker daemon that the main pipeline can distribute docker-containerized tasks to. For pure Haskell functions there is funflow-jobs, but it's experimental. So porcupine could be used with funflow-jobs, but for now it has only ever been used for the parallel execution of tasks. We recently started thinking about how the funflow/porcupine model could be adapted to run a pipeline on a cluster in a decentralized fashion, and we have some promising ideas, so that feature may appear in the future.
Another solution (the one used by our client) is to use an external job queue (like Celery) that starts porcupine pipeline instances. This is made easy by the fact that the entire configuration of a pipeline instance is exposed by porcupine, and can therefore be set by the program that puts the jobs in the queue (as one JSON file).
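Concretely, since `runPipelineTask` reads the whole instance configuration from one file, a queue worker only has to write that file before starting the pipeline executable. A hypothetical sketch, reusing `mainTask` from the example above (file and program names are made up):

```haskell
-- The entire configuration of this pipeline instance (location
-- bindings, options) lives in one file, "job-42.yaml" (a made-up
-- name). The program that enqueues the job can generate that file from
-- the job's JSON payload (YAML being a superset of JSON), then start
-- this executable.
main :: IO ()
main = runPipelineTask
  (FullConfig "my-pipeline" "job-42.yaml" "." ())
  (baseContexts "")
  mainTask ()
```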
**Can I use porcupine without `runPipelineTask`?**

Of course! That means you would replace the call to `runPipelineTask` with custom code. You will want to have a look at the `splitTask` lens. It separates a task into its two components: its `VirtualTree` of requirements (which you can treat however you please, the goal being to turn it into a `DataAccessTree`) and a `RunnableTask`, which you can feed to `execRunnableTask` once you have composed a `DataAccessTree` to feed it. Note, though, that this part of the API might change a bit in future versions.
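A rough sketch of what that could look like (everything here besides `splitTask` and `execRunnableTask` is hypothetical, and the exact signatures and argument order are guesses, since, as said, this part of the API may change):

```haskell
import Control.Lens (view)
import Porcupine

-- Hypothetical custom driver: split the task, resolve its requirements
-- ourselves, then execute the runnable part.
runTaskMyWay :: (LogThrow m) => PTask m a b -> a -> m b
runTaskMyWay task input = do
  -- splitTask separates the task into its VirtualTree of requirements
  -- and a RunnableTask:
  let (virtualTree, runnable) = view splitTask task
  -- myResolveTree is hypothetical: turn the requirements into a
  -- DataAccessTree however you please (custom locations, in-memory,
  -- etc.); its definition is left to the reader:
  accessTree <- myResolveTree virtualTree
  -- Feed the DataAccessTree and the input to the runnable part
  -- (argument order is a guess):
  execRunnableTask runnable accessTree input
```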
**Is porcupine related to Hedgehog?**

Can see where that comes from ^^, but nope, not all R.O.U.S.s are related. (And also, hedgehogs aren't rodents.) Although we do have a few tests using Hedgehog (and will possibly add more).