datatogether / task_mgmt

Service for managing & executing archiving tasks written in Golang
https://task-mgmt.archivers.space
GNU Affero General Public License v3.0

Investigate applicability of workflow execution systems #6

Open mhucka opened 7 years ago

mhucka commented 7 years ago

In the comments for PR #4, discussions arose about the possible relationships between the task management system and existing workflow frameworks, and whether existing general workflow execution systems could be adapted to Data Together's needs in order to avoid having to reinvent wheels. We decided the discussion should be moved to a separate issue, and this is it.

The comment I made was the following:

  1. This sounds like a workflow, and the underlying execution system a workflow application or workflow framework. Is it that, or is that going too far?
  2. How does it relate to whole frameworks such as Airflow and/or Celery? Could we envision extending Data Together's task execution to use an existing framework (thus reducing wheel reinvention)?
  3. A nice thing about analogies to workflow systems is that there are GUIs for such things, and maybe they could be adapted for Data Together. I have in mind things that provide graphical interfaces like this, although that one's a desktop application. I think there are SDKs for building GUIs like that; for instance, viewflow.
  4. At some point, as Data Together grows, its task execution model is bound to grow as well. (Cf. Zawinski's Law.) It will need an interpreter for the task execution language. Looking ahead to that, it may be worth keeping an eye on examples that could either serve as a template (e.g., CWL? or YAWL?) or as examples of what to avoid (not to disparage any particular effort, but something like BPEL would probably be overkill).

In retrospect, re-reading the whole list of comments now, it becomes clearer that @kelson42 was already making essentially the same point using the specific case of zimfarm.

@flyingzumwalt replied to @kelson42 on 2017-07-06:

@kelson42 the pattern does look very similar! Wonderful. zimfarm looks like a task management tool specifically for zim files. datatogether is aimed at establishing a pattern for any community to replicate, manage, annotate, and process any data, using decentralized storage patterns. this datatogether/task-mgmt repo is providing some of the tooling to support that pattern. It will be great if we can cross-pollinate between the two projects.

There are lots of motivations for using task-mgmt with all sorts of other data that have nothing to do with wikipedia, but the two main motivations for using task-mgmt with wikipedia zim dumps are:

• write the wikipedia dumps to IPFS
• modify the data as part of the harvest, for example applying the disclaimer and brand changes applied by https://github.com/ipfs/distributed-wikipedia-mirror/blob/master/execute-changes.sh

Will it be possible to do those two things with zimfarm?
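For concreteness, a minimal sketch of the first of those two steps in Go (the language of this repo) might look like the following, assuming a local IPFS daemon and the github.com/ipfs/go-ipfs-api client; the dump filename is hypothetical:

```go
package main

import (
	"fmt"
	"log"
	"os"

	shell "github.com/ipfs/go-ipfs-api"
)

func main() {
	// Connect to a local IPFS daemon on its default API port (assumed running).
	sh := shell.NewShell("localhost:5001")

	// Open a wikipedia zim dump (hypothetical filename) and add it to IPFS.
	f, err := os.Open("wikipedia_en_all.zim")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	cid, err := sh.Add(f)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("added dump with CID:", cid)
}
```

The second step (modifying the data during harvest) would run something like execute-changes.sh over the content before adding it.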

@kelson42 replied on 2017-07-06:

@flyingzumwalt

The worker part of zimfarm is based on Docker. A job/task is basically a:

• docker image name
• list of instructions (bash)

So I tend to say yes. Might really make sense to share the whole scheduling part of the solution... then everybody can build their own Docker images and jobs to do whatever they want.
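In Go terms, the job shape described here could be modeled as in the sketch below; the field names and example values are illustrative guesses, not zimfarm's actual schema:

```go
package main

import "fmt"

// Job mirrors the zimfarm-style task shape described above: a Docker
// image name plus an ordered list of bash instructions to run inside
// a container of that image.
type Job struct {
	Image        string   // Docker image name
	Instructions []string // bash commands, executed in order
}

func main() {
	// Hypothetical job: the image name and commands are made up.
	job := Job{
		Image:        "example/worker",
		Instructions: []string{"echo 'harvest step 1'", "echo 'harvest step 2'"},
	}
	fmt.Printf("%+v\n", job)
}
```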

@dcwalk later asked on 2017-07-09:

still wrapping my head around the task types, could you unpack "perform tasks listed by a Github Repo" a little bit? (I think you mean https://github.com/datatogether/task-mgmt/blob/task_pipeline/taskdefs/ipfs/github_add.go ?)

@b5 replied on 2017-07-10:

Apologies, that's very vague phrasing, mainly b/c it's unfinished work. I do mean the bit in taskdefs/ipfs/github_add.go. What this means is that we can have a task that looks for special sets of instructions in a GitHub repo & performs them. "Special instructions" could be a Dockerfile with a CMD entry, or they could be a Foreman Procfile. I mention GitHub because we can incorporate GitHub permissions into the task workflow. Because this amounts to arbitrary code execution, we'll need to be very careful about how we set up who can & can't initiate this type of task, and we can use GitHub to scope these tasks to things like "only users who have write access to repo x have permission to initiate this task". [...] What a task could be is intentionally vague. I'm currently thinking about tasks as repeatable actions that transition content to the distributed web. This includes moving things onto IPFS, but also everything from the world of metadata, and the list of different task types from above. Any of these tasks can take an arbitrary amount of time, which is why we want to queue them up.
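As a rough illustration of that permission-scoping idea (not code from this repo), a task runner could ask GitHub's REST API whether the requesting user has write access before letting them initiate a task. The sketch below uses GitHub's documented collaborator-permission endpoint; the helper name and the hard-coded owner/repo/user values are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// permissionFor is a hypothetical helper that asks the GitHub API what
// permission level `user` holds on owner/repo. The endpoint returns one
// of "admin", "write", "read", or "none".
func permissionFor(owner, repo, user, token string) (string, error) {
	url := fmt.Sprintf(
		"https://api.github.com/repos/%s/%s/collaborators/%s/permission",
		owner, repo, user)
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "token "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var body struct {
		Permission string `json:"permission"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return "", err
	}
	return body.Permission, nil
}

func main() {
	// Gate task initiation on write access, per the pattern described above.
	perm, err := permissionFor("datatogether", "task-mgmt", "someuser", "<token>")
	if err != nil {
		log.Fatal(err)
	}
	if perm == "admin" || perm == "write" {
		fmt.Println("user may initiate this task")
	}
}
```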

The task/taskable naming is, well, awful. Taskable is supposed to say "hey, if you want to be considered an action that we can work with, you'll need these methods". I'm hoping to improve on the naming in the future. The first place to start may be to rename Task to TaskStatus, and make Taskable the principal Task interface, because satisfying the Taskable interface is the most important thing to do from a dev perspective.
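For readers who haven't looked at the code, a Taskable-style interface along the lines described might look like the sketch below. This is an assumption about the shape, not the repo's actual definition; the method names and the Progress type are guesses:

```go
package tasks

// Progress is an assumed progress-report type; the repo's real type
// may differ.
type Progress struct {
	Percent float32 // 0.0–1.0 completion estimate
	Status  string  // human-readable status message
	Error   error   // non-nil if the task failed
}

// Taskable sketches the interface described above: anything that wants
// to be treated as a runnable task provides these methods.
type Taskable interface {
	// Valid checks the task's parameters before it is queued.
	Valid() error
	// Do performs the work, sending progress updates on the channel.
	Do(updates chan Progress)
}
```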

After that, I posted the comment at the beginning of this issue. @dcwalk replied to strongly agree with items 2 & 4. Then in turn, @lgierth replied as follows:

Strongly agree that there's very likely some existing software that matches the requirements, and helps avoid reinventing the wheel.

Current CI (continuous integration) systems like Jenkins might also be worth a look. They come with:

• all kinds of input (e.g. github webhooks) and output (artifact files, storage systems) adapters
• build steps and dependencies (pipelines)
• workers and spawning them on-demand
• big communities of developers and sysadmins with experience running/developing them

@flyingzumwalt replied on 2017-07-11:

I agree with the inclination to avoid reinventing wheels. The key here, with respect to datatogether, is that we want to encourage ongoing experimentation in this domain. This experimentation should be permissionless (anyone can cook up a new solution and share it with the world), and loosely coordinated (if you have an idea, you share it with collaborators and try to build on existing momentum where possible).

Right now there are at least two interesting experiments within the data rescue domain:

• data rescue boston have produced a compelling tool -- a library that works with morph.io
• @b5 has written this proof of concept, which is immediately useful for tasks we need to manage.

The most compelling aspect of the work in this current PR is the pattern of using Pull Requests (on github) as a point of quality control and security review before tasks get modified. This allows us to rely on the existing transparency of git Pull Requests and community patterns around github PRs to ensure that the code (and docker containers, etc) that we use are safe, repeatable, and maintained in a transparent fashion. I think this is a very compelling pattern to explore.

It's definitely worth considering DAG-based workflow automation tools like Airflow, Celery, etc. Jenkins is also a good option to consider for the mid-to-long term. If we adopt tools like that, the main thing to carry over from the current proof of concept is this quality-control-via-PRs pattern.

He also rightly pointed out that the PR shouldn't be held up by long-term considerations on the topic of workflow systems, at which point we discussed splitting the comments out into this separate issue.

And here we are.

kelson42 commented 7 years ago

A few small additional remarks about this: