Task pipeline - Githubissues

b5 commented 7 years ago

This is an overhaul of the task-mgmt, transitioning it from a one-off proof-of-concept to an extensible backend service.

This service started as a standalone example of a single workflow:

Use datatogether/identity to log in with GitHub
If the user has write permissions for a specified GitHub repo, it would give them the option to request the repo be executed against a resource (in this case kiwix dumps of wikipedia).
The service would email the relevant parties, asking them to manually perform the task.

So, uh, it didn't do much, but it did set the stage for authenticated task-initiation, which remains a big area in need of development.

This PR Changes task-mgmt to become a service oriented around tasks on a queue, and introduces an interface for writing new kinds of tasks, extending the capabilities of data together over time. Since starting to work with this task-oritented pattern, I've come to believe the majority of work we've been doing in gov archiving work are large, human versions of this pattern, and this gives us a way of expressing those tasks in code, which makes for very exciting potential.

So, breaking the concepts in this PR down:

tasks are any kind of work that needs to get done, but specifically work that would take longer, than say, a web request/response cycle should take. An example of a task can be "put this url on IPFS". another might be "identify the filetype of these 30,000 files, putting the results in a database".

Because this work will take anywhere from milliseconds to days, and may require special things to do that work, it makes sense to put those tasks in a queue, which is just a giant, rolling list of things to do, and have different services be able to add tasks to the queue, and sign up to do tasks as they hit the queue. This PR introduces a first-in-first-out (FIFO) queue to start with, meaning the first thing to get added is the first thing to get pulled off a list.

The queue itself is a server, specifically a rabbitmq sever, it's open source, and based on the open amqp protocol. This means that things that work with the queue don't necessarily need to be written in go. More on that in the future.

The task-mgmt service does just what it says on the tin. It's main job is to manage not just tasks, but the state of tasks as they move through the queue, questions like "what tasks are currently running?" are handled with this PR. As tasks are completed task-mgmt updates records of when tasks started, stopped, etc.

this PR removes all user interfaces and instead introduces both a JSON api and an remote procedure call (RPC) api, the RPC api will be used to fold all of task-mgmt into the greater datatogether api. I know, that's the word api a million times, basically this means we'll have a PR on the datatogether/api to expose tasks so that outside users will access tasks the same way they access, say, coverage, or users. Only internal services will need to use the task-mgmt JSON api as a way of talking across languages.

All of these changes turn the task-mgmt server into a backend service, so that we can fold all task stuff into the existing frontend. This means once the UI is written you'll be able to view, create, and track the progress of tasks from the standard webapp. PR on datatogether/context to follow.

Along with tracking tasks, task-mgmt both add to and reads from the queue. This might seem strange, but it makes for a more simple starting point. Later on down the road lots of different services may be registered to accept tasks from the queue, at which point we can transition task-mgmt to a role of just adding to the queue and tracking progress.

But most importantly of all, this PR also introduces a new subpackage task-mgmt/tasks which outlines the initial interface for a task definition, which is the platform on which tasks can be extended to all sorts of things. Getting this interface right is going to take some time, so I'd like to take some time to write an initial round of task-types, and then re-evaluate the interface. those initial task-types are:

add a url to IPFS (included in this PR as an example & initial test-case)
calculating coverage statistics
submit a url for archiving to a set of webarchiving services: archive.org, perma.cc
checking various archiving services for their records of a given url
perform morph.io run (what up @jeffreyliu)
perform tasks listed by a Github Repo
ingest a kiwix zim file to an IPFS node
send an email using postmark transactional mail service, (for requesting an IPNS repoint)
combine two or more of the above tasks into a single "meta" task

This list of task-types is aimed at the high-priority needs from the community. Combining the first 4 task-types with the soon-to-land collections feature gives us everything we need to satisfy some latent EDGI needs (what up @titaniumbones), morph.io runs connect us to the work team boston has been up to, and the rest are for the Protocol Labs Wikipedia-on-IPFS project (what up @flyingzumwalt ). I'm hoping to land all these in a series of PR's in the next 10 days. Once those are landed we'll have to put some sort of permissions-based barriers in place to dicate who is allowed to create what sorts of tasks, that will be a job for a different service.

The next round of task-types can/might include:

collection collation (crawling link graphs to find related groups of content)
simhash / minhash diffing calculations for a timeseries of a url
apply FITS to identify file types
de-duplicate WARC files
metadata extraction (what up @mhucka)
crawling WARC files onto IPFS
kicking off entire web crawls
IPFS cluster orchestration

From a programmer-participation perspective, we can heavily document how defining a task works, and this will provide a great way for devs to extend the datatogether platform to do things we haven't yet thought of. Lots of ideas for new task-types come up from places like the ipfs/archives repo.

I'd like to get this merged in order to get working on surfacing tasks within the webapp & API, but discussing the merits of this approach / potential alternatives are in no way off the table. Also, feel free to drop questions, as I'll work them into the readme!

flyingzumwalt commented 7 years ago

cc @lgierth @kubuxu @whyrusleeping

flyingzumwalt commented 7 years ago

Exciting! It would be great to have a code snippet that shows how you would configure a pipeline and run it. For example, what would I need to do to set up a process to

Pull the ZIM dump of english wikipedia from http://download.kiwix.org/zim/wikipedia_en_all.zim onto ipfs
Run these steps on it
Email the new hashes (unprocessed zim dump and processed version) to the maintainers of the distributed-wikipedia-mirror project

Note: in real-world scenario, you also need to figure out strategies for pinning, unpinning and garbage collecting ipfs content from these processes -- need to keep it pinned long enough for people to replicate the results onto the destination machines, but don't want to have all the content accumulating on servers that are set up for ephemeral process runs.

kelson42 commented 7 years ago

It seems we are currently setting-up a pretty similar infrastructure for creating ZIM files :( https://github.com/openzim/zimfarm

flyingzumwalt commented 7 years ago

@kelson42 the pattern does look very similar! Wonderful. zimfarm looks like a task management tool specifically for zim files. datatogether is aimed at establishing a pattern for any community to replicate, manage, annotate, and process any data, using decentralized storage patterns. this datatogether/task-mgmt repo is providing some of the tooling to support that pattern. It will be great if we can cross-pollinate between the two projects.

There are lots of motivations for using task-mgmt with all sorts of other data that have nothing to do with wikipedia, but the two main motivations for using task-mgmt with wikipedia zim dumps are:

write the wikipedia dumps to IPFS
modify the data as part of the harvest, for example applying the disclaimer and brand changes applied by https://github.com/ipfs/distributed-wikipedia-mirror/blob/master/execute-changes.sh

Will it be possible to do those two things with zimfarm?

kelson42 commented 7 years ago

@flyingzumwalt

The worker part of zimfarm is based on Docker. A job/task is basically a:

docker image name
list of instructions (bash)

So I tend to say yes. Might really make sense to share the whole scheduling part of the solution... then everybody can build its own Docker images and jobs to do whatever they want.

dcwalk commented 7 years ago

@b5 -- sounds neat, still wrapping my head around the task types, could you unpack "perform tasks listed by a Github Repo" a little bit? (I think you mean https://github.com/datatogether/task-mgmt/blob/task_pipeline/taskdefs/ipfs/github_add.go ?)

I guess I'm trying to imagine the range of actions (sorry wrong vocab) subsumed within a task and looking at the code I'm positive I'm not parsing marshalling/taskable/task correctly.

b5 commented 7 years ago

@flyingzumwalt

Exciting! It would be great to have a code snippet that shows how you would configure a pipeline and run it. For example, what would I need to do to set up a process to:

Pull the ZIM dump of english wikipedia from http://download.kiwix.org/zim/wikipedia_en_all.zim onto ipfs

Run these steps on it

Email the new hashes (unprocessed zim dump and processed version) to the maintainers of the distributed-wikipedia-mirror project

Loud & clear on the example, I'll work on documenting one. In the context of this PR accomplishing this task would amount to submitting a pull request to this repo that explicitly lists the task, which users can then initiate from datatogether.org (so long as they have the right permissions). In the future it may be possible for users to compose disparate tasks into chains-of-tasks from the frontend, but that sounds complicated.

Note: in real-world scenario, you also need to figure out strategies for pinning, unpinning and garbage collecting ipfs content from these processes -- need to keep it pinned long enough for people to replicate the results onto the destination machines, but don't want to have all the content accumulating on servers that are set up for ephemeral process runs.

Yes. I'd love to chat more about this one. My initial thought was to store the data-intensive results of these tasks in some s3-like thing & mounting this as the volume ephemeral IPFS nodes read from, but I'd like to learn more about IPFS cluster, and the thing together about long term planning of this infra. Especially as it relates to the un-finalized thing that member institutions & users download to participate in holding "data together data".

@dcwalk

still wrapping my head around the task types, could you unpack "perform tasks listed by a Github Repo" a little bit? (I think you mean https://github.com/datatogether/task-mgmt/blob/task_pipeline/taskdefs/ipfs/github_add.go ?)

Apologies, that's very vague phrasing, mainly b/c it's unfinished work. I do mean the bit in taskdefs/ipfs/github_add.go. What this is to say is we can have a task that looks for special sets of instructions in a GitHub repo & performs them. "Special instructions" could be a dockerfile with a CMD entry, or they could be a foreman procfile. I mention Github because we can incorporate GitHub permissions into the task workflow. Because this amounts to arbitrary code execution, we'll need to be very careful about how we set up who can & can't initiate this type of task, and we can use Github to scope these tasks to things like "only users who have write access to repo x have permission to initiate this task".

I guess I'm trying to imagine the range of actions (sorry wrong vocab) subsumed within a task and looking at the code I'm positive I'm not parsing marshalling/taskable/task correctly.

What a task could be is intentionally vague. I'm currently thinking about tasks as repeatable actions that transition content to the distributed web. This includes moving things onto IPFS, but also everything from the world of metadata, and the list of different task types from above. Any of these tasks can take an arbitrary amount of time, which is why we want to queue them up.

The task/taskable naming is, well, awful. Taskable is supposed to say "hey, if you want to be considered an action that we can work with, you'll need these methods". I'm hoping to improve on the naming in the future. The first place to start may be to rename Task to TaskStatus, and make Taskable the principle Task interface, because satisfying the Taskable interface is the most important thing to do from a dev perspective.

@kelson42 Great to see you're doing similar stuff! If there are things we could to do make this service more useful for your needs, I'd love to hear about it! I'll also keep an eye on zimfarm & see if we can't use your code once we find our legs!

mhucka commented 7 years ago

I had a similar question to @dcwalk's about the tasks in a GitHub repo – thanks for asking, and @b5 for answering. I was also wondering: would it make sense to allow tasks to be written in Gists?

mhucka commented 7 years ago

Stepping back for a moment, a few things jump to mind when I read "task execution system":

This sounds like a workflow, and the underlying execution system a workflow application or workflow framework. Is it that, or is that going too far?
How does it relate to whole frameworks such as Airflow and/or Celery? Could we see extending Data Together's task execution to use an existing framework (thus reducing wheel reinvention)?
A nice thing about analogies to workflow systems is that there are GUIs for such things, and maybe they could be adapted for Data Together. I have in mind things that provide graphical interfaces like this, although that one's a desktop application. I think there are SDK for building GUIs like that; for instance, viewflow.
At some point, as Data Together grows, its task execution model is bound to grow as well. (C.f. Zawinski's law.) It will need an interpreter for the task execution language. Looking ahead at that, it may be worth keeping an eye on examples that could either serve as a template (e.g., CWL? or YAWL?) or as examples of what to avoid (not to disparage any particular effort, but something like BPEL would probably be overkill).

mhucka commented 7 years ago

Unrelated to the above, could you also unpack "crawling WARC files onto IPFS"?

dcwalk commented 7 years ago

@mhucka -- great points! Agree strongly with 2 & 4 :)

ghost commented 7 years ago

Hey all, @flyingzumwalt asked if I had some input too.

Strongly agree that there's very likely some existing software that matches the requirements, and helps avoid reinventing the wheel.

Current CI (continuous integration) systems like Jenkins might also be worth a look. They come with:

all kinds of input (e.g. github webhooks) and output (artifact files, storage systems) adapters
build steps and dependencies (pipelines)
workers and spawning them on-demand
big communities of developers and sysadmin with experience running/developing them

(hi @dcwalk o/ we met through toronto meshnet a few times)

flyingzumwalt commented 7 years ago

I agree with the inclination to avoid reinventing wheels. The key here, with respect to datatogether, is that we want to encourage ongoing experimentation in this domain. This experimentation should be permissionless (anyone can cook up a new solution and share it with the world), and loosely coordinated (if you have an idea, you share it with collaborators and try to build on existing momentum where possible).

Right now there are at least two interesting experiments within the data rescue domain:

data rescue boston have produced a compelling tool -- a library that works with morph.io
@b5 has written this proof of concept, which is immediately useful for tasks we need to manage.

The most compelling aspect of the work in this current PR is the pattern of using Pull Requests (on github) as a point of quality control and security review before tasks get modified. This allows us to rely on the existing transparency of git Pull Requests and community patterns around github PRs to ensure that the code (and docker containers, etc) that we use are safe, repeatable, and maintained in a transparent fashion. I think this is a very compelling pattern to explore.

It's definitely worth considering DAG-based workflow automation tools like Airflow, Celery, etc. Jenkins is also a good option to consider for the mid-to-long term. If we adopt tools like that, the main thing to carry over from the current proof of concept is this quality-control-via-PRs pattern.

In the meantime let's merge this PR. I don't want long-term considerations to prevent us from landing a proof of concept. Instead we should ship the proof and use it to spur conversation about what should come next.

Previously this code base (which was running as alpha.archivers.space), relied on administrators, aka @b5 and team, to either manually run tasks or set up chron jobs on a server. This PR is a great improvement over that.

What this PR does:

creates a task management tool that uses a queue, thus automating a bunch of stuff that was previously manual while also giving more visibility into the running processes
establishes some of the "task" pattern(s) that are relevant for data together -- especially migrating data to decentralized infrastructure and/or managing data that's already on decentralized infra
makes it possible for communities to use github PRs to manage code and containers that will be used in tasks
gives a basic "taskable" UI that's built around our use cases

If nobody objects, I will merge this PR tomorrow.

flyingzumwalt commented 7 years ago

I should also spin off some GH issues to follow up on the ideas that people surfaced in this thread, so they don't get lost when we close the PR.

mhucka commented 7 years ago

I agree with @flyingzumwalt. In retrospect, I think my comments about existing workflow systems should have gone elsewhere (maybe a separate issue) as they are a bit of a derail w.r.t. this particular PR :-). Sorry about that.

mhucka commented 7 years ago

LOL while I was writing my comments, the last comment by @flyingzumwalt popped up just as I clicked on the green button. Talk about timing.

mhucka commented 7 years ago

I'm willing to start a new issue for this comment of mine, but am unsure how best to copy the content from @flyingzumwalt and @lgierth's follow-ups into the new issue. Do people have preferences or suggestions?

flyingzumwalt commented 7 years ago

I'd just create an issue around something like "build on existing task management tools" and mark with the "enhancement" label.

As far as capturing comments from me and @lgierth you can either quote and cite (example: https://github.com/ipfs/in-web-browsers/issues/7) or you can just cc us and let us add our own comments to the thread.

mhucka commented 7 years ago

OK, you most excellent people, I created an issue per our discussion upthread, and opted to use the quote-and-cite approach because it seemed the most likely to fully document how we got there.

dcwalk commented 7 years ago

rejoining a little late: hey again o/ @lgierth :))

datatogether / task_mgmt

Task pipeline #4