StatCan / aaw-contrib-jupyter-notebooks

Jupyter Notebooks to be used with Advanced Analytics Workspace platform

Update mapreduce-pipeline example #17

Closed ca-scribner closed 4 years ago

ca-scribner commented 4 years ago

This change refactors the mapreduce-pipeline Kubeflow example for clarity and to use current best practices.

Changes include:

A second related notebook will be created later showing this same use case but with minio for storage (using current best practices).

ca-scribner commented 4 years ago

(deleted my previous comments and rewrote them because they were dumb :) )

Yeah agreed the argument passing is awkward. I don't like it either.

If we remove minio/mc then the example becomes really nice and simple. That would make it easier to maintain and cleaner.

The downside is that it makes the example very trivial. Because of where the data goes when Kubeflow handles the output files itself, I don't think we actually want people doing that. The output from that Kubeflow-managed method is persisted indefinitely in a public but somewhat hidden MinIO store. I should add a note to the demo saying "this is a trivial example, please see X for production-ready ways to transfer results back".

I think for real examples we still need some mc commands (or a Python equivalent) in the container, but we can sort that out in minio-specific examples.

blairdrummond commented 4 years ago

> Because of where the data goes when Kubeflow handles the output files itself, I don't think we actually want people doing that. The output from that Kubeflow-managed method is persisted indefinitely in a public but somewhat hidden MinIO store. I should add a note to the demo saying "this is a trivial example, please see X for production-ready ways to transfer results back".
>
> I think for real examples we still need some mc commands (or a Python equivalent) in the container, but we can sort that out in minio-specific examples.

I think/hope (CC @sylus) that we can make the pipelines default to reading and writing their artifacts from/to a user's bucket in a nicer multi-tenant way. This might rely on something upstream, though. Maybe @sylus can provide an opinion?

I think the best case scenario would be:

  1. The pipelines dump artifacts into a user's personal storage by default.
  2. If they want the data put somewhere specific, they should be able to use mc mv from their notebook (or a comparable Python function) to move the data to the folder they want, if this can't already be done by Kubeflow.
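A minimal sketch of step 2 from a notebook, assuming the MinIO client (mc) is installed and an alias is already configured. The alias and object paths here are hypothetical placeholders, not the platform's actual bucket layout:

```python
import subprocess

def move_result(src: str, dest: str, dry_run: bool = True) -> list[str]:
    # Build (and optionally run) the `mc mv` command that relocates a
    # pipeline artifact. `src`/`dest` use mc's alias/bucket/key form.
    cmd = ["mc", "mv", src, dest]
    if not dry_run:
        # Requires mc on PATH and the alias configured via `mc alias set`.
        subprocess.run(cmd, check=True)
    return cmd

# Dry run: just show the command a user would execute from their notebook.
print(" ".join(move_result("myalias/pipeline-output/result.csv",
                           "myalias/my-bucket/results/result.csv")))
```

The same move could also be done with the MinIO Python SDK instead of shelling out; the CLI form is shown because the thread discusses mc directly.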

I think the moral should be

and

blairdrummond commented 4 years ago

I think it's ok for the pipeline to be nearly trivial. It shows how to do a simple parallel batch job, and if that's made trivial, then that's a win in my books. Loading the data to a specific location could maybe be a second, complementary notebook.

Btw, w.r.t. your JSON parsing stuff, I'd be ok with a helpers.sh file that handles the parsing and provides a simple API, as long as it's obvious which file the user is supposed to read, so that it's obvious to them what they need to change and what they can ignore.

ca-scribner commented 4 years ago

@blairdrummond ok, how about this setup? I pulled all the minio/mc stuff out. I think it reads a lot simpler. Thoughts?

blairdrummond commented 4 years ago

It looks pretty good; I would just make two changes:

  1. I think this code is overkill
    # Assemble arguments passed to the script called by the container op
    arguments = []
    if jsons:
        json_args = list(itertools.chain.from_iterable([("--json", j) for j in jsons]))
        arguments += json_args

I think it hides what the code is actually doing in this example. I think it should go back to

        arguments=[
            '--params', params,
            '--output', output,
        ],

It doesn't have to be so generic, I think it's more valuable for it to be readable. The changes will have to be reflected in the underlying shell scripts too, of course.

  2. Speaking of which, I think we should switch the shell scripts to Python. I think people will find it more familiar, and we can give examples with argparse that way.

If those changes come in, then I think it's definitely ready. (2. is optional)
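If the scripts do move to Python, each step could be a short argparse script. A hedged sketch: the --params/--output flag names follow the ContainerOp arguments quoted above, but the parameter contents and output format are assumptions for illustration:

```python
import argparse
import json

def parse_args(argv=None):
    # Mirrors the ContainerOp arguments in the example: --params and --output.
    parser = argparse.ArgumentParser(description="One step of the mapreduce pipeline")
    parser.add_argument("--params", required=True,
                        help="JSON-encoded parameters for this step")
    parser.add_argument("--output", required=True,
                        help="Path to write this step's result to")
    return parser.parse_args(argv)

# In the container this would be parse_args() on the real command line;
# here we pass an explicit argv for illustration.
args = parse_args(["--params", '{"n_partitions": 4}', "--output", "/tmp/result.json"])
print(json.loads(args.params))
```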

blairdrummond commented 4 years ago

Also, can we ditch utilities.py and make the validate functions one-liners with lambdas and a # names must conform to this regex comment? That will make the notebook a little more portable.
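One way those validators could collapse to one-liners, assuming a naming rule like the comment describes. The regex here is illustrative, not the repo's actual rule:

```python
import re

# names must conform to this regex (lowercase letters, digits, hyphens;
# pattern is illustrative only)
NAME_RE = r"[a-z][a-z0-9-]*"
validate_name = lambda s: re.fullmatch(NAME_RE, s) is not None

print(validate_name("my-step-1"))   # True
print(validate_name("Bad_Name"))    # False
```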

ca-scribner commented 4 years ago

Duplicate of #21