05_03_Pipeline_mnist.ipynb Improvment

aws-samples / eks-kubeflow-workshop

Kubeflow workshop on EKS. Mainly focus on AWS integration examples. Please go check kubeflow website http://kubeflow.org for other examples

Apache License 2.0

96 stars 56 forks source link

05_03_Pipeline_mnist.ipynb Improvment #34

Closed PatrickXYS closed 4 years ago

PatrickXYS commented 4 years ago

Description of changes:

Given current 05_03_Pipeline_mnist.ipynb notebook, we want to improve from below points:

Replace original image with our image host
Add Dockerfile to indicate how we build the image
Add Tensorboard support
Find if we have specific tfop support for tfjob training without utilizing large manifest/yaml file.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Jeffwan commented 4 years ago

Address this issue. https://github.com/aws-samples/eks-kubeflow-workshop/issues/33

@PatrickXYS You may forget to check in the docker file and training code? for No. 1, 2, 3

How about 4, does high level TFJobOp make sense to you? https://github.com/kubeflow/pipelines/blob/master/components/kubeflow/launcher/sample.py

PatrickXYS commented 4 years ago

Hi @Jeffwan, I only replace the original image with mine, and will propose new version of commit to address No 2, 3, 4.

Jeffwan commented 4 years ago

@PatrickXYS Cool. I will leave this PR open and it makes more sense to have these solved together and include them in one PR.

PatrickXYS commented 4 years ago

The new version of PR is trying to address No.1, 2, 3, 4.

I'll still leave the PR open since below issues:

Simply replace tf-export-dir with s3 bucket won't work, since the internal logic didn't write training logs into s3 bucket. Need investigation tomorrow.
If we want to replace mnist training data with s3 mnist training data, this requires massive modification in model.py because original code utilize tf.contrib.learn.datasets.DATASETS['mnist'] function.

cfregly commented 4 years ago

@PatrickXYS It would be super-useful to have a version that allows a user to specify the input data as an S3 bucket (real world) vs. pulling from tf.datasets (not the real world).

Leaving this as a massive effort for users will greatly reduce the chances of transitioning from this sample to a real production use case.

PatrickXYS commented 4 years ago

@cfregly I totally agree with your idea. I'll try to solve this tomorrow, we should be able to see a new PR by the end of the tomorrow.

Jeffwan commented 4 years ago

@PatrickXYS Don't spend too much time on data preprocessing. Let's finish other low hanging issues first and then come back to issue. We should have at least one example to make user easily kick off pipeline without worrying about the dependencies.

Definitely we should have an example to pass training datasets.