elyra-ai / elyra

Elyra extends JupyterLab with an AI-centric approach.
https://elyra.readthedocs.io/en/stable/
Apache License 2.0

Optimize shared dependency list among nodes to minimize upload local artifact dependency tar size #577

Open akchinSTC opened 4 years ago

akchinSTC commented 4 years ago

Often multiple notebooks share the same dataset or local file, yet we submit a copy for every notebook. We should optimize by factoring out these common dependencies so we don't have to upload them once per notebook.

kevin-bates commented 4 years ago

It seems like we'd need to walk the pipeline and build a list of all referenced dependencies, then create a tar file for each entry in the list and upload each. NotebookOp (i.e., ContainerOp) would then need to take a list of dependency archives that correspond to that particular operation.
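The walk described above might be sketched roughly as follows. The pipeline structure, operation names, and file paths here are hypothetical placeholders, not Elyra's actual internal representation; the point is just the inversion from "op → dependencies" to "dependency → ops", so each unique entry can be archived and uploaded once:

```python
from collections import defaultdict

# Hypothetical pipeline: each operation declares its file dependencies.
pipeline = {
    "train-op": ["data/inputs.csv", "utils.py"],
    "eval-op": ["data/inputs.csv", "metrics.py"],
}

def collect_dependencies(pipeline):
    """Map each unique dependency path to the operations that reference it."""
    deps = defaultdict(list)
    for op_name, files in pipeline.items():
        for path in files:
            deps[path].append(op_name)
    return dict(deps)

deps = collect_dependencies(pipeline)
# "data/inputs.csv" maps to both ops, so a single archive of it could be
# uploaded once and passed to each corresponding NotebookOp.
```

Each entry in `deps` would then drive one tar-and-upload step, with every op receiving the list of archives it needs.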

One concern with this approach is that a dependency of "inputs.csv" for one Notebook node might be completely different from the "inputs.csv" for another Notebook node, but because dependencies are relative to the notebook, we really have no way to make that determination without, say, computing and comparing checksums or something to that effect.
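The checksum comparison mentioned above could look something like this sketch (file names and the demo setup are illustrative, not part of Elyra): identically named dependencies are treated as the same artifact only when their content hashes match.

```python
import hashlib
import os
import tempfile

def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo: two notebooks each reference a dependency named "inputs.csv";
# relative paths alone can't tell us whether they are the same file.
workdir = tempfile.mkdtemp()
nb_a_inputs = os.path.join(workdir, "nb_a_inputs.csv")
nb_b_inputs = os.path.join(workdir, "nb_b_inputs.csv")
for path in (nb_a_inputs, nb_b_inputs):
    with open(path, "w") as f:
        f.write("x,y\n1,2\n")

# Equal checksums -> safe to archive and upload the dependency once.
can_share = file_checksum(nb_a_inputs) == file_checksum(nb_b_inputs)
```

Keying the upload cache on `(relative_name, checksum)` rather than the name alone would avoid the collision case described in the comment.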

akchinSTC commented 4 years ago

Link to use of s3 schema in pipeline inputs and outputs: https://github.com/kubeflow/pipelines/blob/8014a44229664ebd4f9b6ec69fbb6900f104af85/components/aws/sagemaker/tests/unit_tests/tests/test_batch_transform.py

Most, if not all, of the examples, though, seem specific to AWS, which could be using a custom operator.