elyra-ai / elyra

Elyra extends JupyterLab with an AI centric approach.
https://elyra.readthedocs.io/en/stable/
Apache License 2.0

Support (recursive) folders added as "output files" in a pipeline stage #1250

Closed: romeokienzler closed this issue 3 years ago

romeokienzler commented 3 years ago

Describe the issue

I have a notebook that creates a folder "data" containing subfolders "train" and "val", which in turn contain subfolders of images of different classes. This is a common folder/file structure in image processing with deep learning.

The subsequent (training) notebook/pipeline step needs access to all the images in the same folder structure.

There must be an easy way to just add the "data" folder with all its contents.

Now I have to do the following:

data_small/train/male/*.jpg
data_small/train/female/*.jpg
data_small/val/male/*.jpg
data_small/val/female/*.jpg

In addition, it doesn't work: the subsequent (training) step that wants to read the data gets this error:

"HEAD /rkie/train-trusted-ai-0127102447/data_small/train/male/%2A.jpg HTTP/1.1" 404 0 note that the wildcard is not replaced

Screenshots or log output

"HEAD /rkie/train-trusted-ai-0127102447/data_small/train/male/%2A.jpg HTTP/1.1" 404 0 Traceback (most recent call last): File "bootstrapper.py", line 594, in main() File "bootstrapper.py", line 582, in main file_op.process_dependencies() File "bootstrapper.py", line 102, in process_dependencies self.get_file_from_object_storage(file.strip()) File "bootstrapper.py", line 281, in get_file_from_object_storage self.cos_client.fget_object(bucket_name=self.cos_bucket, File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 719, in fget_object stat = self.stat_object(bucket_name, object_name, sse) File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 1138, in stat_object response = self._url_open('HEAD', bucket_name=bucket_name, File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 2017, in _url_open raise ResponseError(response, minio.error.NoSuchKey: NoSuchKey: message: The specified key does not exist. Runtime execution graph. Only steps that are currently running or have already completed are shown.

Pipeline runtime environment

ptitzler commented 3 years ago

I can't find the issue anymore but I reported the same problem last year.

ptitzler commented 3 years ago

Do note that the approach we currently use to make artifacts available in containers won't work well for large files, or when many files are specified implicitly or explicitly (via wildcard). The reason is that after a notebook or Python script is processed, these files are uploaded to COS and then downloaded from there when a downstream node is processed. This approach can result in a lot of overhead. One way to work around the issue is to support volume mounts, which we've discussed before and need to revisit.

kevin-bates commented 3 years ago

I remember this discussion as well and also can't find the issue.

I agree with your last comment, Patrick. Looking at the code, I think the issue is this:

Inputs to operations are derived from the outputs of previous nodes. We currently handle the upload of directory and wildcarded outputs to COS, albeit expensively, as Patrick points out. However, because the inputs relative to the current operation are also listed as directories/wildcards (since inputs are captured from the specified outputs of all previous operations), minio provides no way to download those items directly (hence the encoding issue). As a result, we'd need to interpret inputs that specify directories and/or wildcards individually, use minio's list_objects() method (which does allow for recursion) to build a superset of candidate files, then apply the appropriate wildcarding to that list to determine which individual files to download.
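A minimal sketch of that download-side expansion, assuming the minio Python client and fnmatch for the wildcard matching (the helper name and its arguments are hypothetical, not Elyra's actual bootstrapper code):

```python
import fnmatch
import os

from minio import Minio


def download_matching(cos_client: Minio, bucket: str, input_spec: str, dest_dir: str):
    """Expand a directory or wildcard input spec by listing the bucket
    recursively, then fetch only the objects that actually match."""
    # Narrow the listing with the longest wildcard-free prefix of the spec.
    prefix = input_spec.split("*", 1)[0].split("?", 1)[0]
    for obj in cos_client.list_objects(bucket, prefix=prefix, recursive=True):
        # A trailing "/" denotes a directory spec: take everything under it.
        if input_spec.endswith("/") or fnmatch.fnmatch(obj.object_name, input_spec):
            cos_client.fget_object(bucket, obj.object_name,
                                   os.path.join(dest_dir, obj.object_name))
```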

If some form of mounting could be accomplished, such that regular filesystem methods could be applied, then, in essence, no input "downloading" would be required; realistically, the specification of output files wouldn't be necessary either, since all operations would be playing on the same field (i.e., a mounted filesystem).

Projects would be an ideal vehicle for this kind of thing where the project specifies the working area.

romeokienzler commented 3 years ago

@kevin-bates if we could come up with something like projects, that would be awesome - just a COS location that all pods mount (we've discussed COS/S3 mounting many times on different occasions)

I also agree with @ptitzler - I've done some tests and it gets incredibly slow to pull 100K files from COS in every pipeline step. What I'm doing now is zipping them up and unzipping them in the notebook - a nice workaround... so far...

kevin-bates commented 3 years ago

> What I'm doing now is zipping them up and unzipping them in the notebook - a nice workaround... so far...

Hmm. Since all inputs are derived from all previous outputs: if we automatically archived all outputs of a given operation (using, say, {operation_id}.tar as the name) and added that archive name (even though it wouldn't exist at that time) to the next operation's inputs, then I think we could essentially perform that logic automatically. Placeholder archives.
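A rough sketch of that placeholder-archive idea (the helper name and its arguments are hypothetical; this is not Elyra's actual code):

```python
import glob
import tarfile


def archive_outputs(operation_id: str, output_patterns: list) -> str:
    """Pack everything matched by an operation's declared output patterns
    into a single per-operation archive; the archive name (known ahead of
    time) is what gets added to the next operation's inputs."""
    archive_name = f"{operation_id}.tar"
    with tarfile.open(archive_name, "w") as tar:
        for pattern in output_patterns:
            for path in glob.glob(pattern, recursive=True):
                tar.add(path)  # preserves the folder structure
    return archive_name
```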

ptitzler commented 3 years ago

From a pipeline modeling point of view, it currently might be advantageous in some scenarios not to perform data downloads (from external sources) in a separate node, and instead download the data in the node (if there's only one) that requires/processes it.

kevin-bates commented 3 years ago

Wouldn't this require explicit (rather than implicit) input specifications?

> not to perform data downloads (from external sources) in a separate node

I'm not following the separate node portion of this comment.

lresende commented 3 years ago

Here is the old issue: https://github.com/elyra-ai/elyra/issues/995

ptitzler commented 3 years ago

> I'm not following the separate node portion of this comment.

Two nodes vs. one:

node 1: downloads data -> node 2: processes data
vs.
node 1: downloads and processes the data

lresende commented 3 years ago

Then I think this is the issue: https://github.com/elyra-ai/elyra/issues/1131

romeokienzler commented 3 years ago

I'd say #1131 and #995 have to be implemented eventually for the best UX.

Unnecessary downloads of data are a real issue: with 10 pipeline runs on a 1GB dataset I've already wasted 10GB of COS storage.

Having experienced COS performance, I'd rather hesitate to implement #1131 and #995, because each read of, e.g., an image boils down to an HTTP request, which happens up front in #995 or on demand in #1131. In both cases the CPU/GPU will sit idle. I guess pulling an archive from COS and unpacking it in the pod is actually faster.

Your take on this?

romeokienzler commented 3 years ago

Let us continue the discussion in #1131 and #995, closing...

lehrig commented 2 years ago

Any progress or ideas related to this issue? I'm missing a general conceptual discussion on this...

Context: I tried to create a simple example where train/val data is structured in folders that contain multiple images, similar to @romeokienzler's use case. I'm hitting exceptions when doing so, even though the Elyra UI implies that working with folders/match patterns should work (that is bad! Make it work or don't list that feature…).

Proposals:

  1. Use explicit dependencies for inputs and outputs. This is crucial for Elyra's general paradigm of structuring ML workflows into independent components. Implicit data fetching is bad for isolating components, given possible side effects and unnecessary data transfer. Alternatively, at least make input dependencies explicit (or derive them from a component's code) and only fetch data if it is actually needed.
  2. Fix the NoSuchKey errors when working with patterns/folders and Minio.
  3. Make volume mounts work as a dedicated construct for exchanging (big) data between components, or come up with a similar, dedicated concept (see the sketch below).

Also note this Kubeflow example code: https://github.com/kubeflow/kfp-tekton/blob/master/samples/kfp-tfx/tfx-taxi-on-prem/TFX-taxi-sample-pvc.py
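For illustration of proposal 3, here is a minimal sketch of the volume-based pattern from that example, using the Kubeflow Pipelines v1 SDK (the PVC name, images, and commands are hypothetical; this is not how Elyra wires pipelines today):

```python
import kfp.dsl as dsl


@dsl.pipeline(name="shared-volume-example")
def pipeline():
    # Create a ReadWriteMany PVC that every step mounts at /data.
    vop = dsl.VolumeOp(
        name="create-shared-volume",
        resource_name="shared-data",
        size="10Gi",
        modes=dsl.VOLUME_MODE_RWM,
    )
    prepare = dsl.ContainerOp(
        name="prepare-data",
        image="python:3.8",
        command=["sh", "-c", "mkdir -p /data/train/male && touch /data/train/male/0.jpg"],
        pvolumes={"/data": vop.volume},
    )
    # The training step sees the full folder tree without any COS transfer.
    train = dsl.ContainerOp(
        name="train",
        image="python:3.8",
        command=["sh", "-c", "ls -R /data"],
        pvolumes={"/data": prepare.pvolume},
    )
```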

ptitzler commented 2 years ago

> Make volume mounts work as a dedicated construct for exchanging (big) data between components, or come up with a similar, dedicated concept.

That is the route we'll most likely take, because the other solutions would not handle large data sets well. See https://github.com/elyra-ai/elyra/issues/1586

For the time being, you'll unfortunately have to create a compressed archive and declare it as an output file to overcome this limitation.
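A minimal sketch of that interim workaround, using only the Python standard library (the file and folder names are just examples):

```python
import shutil

# Producing notebook, last cell: pack ./data into a single archive
# (data.tar.gz) that can be declared as the node's output file.
shutil.make_archive("data", "gztar", root_dir=".", base_dir="data")

# Consuming notebook, first cell: restore the original ./data folder
# structure before any training code runs.
shutil.unpack_archive("data.tar.gz", extract_dir=".")
```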

chanansh commented 2 months ago

The documentation says wildcards are supported: https://elyra.readthedocs.io/en/latest/user_guide/pipelines.html#output-files

> A list of files generated by the notebook inside the image to be passed as inputs to the next step of the pipeline. Specify one file, directory, or expression per line. Supported patterns are * and ?. Example: data/*.csv

But it does not work, as I get: