Closed romeokienzler closed 3 years ago
I can't find the issue anymore but I reported the same problem last year.
Do note that the approach we are currently using to make artifacts available in containers won't work well for large files, or if many files are specified implicitly or explicitly (via wildcard). The reason is that after a notebook or Python script is processed, these files are uploaded to COS and then downloaded from there when a downstream node is processed. This approach can result in a lot of overhead. One way to work around the issue is to support volume mounts, which we've discussed before and need to revisit.
I remember this discussion as well and also can't find the issue.
I agree with your last comment, Patrick. Looking at the code, I think the issue is this:
Inputs to operations are derived from the outputs of previous nodes. We currently handle the upload of directory and wildcarded outputs to COS - albeit expensively, as Patrick points out. However, because the inputs relative to the current operation are also listed as directories/wildcards (since inputs are captured from the specified outputs of all previous operations), minio provides no way to download those items directly (and thus the encoding issue occurs). As a result, we'd need to interpret inputs that specify directories and/or wildcards individually and use minio's list_objects() method (which does allow recursion) to build a superset of candidate files, then apply appropriate wildcarding to that list to determine which individual files to download.
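As a rough illustration of that idea, the following sketch expands a wildcarded input client-side with list_objects() plus fnmatch. It is not Elyra's actual implementation; `client` stands for any minio.Minio-like object, and the bucket/pattern names are made up:

```python
# Sketch: expand a wildcarded input key client-side, because object storage
# cannot interpret "*" or "?" itself. `client` is any minio.Minio-like client.
import fnmatch

def fixed_prefix(pattern: str) -> str:
    """Everything before the first wildcard character, usable as a listing prefix."""
    positions = [pattern.find(c) for c in "*?" if c in pattern]
    return pattern[: min(positions)] if positions else pattern

def expand_input_pattern(client, bucket: str, pattern: str) -> list:
    """Build the superset of candidate keys via recursive listing, then filter."""
    candidates = client.list_objects(bucket, prefix=fixed_prefix(pattern), recursive=True)
    return [obj.object_name for obj in candidates
            if fnmatch.fnmatch(obj.object_name, pattern)]
```

Each key returned by expand_input_pattern() would then be fetched individually, e.g. with client.fget_object(bucket, key, local_path).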
If some form of mounting could be accomplished, such that regular filesystem methods could be applied, then, in essence, no input "downloading" would be required, and, realistically speaking, nor would the specification of output files be necessary - since all operations would be playing on the same field (i.e., mounted filesystem).
Projects would be an ideal vehicle for this kind of thing where the project specifies the working area.
@kevin-bates if we could come up with something like a project, that would be awesome - just a COS location all pods mount (we've discussed COS/S3 mounting many times on different occasions)
Also agree with @ptitzler. I've done some tests, and it is getting incredibly slow to pull 100K files from COS in every pipeline step - what I'm doing now is zipping them up and unzipping them in the notebook, nice workaround... so far...
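For reference, the zip-up/unzip workaround can be sketched like this (folder names and the stand-in data below are illustrative, not from the actual pipeline):

```python
# Sketch of the archive workaround: bundle the whole output folder into one
# archive so only a single object goes through COS between pipeline steps.
import os
import tarfile

# Stand-in for the dataset the first notebook produces.
os.makedirs("data/train/male", exist_ok=True)
with open("data/train/male/img1.jpg", "wb") as f:
    f.write(b"fake image bytes")

# Producer notebook: declare only "data.tar.gz" as the node's output file.
with tarfile.open("data.tar.gz", "w:gz") as tar:
    tar.add("data")  # recursively adds data/train, data/val, ...

# Consumer notebook: declare "data.tar.gz" as its input, then unpack it.
with tarfile.open("data.tar.gz", "r:gz") as tar:
    tar.extractall("unpacked")
```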
> what I'm doing now is zipping them up and unzip them in the notebook, nice workaround....so far...
Hmm. Since all inputs are derived from all previous outputs, and if we automatically archived all outputs for a given operation (using say {operation_id}.tar as the name) and added that archive name (even though it wouldn't exist at that time) to the _next_operation's inputs, then I think we could essentially perform that logic automatically. Placeholder archives.
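The "placeholder archive" idea could look roughly like this at pipeline-construction time. This is a hypothetical sketch; `wire_inputs` and the operation dict shape are illustrative, not Elyra's actual internals:

```python
# Sketch: rewrite each operation's wiring so the next operation's inputs
# reference a predictable archive name, even though the archive only comes
# into existence after the previous operation finishes and uploads to COS.
def archive_name(operation_id: str) -> str:
    return f"{operation_id}.tar"

def wire_inputs(operations: list) -> None:
    """operations: ordered list of dicts with 'id', 'outputs', 'inputs' keys."""
    for prev, nxt in zip(operations, operations[1:]):
        if prev["outputs"]:
            # Placeholder: the runtime creates {prev_id}.tar from prev's
            # outputs after prev completes, just before the COS upload.
            nxt["inputs"].append(archive_name(prev["id"]))
```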
From a pipeline modeling point of view it currently might be advantageous in some scenarios to not perform data downloads (from external sources) in a separate node and instead download data in the node (if there's only one) that requires/processes the data.
Wouldn't this require explicit (rather than implicit) input specifications?
> not perform data downloads (from external sources) in a separate node
I'm not following the separate node portion of this comment.
Here is the old issue: https://github.com/elyra-ai/elyra/issues/995
> I'm not following the separate node portion of this comment.
Two nodes vs. one:
node 1: downloads data -> node 2: processes data
vs.
node 1: downloads and processes data
Then I think this is the issue: https://github.com/elyra-ai/elyra/issues/1131
I'd say #1131 and #995 have to be implemented eventually for best UX
Unnecessary downloads of data are a real issue: with 10 pipeline runs on a 1 GB dataset, I've already wasted 10 GB of COS storage.
Having experienced COS performance, I'd rather hesitate to implement #1131 and #995, because each read of e.g. an image boils down to an HTTP request, which happens up front in #995 or on demand in #1131. In both cases the CPU/GPU will sit idle. I guess pulling an archive from COS and unpacking it in the pod is actually faster.
your take on this?
Let us continue the discussion in #1131 and #995, closing...
Any progress or ideas on the issues related to this one? I'm missing a general conceptual discussion on this...
Context: Tried to create a simple example where train/val data is structured in folders that contain multiple images, similar to @romeokienzler ’s use case. Hitting exceptions when doing so, even though Elyra UI implies working with folders/match patterns should work (that is bad! make it work or don’t list that feature…).
Proposals:
Also note this Kubeflow example code: https://github.com/kubeflow/kfp-tekton/blob/master/samples/kfp-tfx/tfx-taxi-on-prem/TFX-taxi-sample-pvc.py
Make volume mounts work as a dedicated construct to exchange (big) data between components, or come up with a similar, dedicated concept.
That is the most likely route we'll probably take because the other solutions would not handle large data sets well. See https://github.com/elyra-ai/elyra/issues/1586
For the time being you'll unfortunately have to create a compressed archive and declare that as output file to overcome this limitation.
The documentation says wildcards are supported... https://elyra.readthedocs.io/en/latest/user_guide/pipelines.html#output-files
> A list of files generated by the notebook inside the image to be passed as inputs to the next step of the pipeline. Specify one file, directory, or expression per line. Supported patterns are * and ?. Example: data/*.csv
But it does not work, as I get the error below.
Describe the issue
I have a notebook creating a folder "data" which contains subfolders "train" and "val", which contain subfolders of images of different classes - this is a common folder/file structure used in image processing with deep learning.
The subsequent (training) notebook/pipeline step needs access to all the images in the same folder structure
There must be an easy way to just add the "data" folder with all its contents.
Now I have to do the following:
data_small/train/male/*.jpg
data_small/train/female/*.jpg
data_small/val/male/*.jpg
data_small/val/female/*.jpg
In addition, it doesn't work, since the subsequent (training) step wanting to read the data gets this error:
"HEAD /rkie/train-trusted-ai-0127102447/data_small/train/male/%2A.jpg HTTP/1.1" 404 0 note that the wildcard is not replaced
Screenshots or log output
"HEAD /rkie/train-trusted-ai-0127102447/data_small/train/male/%2A.jpg HTTP/1.1" 404 0 Traceback (most recent call last): File "bootstrapper.py", line 594, in
main()
File "bootstrapper.py", line 582, in main
file_op.process_dependencies()
File "bootstrapper.py", line 102, in process_dependencies
self.get_file_from_object_storage(file.strip())
File "bootstrapper.py", line 281, in get_file_from_object_storage
self.cos_client.fget_object(bucket_name=self.cos_bucket,
File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 719, in fget_object
stat = self.stat_object(bucket_name, object_name, sse)
File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 1138, in stat_object
response = self._url_open('HEAD', bucket_name=bucket_name,
File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 2017, in _url_open
raise ResponseError(response,
minio.error.NoSuchKey: NoSuchKey: message: The specified key does not exist.
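The %2A in the failing request is just the URL-encoded asterisk: the wildcard pattern is sent to object storage verbatim instead of being expanded client-side, so the HEAD request targets a key that does not exist. A quick standard-library check (key taken from the log above):

```python
# Demonstrates why the failing key contains %2A: the "*" from the input
# pattern is percent-encoded into the object key rather than expanded.
from urllib.parse import quote

key = "data_small/train/male/*.jpg"
encoded = quote(key)
print(encoded)  # data_small/train/male/%2A.jpg
```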
[Screenshots: runtime execution graph (only running or completed steps shown) and pipeline runtime environment]