elyra-ai / elyra

Elyra extends JupyterLab with an AI-centric approach.
https://elyra.readthedocs.io/en/stable/
Apache License 2.0

Shared storage between nodes running on local runtime - Unable to find loaded data in generic pipeline #2601

Open dogukanburda opened 2 years ago

dogukanburda commented 2 years ago

Describe the issue

I am following the Introduction to generic pipelines tutorial, and Part 1 - Data Cleaning.ipynb from elyra-ai/examples is unable to find the data downloaded by load_data.ipynb (included in the same tutorial) when the generic pipeline runs on the local runtime.

As far as I understand, when I run the pipeline, load_data.ipynb is connected to a kernel on a node deployed by Jupyter Enterprise Gateway and is able to download the data successfully. But when the second notebook runs, it is assigned to another node deployed by EG (Enterprise Gateway) that does not have a common file storage mounted, and therefore it is unable to find the data needed to proceed.

Shouldn't these notebooks run in environments that have shared storage mounted on each of them?

Persistent storage for each user's JupyterHub instance works totally fine, but when it comes to running a pipeline, the notebook environments do not share any file resources.

To Reproduce

Steps to reproduce the behavior:

  1. Create the generic pipeline with two fundamental nodes (load_data.ipynb and Part 1 - Data Cleaning.ipynb)
  2. Select the Pandas 1.1.1 runtime image on both nodes
  3. Save and Run Pipeline
  4. See error

Screenshots and/or log output

[screenshot attached]

Log Output
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/local/processor_local.py", line 233, in process
    papermill.execute_notebook(
  File "/opt/conda/lib/python3.9/site-packages/papermill/execute.py", line 122, in execute_notebook
    raise_for_execution_errors(nb, output_path)
  File "/opt/conda/lib/python3.9/site-packages/papermill/execute.py", line 234, in raise_for_execution_errors
    raise error
papermill.exceptions.PapermillExecutionError: 
---------------------------------------------------------------------------
Exception encountered at "In [5]":
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/usr/local/bin/kernel-launchers/python/scripts/launch_ipykernel.py in <module>
      1 raw_data = pd.read_csv('data/noaa-weather-data-jfk-airport/jfk_weather.csv',
----> 2                        parse_dates=['DATE'])
      3 raw_data.head()

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    446 
    447     # Create the parser.
--> 448     parser = TextFileReader(fp_or_buf, **kwds)
    449 
    450     if chunksize or iterator:

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    878             self.options["has_index_names"] = kwds["has_index_names"]
    879 
--> 880         self._make_engine(self.engine)
    881 
    882     def close(self):

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1112     def _make_engine(self, engine="c"):
   1113         if engine == "c":
-> 1114             self._engine = CParserWrapper(self.f, **self.options)
   1115         else:
   1116             if engine == "python":

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File data/noaa-weather-data-jfk-airport/jfk_weather.csv does not exist: 'data/noaa-weather-data-jfk-airport/jfk_weather.csv'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/local/processor_local.py", line 99, in process
    operation_processor.process(operation, elyra_run_name)
  File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/local/processor_local.py", line 241, in process
    raise RuntimeError(f'({file_name}) in cell {pmee.exec_count}: ' +
RuntimeError: (Part 1 - Data Cleaning.ipynb) in cell 5: FileNotFoundError [Errno 2] File data/noaa-weather-data-jfk-airport/jfk_weather.csv does not exist: 'data/noaa-weather-data-jfk-airport/jfk_weather.csv'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/tornado/web.py", line 1704, in _execute
    result = await result
  File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/handlers.py", line 120, in post
    response = await PipelineProcessorManager.instance().process(pipeline)
  File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/processor.py", line 134, in process
    res = await asyncio.get_event_loop().run_in_executor(None, processor.process, pipeline)
  File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/local/processor_local.py", line 104, in process
    raise RuntimeError(f'Error processing operation {operation.name} {str(ex)}') from ex
RuntimeError: Error processing operation Part 1 - Data Cleaning (Part 1 - Data Cleaning.ipynb) in cell 5: FileNotFoundError [Errno 2] File data/noaa-weather-data-jfk-airport/jfk_weather.csv does not exist: 'data/noaa-weather-data-jfk-airport/jfk_weather.csv'

The data/ folder seen in the screenshot was generated by running load_data.py as a single node in a pipeline, and the setup validation example in elyra-ai/examples runs just fine.

Expected behavior

The pipeline should produce the appropriate output in the pipeline's working directory.

Deployment information

Deployed Jupyter Enterprise Gateway:

helm install --namespace enterprise-gateway enterprise-gateway https://github.com/jupyter-server/enterprise_gateway/releases/download/v2.6.0/jupyter_enterprise_gateway_helm-2.6.0.tgz

Deployed JupyterHub with the official elyra/elyra:3.6.0 image using Helm v3 with the following command:

helm upgrade --cleanup-on-fail \
  --install jhub jupyterhub/jupyterhub \
  --namespace jupyter \
  --create-namespace \
  --version=1.2.0 \
  --values jupyter-elyra-config.yml

where jupyter-elyra-config.yml contains only the following:

singleuser:
  defaultUrl: "/lab"
  image:
    name: elyra/elyra
    # change to a specific release version as appropriate
    tag: 3.6.0
    # disable this in a production environment
    pullPolicy: "Always"
  storage:
    dynamic:
      storageClass: longhorn
  extraEnv:
    JUPYTER_GATEWAY_URL: http://192.168.122.243:8888   # EG's load balancer IP
    JUPYTER_GATEWAY_REQUEST_TIMEOUT: "120"


kevin-bates commented 2 years ago

You will likely need to configure mounts and mirrorWorkingDirs in your respective kernel-pod.yaml file for each applicable kernel spec.
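
For illustration only, the kind of mount configuration meant here might look roughly like the following in kernel-pod.yaml; the claim name userdir-pvc and the /mnt mount path are assumptions carried over from the comments below, not part of the stock template:

# sketch of a kernel-pod.yaml excerpt; names and paths are illustrative only
spec:
  containers:
    - name: kernel               # container name here is illustrative
      volumeMounts:
        - name: userdir-volume
          mountPath: /mnt
  volumes:
    - name: userdir-volume
      persistentVolumeClaim:
        claimName: userdir-pvc   # assumed PVC, must be reachable from the kernel's namespace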

dogukanburda commented 2 years ago

Thank you for your quick response.

Since kernel-pod.yaml is generated and deployed by the enterprise-gateway pod, I tried to deploy enterprise-gateway with the KERNEL_VOLUME_MOUNT env variable defined. After a deep dive into enterprise-gateway's source, it looked to me like this variable should be provided in the etc/kubernetes/helm/enterprise-gateway/templates/deployment.yaml file, but I couldn't pass KERNEL_VOLUME_MOUNT and KERNEL_VOLUMES as env variables because their structure is not a plain string.

How is it possible to give a YAML-array-like value to an env variable? Sadly, I couldn't find anything on Google.

KERNEL_VOLUME_MOUNT=

  - name: userdir-pvc
    mountPath: "/mnt"

The closest I got is by defining a volumemounts value in the etc/kubernetes/helm/enterprise-gateway/values.yaml file:

volumemounts:
  - name: userdir-pvc
    mountPath: "/mnt"

and appending this to the env section of the etc/kubernetes/helm/enterprise-gateway/templates/deployment.yaml file:

    - name: KERNEL_VOLUME_MOUNT
      value: {{  .Values.volumemounts }}

I obviously get a YAML error trying to render volumemounts as an array this way. Any help would be greatly appreciated.
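
For reference, one common Helm pattern for squeezing an array into an env value is to serialize it into a single string; this is only a sketch, and it assumes whatever consumes KERNEL_VOLUME_MOUNT can parse JSON back out of the value, which EG is not confirmed to do:

    # deployment.yaml sketch: render the values.yaml array as one JSON string
    - name: KERNEL_VOLUME_MOUNT
      value: {{ .Values.volumemounts | toJson | quote }}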

Also, is it really a good choice to iterate over an env variable that has a custom structure such as this one?

kevin-bates commented 2 years ago

This discussion/issue should be moved to the Enterprise Gateway repo as it has nothing to do with Elyra. That said, let me respond in an attempt to perhaps get you moving forward. Should there still be issues (which is likely), please open an issue in EG and we'll go from there.

First of all, I agree that this is a bit of a mess. Because there isn't a good way to parameterize kernel launches (which needs to span the entire Jupyter stack), the best we can do is flow environment variables as the parameters, which, yes, relegates us to encoding more complex types into strings (which is non-trivial) - particularly when the mounts themselves vary per user/kernel launch.

The closest I got is by defining a volumemounts value in the etc/kubernetes/helm/enterprise-gateway/values.yaml file

These mounts will only apply to the EG pod and not to each of the kernel pods. Instead, I recommend you make the necessary adjustments to the kernel-pod.yaml.j2 template that do NOT use KERNEL_ values and get mounts working for your kernel. Once you have that working, you should be able to replace the "varying" portion of that stanza, like the user's home directory, with a templated value (e.g., {{ kernel_home_dir }}) where KERNEL_HOME_DIR can be supplied from the client-side when the kernel is launched.
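
Roughly, that evolution inside kernel-pod.yaml.j2 might look like the sketch below, where kernel_home_dir is the hypothetical template variable mentioned above (fed from a client-supplied KERNEL_HOME_DIR) and the volume name and paths are illustrative:

      volumeMounts:
        # step 1 (hard-coded, to prove the mount works):
        #   - name: userdir-volume
        #     mountPath: /home/jovyan
        # step 2 (varying portion replaced with a templated value):
        - name: userdir-volume
          mountPath: "{{ kernel_home_dir }}"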

To make iterating over the kernel-pod changes easier, it is recommended that you mount the /usr/local/share/jupyter/kernels directory into your EG pod (for which you can edit the helm chart files) so that edits can be made to the respective kernel-pod.yaml.j2 files located in each kernelspec's scripts directory.
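
As a rough, chart-agnostic sketch of that idea, the EG deployment's pod spec would gain something like the following, where kernelspecs-pvc is an assumed claim holding the kernelspecs:

      containers:
        - name: enterprise-gateway
          volumeMounts:
            - name: kernelspecs
              mountPath: /usr/local/share/jupyter/kernels
      volumes:
        - name: kernelspecs
          persistentVolumeClaim:
            claimName: kernelspecs-pvc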

Also, is it really a good choice to iterate over an env variable that has a custom structure such as this one?

No. Per my previous comment, it's all we have. When the number of mounts is consistent across users, I would use the fixed approach with variances described via envs, but when the requirement is that different users require different mounts entirely, then we have to take the "conditional" approach where the complete mount stanzas are encoded. Unless, of course, you have other ideas (which should be discussed over in the EG repo). Thanks.
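
For illustration, the "conditional" approach would end up looking roughly like this in kernel-pod.yaml.j2, assuming a hypothetical kernel_volume_mounts template variable that carries the complete mount stanza flowed in via a KERNEL_-prefixed env as an encoded string:

# kernel-pod.yaml.j2 sketch: render the encoded stanza only when it was supplied
{% if kernel_volume_mounts %}
      volumeMounts:
{{ kernel_volume_mounts | indent(8, true) }}
{% endif %}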