elyra-ai / elyra

Elyra extends JupyterLab with an AI centric approach.
https://elyra.readthedocs.io/en/stable/
Apache License 2.0

Investigate removal of remote file dependencies in generic component operators #2805

Open ptitzler opened 2 years ago

ptitzler commented 2 years ago

Is your feature request related to a problem? Please describe.

The current implementations of the generic component operators [1][2] download several files from GitHub that are required to prepare and execute a notebook or script in the Kubeflow Pipelines or Apache Airflow runtime environment. This introduces an additional runtime environment dependency and potential point of failure (if the files are downloaded from the Elyra GitHub repository) or additional setup requirements (if 'local' versions of these files are used, as outlined in [3]).

I'm not familiar with the original motivation to take this approach, but suspect (please correct me if I am wrong) that it might be a result of the fact that the file dependencies were distributed across multiple repositories. It might be worthwhile to investigate if it is possible to remove these remote file dependencies.

Describe the solution you'd like

Since these files are present in the environment where the pipeline is prepared for export/submission, it might be possible to bundle and upload them to object storage (same location as the node artifacts) and retrieve them from there at runtime. Potential benefits include:

[1] https://github.com/elyra-ai/elyra/blob/main/elyra/kfp/operator.py#L195-L200
[2] https://github.com/elyra-ai/elyra/blob/main/elyra/airflow/operator.py#L100-L104
[3] https://elyra.readthedocs.io/en/latest/recipes/running-elyra-in-air-gapped-environment.html
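To make the object storage idea above a bit more concrete, here is a minimal sketch of the bundling half, assuming the dependency files are available as local paths at export/submission time. The function names and the archive name `elyra-runtime-deps.tar.gz` are hypothetical and not part of Elyra; the upload itself is left out because it depends on the object storage client the processor already uses.

```python
import io
import tarfile
from pathlib import Path


def bundle_dependencies(paths, archive_name="elyra-runtime-deps.tar.gz"):
    """Pack local files into an in-memory gzipped tar archive.

    Returns (archive_name, archive_bytes). The archive name is a
    hypothetical placeholder, not an Elyra convention.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in map(Path, paths):
            # Store files flat so the runtime container can extract them
            # into its working directory without recreating local paths.
            archive.add(str(path), arcname=path.name)
    return archive_name, buffer.getvalue()


def list_bundle(archive_bytes):
    """Return the sorted member names of an archive produced above."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        return sorted(archive.getnames())
```

The returned bytes could then be uploaded next to the node artifacts, and the container command would fetch and extract the archive from there instead of downloading each file from GitHub.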

ptitzler commented 2 years ago

Adding the Kubernetes sidecar container pattern as another potential approach to investigate, as suggested by @akchinSTC during today's dev meeting.

The general idea (or at least my interpretation of it) is that Elyra could provide an init container that makes the required file dependencies available via a shared emptyDir. This approach would introduce a bit of maintenance overhead because we would have to publish versions of this container image on DockerHub et al. In air-gapped deployments, administrators would also have to make this image available internally because access to public image registries might not be possible.
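A rough sketch of what that could look like at the pod spec level, written as a plain Python dict transformation so it does not depend on any Kubernetes client. Everything named here is an assumption for illustration: the helper name, the volume name `elyra-deps`, the mount path, and the idea that the image ships the files under `/opt/elyra-deps`.

```python
import copy


def with_dependency_init_container(pod_spec, deps_image, mount_path="/elyra-deps"):
    """Return a copy of pod_spec with an init container that copies the
    dependency files into an emptyDir shared with the main containers."""
    volume = {"name": "elyra-deps", "emptyDir": {}}
    mount = {"name": "elyra-deps", "mountPath": mount_path}
    init_container = {
        "name": "elyra-deps",
        "image": deps_image,
        # Assumption: the image ships the files under /opt/elyra-deps.
        "command": ["sh", "-c", f"cp -r /opt/elyra-deps/. {mount_path}/"],
        "volumeMounts": [dict(mount)],
    }
    patched = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    patched.setdefault("volumes", []).append(volume)
    patched.setdefault("initContainers", []).append(init_container)
    for container in patched.get("containers", []):
        # Every main container sees the copied files at mount_path.
        container.setdefault("volumeMounts", []).append(dict(mount))
    return patched
```

The node's container would then read `bootstrapper.py` and the requirements files from the mount path rather than fetching them with curl.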

shalberd commented 1 year ago

Removing those remote file dependencies would be extremely valuable for air-gapped systems, too. Otherwise, custom CA trust when downloading from an HTTPS server, the need for authentication (think Artifactory with an SSL certificate based on an enterprise PKI, for example), and proxy support all come into play. None of these is covered by the curl calls in the code:

https://github.com/elyra-ai/elyra/blob/466bf0ff005a3f49fac38716cef59f695d0d92ec/elyra/pipeline/kfp/processor_kfp.py#L1119

Plus, adding environment variables through a spawner (e.g. JupyterLab in Open Data Hub, or the new ODH dashboard controller) is tough to accomplish.

The best option really would be to have the files

https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/kfp/pip.conf
https://raw.githubusercontent.com/elyra-ai/elyra/main/elyra/kfp/bootstrapper.py
https://raw.githubusercontent.com/elyra-ai/elyra/main/elyra/airflow/bootstrapper.py
https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra-py37.txt
https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra.txt

present in the Elyra container itself.

shalberd commented 1 year ago

The same authentication and CA trust issue also comes up with, e.g., the Airflow DAG repository location.

GitHub API Endpoint (github_api_endpoint): the good thing is that authentication is covered there (github_personal_access_token), but again, custom CA trust is not covered when dealing with the Python GitHub package and github_repository.create_file

https://github.com/elyra-ai/elyra/blob/8df85b9c3319f16599ea671781ce77b92393be82/elyra/util/github.py#L64

or GitLab

https://github.com/elyra-ai/elyra/blob/8df85b9c3319f16599ea671781ce77b92393be82/elyra/util/gitlab.py

What if the GitLab instance is, e.g., running inside a company, but has a non-publicly-trusted certificate issued by an enterprise SSL CA?

e.g. for GitLab

gl = gitlab.Gitlab(remote_host_url, token, api_version=4, ssl_verify=os.environ['REQUESTS_CA_BUNDLE'])

On OpenShift, there is a way to get a bundle file from the Cluster Network Operator that combines additional trusted CAs and system-level public CAs, via a ConfigMap. The Cluster Network Operator merges the user-provided and system CA certificates into a single bundle and injects it into a ConfigMap that carries the label shown here:

apiVersion: v1
data: {}
kind: ConfigMap
metadata:
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
  name: trusted-ca-bundle 
  namespace: opendatahub

This ConfigMap can then be mounted into the spawned container at a specific location, and the full path to the file (including the filename) can be set in the environment variable REQUESTS_CA_BUNDLE.

But even if we had REQUESTS_CA_BUNDLE set, the question is whether the Python clients for GitLab and GitHub honor it without explicit changes to the code.

https://github.com/python-gitlab/python-gitlab/issues/352

https://github.com/PyGithub/PyGithub/issues/583
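One way to sidestep that question would be to resolve the value once and pass it to each client explicitly. A minimal sketch, assuming a hypothetical helper `resolve_ssl_verify` (not part of Elyra) that falls back to default verification when the variable is unset or empty. To my knowledge python-gitlab accepts a bool or a CA bundle path as `ssl_verify` and PyGithub accepts the same as `verify`, but this should be checked against the client versions Elyra pins.

```python
import os


def resolve_ssl_verify(env=None):
    """Return a value suitable for an explicit verify/ssl_verify argument.

    A non-empty REQUESTS_CA_BUNDLE yields the bundle path; otherwise True,
    meaning "verify against the default trust store".
    """
    env = os.environ if env is None else env
    bundle = env.get("REQUESTS_CA_BUNDLE", "").strip()
    return bundle if bundle else True


# Intended use (not executed here; argument names to be verified against
# the installed client versions):
#   gl = gitlab.Gitlab(remote_host_url, token, api_version=4,
#                      ssl_verify=resolve_ssl_verify())
#   gh = github.Github(token, base_url=github_api_endpoint,
#                      verify=resolve_ssl_verify())
```

Unlike indexing `os.environ` directly, this does not raise a `KeyError` in environments where no custom bundle is configured.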

Taken together, this means all the cool pipeline functionality is, so far, usable only in the public cloud and not in enterprise-internal environments. I am sure we can make this work, though.

In the past, I had assumed the file content of the ConfigMap above only has custom CAs in it, but it also contains the latest up-to-date Red Hat CoreOS publicly trusted CAs (Amazon, Google, GoDaddy, Verisign, you name it).