kiersten-stokes commented 2 years ago

Is your feature request related to a problem? Please describe.

#2083 set the groundwork for supporting component registries that include multiple components from a single source type (e.g. a directory containing `.yaml` or `.py` files). We would also like to support registries that contain component definitions from a GitHub repo.

Describe the solution you'd like

Build out support for searching through a GitHub repo for component definitions.

Considerations

We will need to figure out how to distinguish files that are component definitions from files of the same type that are not.

This article may be of use in designing a solution. We may also want to consider using the GitHub API.

Design

The structure of the component registry has already laid the groundwork to support GitHub-based repos and includes the GitHubComponentReader class, which derives from UrlComponentReader. Only one class method would need to be updated: get_absolute_locations(). Each Reader class has such a method to break potentially multi-valued locations down into their constituent parts. For the GitHub reader, this method will take the list of paths to GitHub repo(s) given in the registry instance metadata and return a list of paths to each component specification file within that registry.

I believe the lightest-weight implementation of this might include a single call to the GitHub API, specifically of the format:

https://api.github.com/repos/[owner_name]/[repo_name]/contents

Here is the response from the call to a sample component registry repo with 2 component definitions:

[
  {
    "name": "pig_operator.py",
    "path": "pig_operator.py",
    "sha": "499161d1fac3df3f36743630c7799ba4a6aeb250",
    "size": 2707,
    "url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/pig_operator.py?ref=main",
    "html_url": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/pig_operator.py",
    "git_url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/499161d1fac3df3f36743630c7799ba4a6aeb250",
    "download_url": "https://raw.githubusercontent.com/kiersten-stokes/component-registries-airflow/main/pig_operator.py",
    "type": "file",
    "_links": {
      "self": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/pig_operator.py?ref=main",
      "git": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/499161d1fac3df3f36743630c7799ba4a6aeb250",
      "html": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/pig_operator.py"
    }
  },
  {
    "name": "sqllite_operator.py",
    "path": "sqllite_operator.py",
    "sha": "fb4a30e350250359d357fed87525fb6d167b756b",
    "size": 2037,
    "url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/sqllite_operator.py?ref=main",
    "html_url": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/sqllite_operator.py",
    "git_url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/fb4a30e350250359d357fed87525fb6d167b756b",
    "download_url": "https://raw.githubusercontent.com/kiersten-stokes/component-registries-airflow/main/sqllite_operator.py",
    "type": "file",
    "_links": {
      "self": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/sqllite_operator.py?ref=main",
      "git": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/fb4a30e350250359d357fed87525fb6d167b756b",
      "html": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/sqllite_operator.py"
    }
  }
]

The download_url value is what is of interest to us. As with the directory-based registries, only files with the correct file extension for that type of runtime processor (.py for Airflow and .yaml for KFP currently) will be considered. As usual, any files that cannot be successfully parsed for one reason or another are logged and skipped (outside of the get_absolute_locations method).
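A minimal sketch of the two steps described above: mapping a repo URL to the contents endpoint, then keeping only the download_url values with an accepted extension. The helper names (contents_api_url, extract_download_urls) and the sample entries are hypothetical, not part of Elyra, and the HTTP request itself is left out so the filtering logic can be read on its own.

```python
from urllib.parse import urlparse


def contents_api_url(repo_url: str) -> str:
    """Map https://github.com/<owner>/<repo> to the contents API endpoint."""
    parts = [p for p in urlparse(repo_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"Expected https://github.com/<owner>/<repo>, got {repo_url!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"https://api.github.com/repos/{owner}/{repo}/contents"


def extract_download_urls(entries, extensions=(".py",)):
    """Keep the download_url of every file entry with an accepted extension."""
    return [
        entry["download_url"]
        for entry in entries
        if entry.get("type") == "file"
        and entry.get("download_url")
        and entry["name"].endswith(tuple(extensions))
    ]


# Trimmed-down entries in the shape of the response above (illustrative data only)
listing = [
    {
        "name": "pig_operator.py",
        "type": "file",
        "download_url": "https://raw.githubusercontent.com/kiersten-stokes/component-registries-airflow/main/pig_operator.py",
    },
    {"name": "README.md", "type": "file", "download_url": "https://example.com/README.md"},
]

print(contents_api_url("https://github.com/kiersten-stokes/component-registries-airflow"))
print(extract_download_urls(listing))
```

Passing extensions=(".yaml",) would give the KFP behaviour; files that pass the extension filter but fail to parse would still be handled later, outside this step.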

Limitations:

- Requires the repos in question to be public
  - i.e. supporting private repos would necessitate adding authentication metadata fields to each repo path given in the paths array of the registry
  - This would be nice to support eventually, but won't make 3.2.0
- Requires the user to enter the correct repo URL(s) (e.g., https://github.com/[owner_name]/[repo_name])
  - This isn't any different from our current requirements for other URL-based registries, though
- Doesn't necessarily require that the repo contain only component specification files, but we will definitely want to test that any non-component specs picked up for parsing are caught as early as possible, so that we skip the bulk of the parse and don't throw errors

I'm open to other ideas for looping through repo files to get content to parse. I think the API makes a lot of sense because it's an easy implementation (only one request per path entry) and keeps things url-based as they should be for a remote resource location. Based on my cursory research, I also don't think other methods would alleviate the limitations cited above for this method.

Questions:

I'm assuming we will also want to check subdirectories for yamls/operators, as opposed to requiring a flat structure? This shouldn't be too difficult, because the value of the type key in the GitHub API response will be dir for any folders.
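One way the subdirectory walk could look, assuming folders are reported with "type": "dir" as in the GitHub contents API. The list_dir callable is a hypothetical stand-in for "fetch the contents listing of one directory", and FAKE_REPO is made-up data so the sketch runs without network access.

```python
def collect_file_entries(list_dir, path=""):
    """Depth-first walk; list_dir(path) returns the entries of one directory."""
    files = []
    for entry in list_dir(path):
        if entry["type"] == "dir":
            # Recurse using the repo-relative path reported for the folder
            files.extend(collect_file_entries(list_dir, entry["path"]))
        elif entry["type"] == "file":
            files.append(entry)
    return files


# In-memory stand-in for the API: directory path -> entries (illustrative data only)
FAKE_REPO = {
    "": [
        {"name": "pig_operator.py", "path": "pig_operator.py", "type": "file"},
        {"name": "sql", "path": "sql", "type": "dir"},
    ],
    "sql": [
        {"name": "sqllite_operator.py", "path": "sql/sqllite_operator.py", "type": "file"},
    ],
}

paths = [entry["path"] for entry in collect_file_entries(FAKE_REPO.__getitem__)]
print(paths)
```

Note that this costs one request per directory rather than one per path entry, so a depth limit may be worth considering for large repos.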

ptitzler commented 2 years ago

Since users can already specify public web resources as a source, non-bulk load scenarios are already somewhat supported. Therefore it might be better to defer until we have a better understanding of how (KFP/AA) users currently manage component specifications.

akchinSTC commented 2 years ago

See https://github.com/elyra-ai/elyra/issues/2220

Ark-kun commented 2 years ago

Some relevant information:

KFP SDK has support for component search and listing: kfp.components.ComponentStore.search.

The Cloud Pipelines Pipeline Editor app has support for GitHub component search as well.

Pipeline Editor also supports syndicated component feeds: https://github.com/Ark-kun/pipeline_components/blob/pipeline_component_feed/pipeline_component_feed.yaml

thesuperzapper commented 2 years ago

@kiersten-stokes @ptitzler @akchinSTC I think we can implement a GitHub component catalog in a very similar way to the Artifactory Catalog Connector that will be added in https://github.com/elyra-ai/examples/pull/99.

The gist of the idea is that you have a folder structure like the following:

component_1/
   __COMPONENT__
   component-1.0.9.yaml
   component-1.0.10.yaml
component_2/
   hidden_component/
      __COMPONENT__
      component-1.0.0.yaml
      component-1.1.0.yaml
   __COMPONENT__
   component-1.0.0.yaml
   component-1.1.0.yaml
component_3/
   component-1.0.0.yaml
   component-1.1.0.yaml

The presence of a __COMPONENT__ marker file tells the catalog connector that this folder contains components, and also stops further recursion (so in the above example component_2/hidden_component/ is NOT matched).
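The marker rule can be sketched as a short recursive walk. This is an illustration over an in-memory tree (nested dicts for folders, None for files), not the connector's actual code; find_component_folders is a hypothetical name.

```python
MARKER = "__COMPONENT__"


def find_component_folders(tree, prefix="."):
    """tree maps names to nested dicts (folders) or None (files)."""
    if MARKER in tree:
        # Marked folder: report it and do not descend any further
        return [prefix]
    found = []
    for name, child in sorted(tree.items()):
        if isinstance(child, dict):
            found.extend(find_component_folders(child, f"{prefix}/{name}"))
    return found


yaml_files = {"component-1.0.0.yaml": None, "component-1.1.0.yaml": None}
tree = {
    "component_1": {MARKER: None, "component-1.0.9.yaml": None, "component-1.0.10.yaml": None},
    "component_2": {MARKER: None, "hidden_component": {MARKER: None, **yaml_files}, **yaml_files},
    "component_3": dict(yaml_files),
}

print(find_component_folders(tree))
```

Here component_3 is skipped because it has no marker, and component_2/hidden_component is never visited because the walk stops at component_2.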

The connector has the following configs which change how the connector traverses the folder structure:

repository_path: the path to search under in the repo
max_recursion_depth: the maximum depth to recurse while looking for __COMPONENT__ markers
max_files_per_folder: the maximum number of files to return per folder
- _NOTE: can be used with file_ordering to ensure only the "latest" component version in each folder is returned_
file_filter: unix-like file name filter
- * match everything
- ? any single character
- [seq] character in seq
- [!seq] character not in seq
- [0-9] any number
file_ordering: order in which files are processed per folder
- _OPTIONS: NAME_ASCENDING, NAME_DESCENDING, VERSION_ASCENDING, VERSION_DESCENDING_
- NOTE: the version ordering uses something very similar to the very lenient packaging.version.LegacyVersion(), for example 1.0.10 > 1.0.9 (which is not true if simply considering alphanumeric ordering)
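A rough sketch of how file_filter, file_ordering and max_files_per_folder could combine for a single folder. Since packaging.version.LegacyVersion is no longer shipped in recent packaging releases, version_key below is a simplified, hypothetical substitute (digit runs compared as integers), not the connector's real ordering; treating a negative max_files_per_folder as "no limit" is likewise an assumption.

```python
import fnmatch
import re


def version_key(name):
    """Lenient sort key: digit runs compare as integers, the rest as text."""
    tokens = re.findall(r"[0-9]+|[^0-9]+", name)
    return [(0, int(t), "") if t[0] in "0123456789" else (1, 0, t) for t in tokens]


def select_files(names, file_filter="*", file_ordering="NAME_ASCENDING",
                 max_files_per_folder=-1):
    """Filter, order, and truncate the file names of one folder."""
    kept = [n for n in names if fnmatch.fnmatchcase(n, file_filter)]
    kind, _, direction = file_ordering.partition("_")
    kept.sort(key=version_key if kind == "VERSION" else None,
              reverse=(direction == "DESCENDING"))
    # Assumption: a negative limit means "return every matching file"
    return kept if max_files_per_folder < 0 else kept[:max_files_per_folder]


names = ["__COMPONENT__", "component-1.0.9.yaml", "component-1.0.10.yaml"]
print(select_files(names, "*.yaml", "NAME_DESCENDING"))
print(select_files(names, "*.yaml", "VERSION_DESCENDING", max_files_per_folder=1))
```

With name ordering, 1.0.9 sorts above 1.0.10; with version ordering and a limit of one, only the 1.0.10 file is returned.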

GitHub Example 1:

Configs:
=========
git_repo             = https://github.com/USERNAME/REPOSITORY.git
git_branch           = master
repository_path      = /
max_recursion_depth  = 3
max_files_per_folder = -1
file_filter          = *.yaml
file_ordering        = VERSION_DESCENDING

Matched:
=========
./component_1/component-1.0.9.yaml
./component_1/component-1.0.10.yaml
./component_2/component-1.0.0.yaml
./component_2/component-1.1.0.yaml

Notes:
=========
- the `component_3/` files are not matched as this folder does not contain a `__COMPONENT__` marker
- the `component_2/hidden_component/` files are not matched as recursion stops at the first `__COMPONENT__` marker

GitHub Example 2:

Configs:
=========
git_repo             = https://github.com/USERNAME/REPOSITORY.git
git_branch           = master
repository_path      = /
max_recursion_depth  = 3
max_files_per_folder = 1
file_filter          = *.yaml
file_ordering        = VERSION_DESCENDING

Matched:
=========
./component_1/component-1.0.10.yaml
./component_2/component-1.1.0.yaml

Notes:
=========
- the `file_ordering` is applied separately within each folder
- as `max_files_per_folder` is `1`, only ONE file from each folder is matched 
- as `file_ordering` is `VERSION_DESCENDING`, the file names are ordered as if they are version numbers
  (we use `packaging.version.LegacyVersion()` to perform the sort)
- the whole file-name is treated as a version, so "aaaa-1.0.0.yaml" is sorted before "bbbb-9.0.0.yaml"
  (take care not to change your file-name prefixes, or alternatively don't include a prefix and use "1.0.0.yaml")

Ark-kun commented 2 years ago

@thesuperzapper JFYI: Here is the component artifact directory structure that is assumed by certain methods of the Kubeflow Pipelines SDK. This structure also resembles Docker image versioning (mutable tags and immutable @sha256:... digest versions).

group1/subgroup_1?/component_1/
   component.yaml # Latest component version
   versions/
      sha256/
         31df...712f  # Immutable content-hashed component file versions
      tags/
         stable # Mutable component versions (symlinks or copies)
         latest
         0.0.1
...
pipeline_component_repository.yaml # marks the location of a repository of Kubeflow Pipelines components

KFP SDK searches components using GitHub API with a query similar to filename:component.yaml (inputValue OR inputPath OR outputPath).
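For illustration, such a query could be sent to GitHub's code search endpoint roughly as follows. This only builds the request URL; the function name and the optional repo qualifier are assumptions for the sketch and are not taken from the KFP SDK, and the endpoint would also need an authenticated request.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/code"


def component_search_url(repo=None, per_page=100):
    """Build a code-search URL for files named component.yaml."""
    query = "filename:component.yaml (inputValue OR inputPath OR outputPath)"
    if repo:
        # Optional qualifier that scopes the search to one owner/name repo
        query += f" repo:{repo}"
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query, 'per_page': per_page})}"


print(component_search_url(repo="Ark-kun/pipeline_components"))
```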

Also note that many components have the canonical_location annotation which allows assigning a component to versioned lineage regardless of location:

# In component.yaml:
metadata:
  annotations:
    canonical_location: 'https://raw.githubusercontent.com/Ark-kun/pipeline_components/master/components/google-cloud/Vertex_AI/Models/Upload_Tensorflow_model/component.yaml'

thesuperzapper commented 2 years ago

@Ark-kun I can't find any documentation on methods that allow versioning/searching for components, do you know where it is?

Also, I have raised https://github.com/kubeflow/pipelines/issues/7832, to propose that KFP natively adds component_id and component_version to the Component YAML spec, if you want to comment there.

Ark-kun commented 2 years ago

I can't find any documentation on methods that allow versioning/searching for components, do you know where it is?

Here is the documentation for searching in the KFP SDK: https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.ComponentStore.search

And this part describes the directory structure: https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.ComponentStore.load_component

The kfp.components.structures.ComponentReference structure also has some relevance, since it has name, digest and tag fields (but url is what's used the most).

The versioning is inspired by the ways that Git and DockerHub work (immutable hash refs, mutable tags).

I won't claim that this KFP SDK feature is used by many people. It's just that this existing structure might be good enough for some of what you want to achieve here.

In practice, this directory schema was rarely used, because GitHub's hashes/branches/tags made a good enough substitute:

https://raw.githubusercontent.com/Ark-kun/pipeline_components/master/components/google-cloud/Vertex_AI/Models/Upload_Tensorflow_model/component.yaml
https://raw.githubusercontent.com/Ark-kun/pipeline_components/0.0.1/components/google-cloud/Vertex_AI/Models/Upload_Tensorflow_model/component.yaml
https://raw.githubusercontent.com/Ark-kun/pipeline_components/31df...712f/components/google-cloud/Vertex_AI/Models/Upload_Tensorflow_model/component.yaml

The directory structure I described is mostly useful for flat storage locations like S3 or GCS.

One plus of the directory structure compared to GitHub is that GitHub uses unpredictable commit hashes while the directory structure uses hash based on component.yaml content. This allows committing a new component version and a new pipeline version (referencing the new component version) in the same commit.
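A small sketch of why the content-based version is predictable: the digest depends only on the bytes of component.yaml, so it can be computed before anything is committed. That the digest is the hex SHA-256 of the raw file bytes is an assumption here, and versioned_path is a hypothetical helper that mirrors the versions/sha256/ layout shown earlier.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_path(component_dir: str, content: bytes) -> str:
    """Path of the immutable copy: <dir>/versions/sha256/<hex digest>."""
    digest = hashlib.sha256(content).hexdigest()
    return str(PurePosixPath(component_dir) / "versions" / "sha256" / digest)


# Illustrative component file content
spec = b"name: Example component\n"
print(versioned_path("group1/component_1", spec))
```

A pipeline file can embed this path in the same commit that adds the component, whereas a Git commit hash is only known after the commit exists.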

Ark-kun commented 2 years ago

Note that in the KFP SDK and http://pipeline.studio I search for components using the GitHub API and filename:component.yaml. This query is already quite broad, since we're not the only ones using the component.yaml file name. Adding a version tag to the file names might complicate search a little, which is why I commented.

Ark-kun commented 2 years ago

I have raised https://github.com/kubeflow/pipelines/issues/7832, to propose that KFP natively adds component_id and component_version to the Component YAML spec, if you want to comment there.

TBH, I'm not sure whether this would be a measurable improvement over having this information in annotations. The idea of annotations is to provide a pathway for extensibility while maintaining backward and forward compatibility. It can be used as an experimental playground while the tools are being tested.

Note how canonical_location was added without changing the ComponentSpec schema and without breaking old or new users. Additionally, I'm not fully sure how component_version would be handled in a world where a component file can be forked and changed. E.g. what happens when someone forks the component and makes some change, but does not update the version? What if they increase the version a lot? This is why I've used the canonical_location wording. The canonical location points to a repo and branch where the whole component lineage can be discovered and the latest version can be obtained. It also allows changing the "component ID": the latest component yaml file will have canonical_location pointing to another directory/repo.

elyra-ai / elyra

Add support for Github-based component registries #2139
