Open kiersten-stokes opened 2 years ago
Since users can already specify public web resources as a source, non-bulk load scenarios are already somewhat supported. Therefore it might be better to defer this until we have a better understanding of how (KFP/AA) users currently manage component specifications.
Some relevant information:
KFP SDK has support for component search and listing: kfp.components.ComponentStore.search.
The Cloud Pipelines Pipeline Editor app has support for GitHub component search as well.
Pipeline Editor also supports syndicated component feeds: https://github.com/Ark-kun/pipeline_components/blob/pipeline_component_feed/pipeline_component_feed.yaml
@kiersten-stokes @ptitzler @akchinSTC I think we can implement a GitHub component catalog in a very similar way to the Artifactory Catalog Connector that will be added in https://github.com/elyra-ai/examples/pull/99.
The gist of the idea is that you have a folder structure like the following:
    component_1/
        __COMPONENT__
        component-1.0.9.yaml
        component-1.0.10.yaml
    component_2/
        hidden_component/
            __COMPONENT__
            component-1.0.0.yaml
            component-1.1.0.yaml
        __COMPONENT__
        component-1.0.0.yaml
        component-1.1.0.yaml
    component_3/
        component-1.0.0.yaml
        component-1.1.0.yaml
The presence of a __COMPONENT__ marker file tells the catalog connector that this folder contains components, and also stops further recursion (so in the above example component_2/hidden_component/ is NOT matched).
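The marker rule can be sketched against a local checkout as follows. This is a minimal illustration, not the connector's code: `find_component_folders` is a hypothetical helper, and counting the root as depth 0 is an assumption about how `max_recursion_depth` is measured.

```python
import os

MARKER = "__COMPONENT__"  # marker file name from the description above


def find_component_folders(root, max_depth=3):
    """Return folders under `root` that contain the marker file.

    Recursion stops at the first marker on each branch and never goes
    deeper than `max_depth` levels below `root` (assumed: root is depth 0).
    """
    found = []

    def visit(folder, depth):
        entries = sorted(os.listdir(folder))
        if MARKER in entries:
            found.append(folder)
            return  # a marker stops further recursion on this branch
        if depth >= max_depth:
            return
        for name in entries:
            child = os.path.join(folder, name)
            if os.path.isdir(child):
                visit(child, depth + 1)

    visit(root, 0)
    return found
```

With the folder structure above, this returns component_1/ and component_2/ only: component_3/ has no marker, and component_2/hidden_component/ is never visited.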
The connector has the following configs which change how the connector traverses the folder structure:
- repository_path: the path to search under in the repo
- max_recursion_depth: the maximum depth to recurse while looking for __COMPONENT__ markers
- max_files_per_folder: the maximum number of files to return per folder
    - use with file_ordering to ensure only the "latest" component version in each folder is returned
- file_filter: unix-like file name filter
    - * : match everything
    - ? : any single character
    - [seq] : character in seq
    - [!seq] : character not in seq
    - [0-9] : any number
- file_ordering: order in which files are processed per folder
    - NAME_ASCENDING, NAME_DESCENDING, VERSION_ASCENDING, VERSION_DESCENDING
    - the VERSION_ orderings compare file names with packaging.version.LegacyVersion(), for example 1.0.10 > 1.0.9 (which is not true if simply considering alphanumeric ordering)
Configs:
=========
git_repo = https://github.com/USERNAME/REPOSITORY.git
git_branch = master
repository_path = /
max_recursion_depth = 3
max_files_per_folder = -1
file_filter = *.yaml
file_ordering = VERSION_DESCENDING
Matched:
=========
./component_1/component-1.0.9.yaml
./component_1/component-1.0.10.yaml
./component_2/component-1.0.0.yaml
./component_2/component-1.1.0.yaml
Notes:
=========
- the `component_3/` files are not matched as this folder does not contain a `__COMPONENT__` marker
- the `component_2/hidden_component/` files are not matched as recursion stops at the first `__COMPONENT__` marker
Configs:
=========
git_repo = https://github.com/USERNAME/REPOSITORY.git
git_branch = master
repository_path = /
max_recursion_depth = 3
max_files_per_folder = 1
file_filter = *.yaml
file_ordering = VERSION_DESCENDING
Matched:
=========
./component_1/component-1.0.10.yaml
./component_2/component-1.1.0.yaml
Notes:
=========
- the `file_ordering` is applied separately within each folder
- as `max_files_per_folder` is `1`, only ONE file from each folder is matched
- as `file_ordering` is `VERSION_DESCENDING`, the file names are ordered as if they are version numbers
(we use `packaging.version.LegacyVersion()` to perform the sort)
- the whole file-name is treated as a version, so "aaaa-1.0.0.yaml" is sorted before "bbbb-9.0.0.yaml"
(take care not to change your file-name prefixes, or alternatively don't include a prefix and use "1.0.0.yaml")
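The per-folder filtering and ordering in the two examples can be approximated with the standard library. This is only a sketch: `select_files` and `version_key` are hypothetical names, and the regex-based key is a stand-in for `packaging.version.LegacyVersion()`, not a reimplementation of it. Treating a negative `max_files_per_folder` as "no limit" is inferred from the first example.

```python
import fnmatch
import re


def version_key(file_name):
    """Split a name into text and integer chunks so 1.0.10 sorts after 1.0.9."""
    parts = re.split(r"(\d+)", file_name)
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def select_files(file_names, file_filter="*.yaml",
                 file_ordering="VERSION_DESCENDING", max_files_per_folder=-1):
    """Filter, order, and truncate the file names of ONE folder."""
    matched = [n for n in file_names if fnmatch.fnmatchcase(n, file_filter)]
    key = version_key if file_ordering.startswith("VERSION_") else None
    matched.sort(key=key, reverse=file_ordering.endswith("_DESCENDING"))
    if max_files_per_folder >= 0:  # negative means "no limit" (assumption)
        matched = matched[:max_files_per_folder]
    return matched
```

Because the whole file name feeds the sort key, a changed prefix changes the ordering, which is the caveat raised in the notes above.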
@thesuperzapper JFYI: Here is the component artifact directory structure that is assumed by certain methods of the Kubeflow Pipelines SDK:
This structure also resembles Docker image versioning (mutable tags and immutable @sha256:... digest versions).
    group1/subgroup_1?/component_1/
        component.yaml  # Latest component version
        versions/
            sha256/
                31df...712f  # Immutable content-hashed component file versions
            tags/
                stable  # Mutable component versions (symlinks or copies)
                latest
                0.0.1
                ...
    pipeline_component_repository.yaml  # marks the location of a repository of Kubeflow Pipelines components
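To make the layout concrete, here is a small sketch of how a name plus an optional tag or digest could map to a file path in this structure. `component_file_path` is a hypothetical helper written for illustration; it is not the KFP SDK's own lookup code, and it assumes the tags/ folder sits under versions/ as drawn above.

```python
def component_file_path(component_name, tag=None, digest=None):
    """Map a component reference to a relative path in the layout above."""
    if digest:
        # Immutable, content-hashed version
        return f"{component_name}/versions/sha256/{digest}"
    if tag:
        # Mutable version (symlink or copy)
        return f"{component_name}/versions/tags/{tag}"
    # No version given: fall back to the latest component file
    return f"{component_name}/component.yaml"
```

For example, `component_file_path("group1/component_1", tag="stable")` gives `group1/component_1/versions/tags/stable`.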
The KFP SDK searches for components using the GitHub API with a query similar to filename:component.yaml (inputValue OR inputPath OR outputPath).
Also note that many components have the canonical_location annotation, which allows assigning a component to a versioned lineage regardless of location:
    # In component.yaml:
    metadata:
      annotations:
        canonical_location: 'https://raw.githubusercontent.com/Ark-kun/pipeline_components/master/components/google-cloud/Vertex_AI/Models/Upload_Tensorflow_model/component.yaml'
@Ark-kun I can't find any documentation on methods that allow versioning/searching for components, do you know where it is?
Also, I have raised https://github.com/kubeflow/pipelines/issues/7832 to propose that KFP natively adds component_id and component_version to the Component YAML spec, if you want to comment there.
> I can't find any documentation on methods that allow versioning/searching for components, do you know where it is?

Here is the documentation for searching in the KFP SDK: https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.ComponentStore.search
And this part describes the directory structure: https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.ComponentStore.load_component
The kfp.components.structures.ComponentReference structure also has some relevance since it has name, digest and tag fields (but url is what's used the most).
The versioning is inspired by the ways that Git and DockerHub work (immutable hash refs, mutable tags).
I won't claim that this KFP SDK feature is used by many people. It's just that this existing structure might be good enough for some of what you want to achieve here.
In practice, this directory schema was rarely used, because GitHub's hashes/branches/tags made a good enough substitute:
https://raw.githubusercontent.com/Ark-kun/pipeline_components/master/components/google-cloud/Vertex_AI/Models/Upload_Tensorflow_model/component.yaml
https://raw.githubusercontent.com/Ark-kun/pipeline_components/0.0.1/components/google-cloud/Vertex_AI/Models/Upload_Tensorflow_model/component.yaml
https://raw.githubusercontent.com/Ark-kun/pipeline_components/31df...712f/components/google-cloud/Vertex_AI/Models/Upload_Tensorflow_model/component.yaml
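The three URLs differ only in the Git ref (branch, tag, or commit hash), so the substitution can be sketched as a single hypothetical helper. `raw_component_url` is an illustrative name, not part of any SDK.

```python
def raw_component_url(owner, repo, ref, component_dir):
    """Build a raw.githubusercontent.com URL for a component.yaml.

    `ref` may be a branch name, a tag, or a commit hash.
    """
    return (
        f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/"
        f"{component_dir}/component.yaml"
    )
```

Passing `"master"`, `"0.0.1"`, or a commit hash as `ref` reproduces the pattern of the three links above.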
The directory structure I described is mostly useful for flat storage locations like S3 or GCS.
One plus of the directory structure compared to GitHub is that GitHub uses unpredictable commit hashes while the directory structure uses hash based on component.yaml content. This allows committing a new component version and a new pipeline version (referencing the new component version) in the same commit.
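The point about predictable hashes can be illustrated with a short sketch: a content hash is computable before anything is committed, so the pipeline can reference it in the same commit. Hashing the UTF-8 text as-is with SHA-256 is an assumption here; the thread does not say whether the KFP SDK normalizes the content first, and both function names are hypothetical.

```python
import hashlib


def content_digest(component_text):
    """SHA-256 hex digest of the component.yaml text (assumed: raw UTF-8 bytes)."""
    return hashlib.sha256(component_text.encode("utf-8")).hexdigest()


def digest_path(component_name, component_text):
    """Relative path of the immutable version file for this content."""
    return f"{component_name}/versions/sha256/{content_digest(component_text)}"
```

The same text always yields the same path, whereas a Git commit hash is only known after the commit is created.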
Note that in the KFP SDK and http://pipeline.studio I search for components using the GitHub API and filename:component.yaml. And this query is already quite broad since we're not the only ones using the component.yaml file name.
Adding a version tag to the file names might complicate search a little bit, which is why I commented.
> I have raised https://github.com/kubeflow/pipelines/issues/7832, to propose that KFP natively adds component_id and component_version to the Component YAML spec, if you want to comment there.

TBH, I'm not sure whether this would be a measurable improvement over having this information in annotations.
The idea of annotations is to provide a pathway for extensibility while maintaining backward and forward compatibility.
It can be used as an experimental playground while the tools are being tested.
Note how canonical_location was added without changing the ComponentSpec schema and without breaking old or new users.
Additionally, I'm not fully sure how the component_version would be handled in a world where a component file can be forked and changed. E.g. what happens when someone forks the component and makes some change, but does not update the version? What if they increase the version a lot? This is why I've used the canonical_location wording. Canonical location points to a repo and branch where the whole component lineage can be discovered and the latest version can be obtained. It also allows changing the "component ID": the latest component yaml file will have canonical_location pointing to another directory/repo.
Is your feature request related to a problem? Please describe.
#2083 set the groundwork for supporting component registries that include multiple components from a single source type (e.g. a directory containing .yaml or .py files). We also would like to support registries that contain component definitions from a GitHub repo.
Describe the solution you'd like
Build out the support for searching through a GitHub repo for component definitions.
Considerations
We will need to figure out how to discriminate between, e.g., files that are component definitions versus files (of the same type) that are not component definitions.
This article may be of use in designing a solution. We may also want to consider using the GitHub API.
Design
The structure of the component registry has already laid the groundwork to support GitHub-based repos and already includes the GitHubComponentReader class, which derives from UrlComponentReader. Only one class method would need to be updated: get_absolute_locations(). Each Reader class has such a method to break potentially multi-valued locations down into their constituent parts. For the GitHub reader, this method will take the list of paths to GitHub repo(s) given in the registry instance metadata and will return a list of paths to each component specification file within that registry.
I believe the lightest-weight implementation of this might include a single call to the GitHub API, specifically of the format:
https://api.github.com/repos/[owner_name]/[repo_name]/contents
Here is the response from the call to a sample component registry repo with 2 component definitions:
The download_url value is what is of interest to us. As with the directory-based registries, only files with the correct file extension for that type of runtime processor (.py for Airflow and .yaml for KFP currently) will be considered. As usual, any files that cannot be successfully parsed for one reason or another are logged and skipped (outside of the get_absolute_locations method).
Limitations:
paths array of the registry
https://github.com/[owner_name]/[repo_name])
I'm open to other ideas for looping through repo files to get content to parse. I think the API makes a lot of sense because it's an easy implementation (only one request per path entry) and keeps things url-based as they should be for a remote resource location. Based on my cursory research, I also don't think other methods would alleviate the limitations cited above for this method.
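A rough sketch of the single-request approach is shown below. It is not Elyra's implementation: all three function names are hypothetical. It relies on the GitHub REST contents endpoint `GET /repos/{owner}/{repo}/contents/{path}`, which for a directory returns a JSON list whose entries carry `name`, `type`, and `download_url` fields. Unauthenticated access and a directory (not single-file) path are assumed.

```python
import json
import urllib.request


def contents_url(owner, repo, path=""):
    """URL of the GitHub REST 'repository contents' endpoint for a directory."""
    return f"https://api.github.com/repos/{owner}/{repo}/contents/{path}".rstrip("/")


def component_locations(entries, extensions=(".yaml",)):
    """Pick download URLs of files with an accepted extension.

    `entries` is the parsed JSON list returned for a directory.
    """
    return [
        entry["download_url"]
        for entry in entries
        if entry.get("type") == "file"
        and entry["name"].endswith(tuple(extensions))
    ]


def fetch_component_locations(owner, repo, path="", extensions=(".yaml",)):
    """One request per path entry; parsing of each file happens elsewhere."""
    with urllib.request.urlopen(contents_url(owner, repo, path)) as response:
        return component_locations(json.load(response), extensions)
```

Keeping the filtering in `component_locations` separate from the network call means the extension rule (.py for Airflow, .yaml for KFP) can be checked without hitting the API.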
Questions:
type key in the GitHub API response will include subdir for any folders