getindata / kedro-vertexai

Kedro Plugin to support running workflows on GCP Vertex AI Pipelines
https://kedro-vertexai.readthedocs.io
Apache License 2.0

Unable to Define and Use Different Docker Images for Different Tasks in Kedro-Vertex (vertexai.yml) #129

Open 7pandeys opened 11 months ago

7pandeys commented 11 months ago

Problem: I am facing difficulties in Kedro-Vertex when trying to define and use different Docker images for distinct tasks, such as data preprocessing, model training, and model inference.

Expected Behavior: I expect to be able to specify and use separate Docker images for various tasks within my Kedro-Vertex workflow. This flexibility is crucial for optimizing resource utilization and dependencies for different stages of my vertex pipeline.

Current Behavior: I have scoured the documentation and explored the codebase but have not found clear instructions on how to achieve this feature. As a result, I'm uncertain about how to implement different Docker images for different tasks.

Steps to Reproduce:

  1. Set up a Kedro-Vertex project.
  2. Attempt to define different Docker images for preprocessing and training tasks.
  3. Encounter challenges or confusion during the process.

Additional Information:

Environment:

Suggested Solution: It would be incredibly valuable to provide documentation or examples demonstrating how to define and use different Docker images for various tasks within a Kedro-Vertex project. If this feature is not currently supported, it would be helpful to know its status and any potential workarounds.

Related links: https://github.com/getindata/kedro-vertexai/blob/develop/kedro_vertexai/config.py https://kedro-vertexai.readthedocs.io/en/0.9.1/source/02_installation/02_configuration.html

Notes: vertexai.yml is generated by the kedro vertexai init command.

This issue is aimed at improving the flexibility and resource management in Kedro-Vertex by allowing users to define and use different Docker images for different tasks. Your attention to this matter is greatly appreciated.

marrrcin commented 11 months ago

Hi @7pandeys, thanks for raising the issue!

The ability to define and use different Docker images for distinct tasks is not supported by this plugin at this point. The reason for this is that Kedro is primarily focused on creating reproducible pipelines rather than orchestration. As such, there is an assumption of a single Docker image per pipeline to make it easier for the users to use.

We understand that this might not align with your specific, fairly advanced use case, more focused on the orchestration part.

If you’re keen to contribute, we would be happy to help you design this feature and accept a PR - it would add a nice feature to the plugin 🙂

Lasica commented 11 months ago

A potential solution could be based on node/pipeline tags, with an optional tag-to-Docker-image dictionary provided in the config and the default image staying as is.

Lasica commented 11 months ago

Also, @7pandeys, could you expand on why having distinct Docker images for different tasks is important for you? What does it optimize? The only case I can think of is when you want to use different architectures for different computing steps and the code cannot run on both from a single image. Apart from that, isn't it just Docker image size, and thus network bandwidth, that gets optimized here?

7pandeys commented 11 months ago

> A potential solution could be based on node/pipeline tags, with an optional tag-to-Docker-image dictionary provided in the config and the default image staying as is.

Hi @Lasica, can you please elaborate 🙂 ?

Lasica commented 11 months ago

I was referring to a potential implementation of the solution. This use case is not yet supported, as marrrcin stated. I think the feature that adds job grouping based on tags could also differentiate parameters for such groups, using a dictionary in the config that maps tags to those parameters.
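
The tag-based idea above could be sketched roughly as follows. This is only an illustration of the proposal, not something the plugin supports: the `resolve_image` function, the `TAG_IMAGES` mapping, and the image names are all hypothetical, and raising an error when tags disagree is one possible design choice among several.

```python
from typing import Dict, Iterable

# Hypothetical image names, for illustration only.
DEFAULT_IMAGE = "europe-docker.pkg.dev/my-project/kedro/base:latest"
TAG_IMAGES: Dict[str, str] = {
    "training": "europe-docker.pkg.dev/my-project/kedro/train-gpu:latest",
    "inference": "europe-docker.pkg.dev/my-project/kedro/inference:latest",
}


def resolve_image(
    node_tags: Iterable[str],
    tag_images: Dict[str, str],
    default_image: str,
) -> str:
    """Pick the image for a node from its tags, falling back to the default."""
    # Collect every distinct image that the node's tags map to.
    matches = {tag_images[tag] for tag in node_tags if tag in tag_images}
    if len(matches) > 1:
        # Two tags pointing at different images is ambiguous, so fail loudly.
        raise ValueError(f"Conflicting images for tags: {sorted(matches)}")
    return matches.pop() if matches else default_image
```

With this sketch, a node tagged only `training` would get the training image, while an untagged node (or one whose tags are not in the mapping) keeps the default image, so existing single-image projects would behave as before.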

Lasica commented 10 months ago

Could you elaborate on why that feature is needed/useful? @7pandeys

rragundez commented 10 months ago

Hi @Lasica, IMO it is normal, once you get serious about production-ready pipelines, to have different compute architectures underlying different steps and different dependencies for each step. An inference-related step needs much lighter compute and dependencies than a training step, which might need a GPU, specific packages, and an image that is tightly coupled to the compute architecture, or than a pre-processing step, which might need a heavy CPU.

I do understand that having a single image makes things easy for an entry-level starter project, but I do not see how you won't end up with a giant image containing all the dependencies for every step (which might not be compatible with each other), which in turn limits the type of underlying compute that the pipeline can use.

Please let us know if we are misunderstanding the way to use the Kedro VertexAI plugin; any advice on how the community is tackling this problem would be of great help. We are currently evaluating cloud-agnostic tools for model pipelines and deciding whether to build our own internally. This feature is one of the requirements/questions that came up while discussing Kedro.

And maybe one final question: is it possible to set a different machine type per step?

rragundez commented 10 months ago

Happy to evaluate how difficult it would be to add this to Kedro, BTW. We would need to look at the SageMaker side as well; I hope there is a Kedro SageMaker plugin.

marrrcin commented 10 months ago

@rragundez The Kedro-VertexAI plugin allows you to define different compute resources per step: you can specify CPU/memory/GPU requirements as well as use node selectors to pick a specific machine type, e.g.:

# excerpt from vertexai.yml
# see https://kedro-vertexai.readthedocs.io/en/0.9.1/source/02_installation/02_configuration.html

  # Optional section allowing adjustment of the resources
  # reservations and limits for the nodes
  resources:

    # For nodes that require more RAM you can increase the "memory"
    data-import-node:
      memory: 2Gi

    # Training nodes can utilize more than one CPU if the algorithm
    # supports it
    model-training-node:
      cpu: 8
      memory: 60Gi

    # GPU-capable nodes can request 1 GPU slot
    tensorflow-node:
      gpu: 1

    # Resources can be also configured via nodes tag
    # (if there is node name and tag configuration for the same
    # resource, tag configuration is overwritten with node one)
    gpu_node_tag:
      cpu: 1
      gpu: 2

    # Default settings for the nodes
    __default__:
      cpu: 200m
      memory: 64Mi

  # Optional section allowing to configure node selectors constraints
  # like gpu accelerator for nodes with gpu resources.
  # (Note that not all accelerators are available in all
  # regions - https://cloud.google.com/compute/docs/gpus/gpu-regions-zones)
  # and not for all machines and resources configurations - 
  # https://cloud.google.com/vertex-ai/docs/training/configure-compute#specifying_gpus
  node_selectors:
    gpu_node_tag:
      cloud.google.com/gke-accelerator: NVIDIA_TESLA_T4
    tensorflow-step:
      cloud.google.com/gke-accelerator: NVIDIA_TESLA_K80
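
The precedence described in the comments of the excerpt above (node-name settings override tag settings, with `__default__` as the base) can be illustrated with a small Python sketch. It is not the plugin's actual code: the `resources_for` function is hypothetical, and the per-key merge order (default, then tags, then node name) is an assumption inferred from the excerpt's comments.

```python
from typing import Any, Dict, Iterable

# The "resources" section from the excerpt, written as a Python dict.
RESOURCES: Dict[str, Dict[str, Any]] = {
    "data-import-node": {"memory": "2Gi"},
    "model-training-node": {"cpu": 8, "memory": "60Gi"},
    "tensorflow-node": {"gpu": 1},
    "gpu_node_tag": {"cpu": 1, "gpu": 2},
    "__default__": {"cpu": "200m", "memory": "64Mi"},
}


def resources_for(
    node_name: str,
    node_tags: Iterable[str],
    resources: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge settings per key: defaults first, then tags, then the node name."""
    merged = dict(resources.get("__default__", {}))
    for tag in sorted(node_tags):
        # Tag-level settings override the defaults.
        merged.update(resources.get(tag, {}))
    # Node-level settings are applied last, so they override tag settings.
    merged.update(resources.get(node_name, {}))
    return merged
```

Under these assumptions, a node named `tensorflow-node` that also carries the tag `gpu_node_tag` would keep the default memory, take `cpu: 1` from the tag, and end up with `gpu: 1` from its node-name entry rather than the tag's `gpu: 2`.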

Using a different Docker image per step is not supported and is unlikely to be supported, because it goes against the Kedro design and against pipeline reproducibility. Kedro is not an orchestration framework.


As for your second question: https://github.com/getindata/kedro-vertexai/issues/129#issuecomment-1809509752

> Happy to evaluate how difficult would be to add this to Kedro BTW. We would need to see the SageMaker side as well, hope there is a Kedro SageMaker plugin

See: https://github.com/getindata/kedro-sagemaker

Lasica commented 10 months ago

I think it's fair to ask for all parameters that are available in the Vertex AI node config API to be configurable somehow. We should probably take a fresh look at what this configuration should look like, taking into account the upcoming grouping feature, which should also support methods of grouping other than just tags and has to keep the config valid/consistent among the groups.

Such a change, however, could be a breaking change, so let's take time to plan it, with a deprecation period for the old way.