getindata / kedro-vertexai

Kedro Plugin to support running workflows on GCP Vertex AI Pipelines
https://kedro-vertexai.readthedocs.io
Apache License 2.0

Difficulty Defining CPU and GPU Machine Types in Kedro-Vertex (vertexai.yml) #128

Open 7pandeys opened 8 months ago

7pandeys commented 8 months ago

Problem: I'm encountering difficulty in defining the CPU and GPU machine types with respect to nodes and pipelines in vertexai.yml within the Kedro-Vertex framework.

Expected Behavior: I expect to be able to specify the CPU and GPU machine types for nodes and pipelines in vertexai.yml, so that CPU and GPU resources can be utilized effectively as needed.

Current Behavior: I've searched through the documentation and codebase but haven't found clear instructions on how to achieve this. This makes it challenging to optimize the resource utilization for my specific workflow.

Steps to Reproduce:

  1. Create a Kedro-Vertex project.
  2. Attempt to define CPU and GPU types for nodes and pipelines in vertexai.yml.
  3. Encounter difficulties or confusion in the process.

Additional Information:

Environment:

Suggested Solution: It would be helpful to provide more detailed documentation or examples of how to define CPU and GPU machine types for nodes and pipelines in vertexai.yml. Alternatively, if this feature is not yet supported, it would be great to know its current status and any workarounds.

Related links:
https://github.com/getindata/kedro-vertexai/blob/develop/kedro_vertexai/config.py
https://kedro-vertexai.readthedocs.io/en/0.9.1/source/02_installation/02_configuration.html

Notes: the vertexai.yml file is generated by the kedro vertexai init command

This issue aims to improve resource management and clarity within Kedro-Vertex, making it easier for users to define CPU and GPU machine types for their nodes and pipelines. Your attention to this matter is greatly appreciated.

marrrcin commented 7 months ago

Hi @7pandeys, thanks for raising the issue.

The Resources configuration section on the page you've linked contains exactly that information, including how to use GPUs. The initial configuration generated by kedro vertexai init also creates a vertexai.yml that includes an example configuration for nodes with GPUs on Vertex AI.

We're open to improvements on that part - what do you propose?


Config generated by kedro vertexai init

https://github.com/getindata/kedro-vertexai/blob/0bcb35e6dbe4eb2b26c1499065ec68a077e8bb9f/kedro_vertexai/config.py#L57-L78


7pandeys commented 7 months ago

@marrrcin thanks for the response. Is there a specific parameter or syntax that allows us to specify the machine type or CPU platform in the vertexai.yml configuration? If not, what would be the recommended approach to achieve this?

Related links:
https://cloud.google.com/compute/docs/cpu-platforms
https://cloud.google.com/compute/docs/machine-resource

marrrcin commented 7 months ago

Follow this guide, our plugin is fully compatible with this approach: https://cloud.google.com/vertex-ai/docs/pipelines/machine-types
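The guide boils down to requesting CPU, memory, and (optionally) an accelerator, and letting Vertex AI provision a machine that satisfies the request; in this plugin those requests map onto the resources and node_selectors sections of vertexai.yml. A minimal sketch under that assumption (the node name heavy_training is hypothetical):

```yaml
resources:
  # Vertex AI picks a machine type large enough for these requests
  heavy_training:
    cpu: 8
    memory: 32Gi
    gpu: 1

node_selectors:
  # Pins the accelerator model for that node
  heavy_training:
    cloud.google.com/gke-accelerator: NVIDIA_TESLA_T4
```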

7pandeys commented 7 months ago

> Follow this guide, our plugin is fully compatible with this approach: https://cloud.google.com/vertex-ai/docs/pipelines/machine-types

  1. Is it possible to define machine types directly within Kedro-VertexAI without relying on KFP?
  2. If not, are there plans or considerations for enabling this feature in future releases?
  3. Are there recommended workarounds or best practices for specifying machine types when not using KFP?
marrrcin commented 7 months ago

I don't understand your questions. You can configure machine types however you want in vertexai.yml; the plugin exposes the configuration available in native Vertex AI. Whatever you define in the vertexai.yml configuration file will be used by the plugin to set the appropriate CPU/memory/GPU resources and node selectors on the Vertex AI side, so you don't have to use KFP directly.

   resources: 

     # For nodes that require more RAM you can increase the "memory" 
     data_import_step: 
       memory: 4Gi 

     # Training nodes can utilize more than one CPU if the algorithm
     # supports it
     model_training: 
       cpu: 8 
       memory: 8Gi 
       gpu: 1 

     # Default settings for the nodes 
     __default__: 
       cpu: 1000m 
       memory: 2048Mi 

   node_selectors: 
     model_training: 
       cloud.google.com/gke-accelerator: NVIDIA_TESLA_T4 

I suggest you try configuring our plugin first, then see whether it works for you and whether it matches your requirements on that part.
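The __default__ semantics in the YAML above can be summarized as: per-node entries override the defaults key by key, and nodes without an entry fall back to __default__ entirely. An illustrative Python sketch of that merge logic (this is not the plugin's actual implementation, just the override behavior described above):

```python
def resolve_resources(node_name: str, resources: dict) -> dict:
    """Merge a node's resource overrides onto the __default__ entry.

    `resources` mirrors the `resources:` mapping in vertexai.yml.
    """
    merged = dict(resources.get("__default__", {}))
    merged.update(resources.get(node_name, {}))
    return merged


config = {
    "__default__": {"cpu": "1000m", "memory": "2048Mi"},
    "model_training": {"cpu": "8", "memory": "8Gi", "gpu": "1"},
}

# Node with an entry: overrides win, defaults it did not set are kept
print(resolve_resources("model_training", config))
# -> {'cpu': '8', 'memory': '8Gi', 'gpu': '1'}

# Node without an entry falls back to __default__
print(resolve_resources("data_import_step", config))
# -> {'cpu': '1000m', 'memory': '2048Mi'}
```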