awslabs / kubeflow-manifests

KubeFlow on AWS
https://awslabs.github.io/kubeflow-manifests/
Apache License 2.0

[Doc] Support TensorBoard in Kubeflow Pipelines #118

Open akartsky opened 2 years ago

akartsky commented 2 years ago

The "Support TensorBoard in Kubeflow Pipelines" section of the documentation is outdated: https://www.kubeflow.org/docs/distributions/aws/pipeline/#support-tensorboard-in-kubeflow-pipelines

Outdated Doc :

TensorBoard needs some extra settings on AWS like below:

  1. Create a Kubernetes secret aws-secret in the kubeflow namespace. Follow instructions here.

  2. Create a ConfigMap to store the configuration of TensorBoard on your cluster. Replace <your_region> with your S3 region.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-pipeline-ui-viewer-template
data:
  viewer-tensorboard-template.json: |
    {
        "spec": {
            "containers": [
                {
                    "env": [
                        {
                            "name": "AWS_ACCESS_KEY_ID",
                            "valueFrom": {
                                "secretKeyRef": {
                                    "name": "aws-secret",
                                    "key": "AWS_ACCESS_KEY_ID"
                                }
                            }
                        },
                        {
                            "name": "AWS_SECRET_ACCESS_KEY",
                            "valueFrom": {
                                "secretKeyRef": {
                                    "name": "aws-secret",
                                    "key": "AWS_SECRET_ACCESS_KEY"
                                }
                            }
                        },
                        {
                            "name": "AWS_REGION",
                            "value": "<your_region>"
                        }
                    ]
                }
            ]
        }
    }
  3. Update the ml-pipeline-ui deployment to use the ConfigMap by running kubectl edit deployment ml-pipeline-ui -n kubeflow.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ml-pipeline-ui
  namespace: kubeflow
...
spec:
  template:
    spec:
      containers:
      - env:
        - name: VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH
          value: /etc/config/viewer-tensorboard-template.json
        ....
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
      .....
      volumes:
      - configMap:
          defaultMode: 420
          name: ml-pipeline-ui-viewer-template
        name: config-volume
akartsky commented 2 years ago

https://github.com/kubeflow/kubeflow/issues/6328

akartsky commented 2 years ago

Errors :

These are the errors you might see on the TensorBoard pod when you try to use S3:

1] This error occurs when AWS_REGION is not specified as an environment variable for the pod:

2022-03-17 19:17:23.774900: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError 'PermanentRedirect': The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
2022-03-17 19:17:23.774947: E tensorflow/core/platform/s3/aws_logging.cc:60] HTTP response code: 301

2] This error occurs when the pod does not have permission to access the S3 bucket. If no secrets are provided, the pod defaults to the node IAM role, which will not have S3 access:

2022-03-17 18:51:05.696810: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered AWSError 'AccessDenied': Access Denied
2022-03-17 18:51:05.696857: E tensorflow/core/platform/s3/aws_logging.cc:60] HTTP response code: 403
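
As a rough aid for telling the two failures apart, the sketch below scans TensorBoard pod log text for the AWS error names quoted above and returns the cause described in this comment. The `KNOWN_CAUSES` table and the `diagnose` helper are hypothetical names made up for illustration; they are not part of TensorBoard or Kubeflow.

```python
import re
from typing import List

# Hypothetical mapping from the S3 error names quoted above to the causes
# described in this comment. Not part of TensorBoard or Kubeflow.
KNOWN_CAUSES = {
    "PermanentRedirect": "AWS_REGION is not set on the pod",
    "AccessDenied": "the pod has no credentials with access to the S3 bucket",
}

# Matches both "Unknown AWSError 'X'" and "AWSError 'X'" as seen in the logs.
ERROR_NAME = re.compile(r"AWSError '([A-Za-z]+)'")


def diagnose(log_text: str) -> List[str]:
    """Return one hint per known AWS error name found in the log text."""
    hints = []
    for name in ERROR_NAME.findall(log_text):
        cause = KNOWN_CAUSES.get(name)
        if cause and cause not in hints:
            hints.append(cause)
    return hints
```

The log text can be whatever `kubectl logs` prints for the TensorBoard pod; error names not in the table are ignored.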

Issue :

The current implementation of the TensorBoard controller does not mount AWS secrets and has no ConfigMap for passing environment variables to the TensorBoard pod.

https://github.com/kubeflow/kubeflow/blob/d224549f11b671c2ee9e97380e4525bb698c0a68/components/tensorboard-controller/controllers/tensorboard_controller.go#L252

Workaround :

This is not a good workaround, because you have to repeat it for every TensorBoard pod that you launch.

1] Create an AWS secret in the Kubeflow user namespace (the IAM user should have S3 access to the bucket). Eg:

apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
  namespace: <your_kubeflow_user_namespace>
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <base_64_key>
  AWS_SECRET_ACCESS_KEY: <base_64_secret>
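
The values under `data` must be base64-encoded. A minimal sketch of producing the manifest above from plain-text credentials is shown below; the helper names `b64` and `build_secret` and the placeholder values are assumptions for illustration, not part of any Kubeflow tooling.

```python
import base64
import json


def b64(value: str) -> str:
    """Base64-encode a string, as Secret `data` fields expect."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def build_secret(namespace: str, access_key_id: str, secret_access_key: str) -> dict:
    """Build the aws-secret manifest shown above as a Python dict."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "aws-secret", "namespace": namespace},
        "type": "Opaque",
        "data": {
            "AWS_ACCESS_KEY_ID": b64(access_key_id),
            "AWS_SECRET_ACCESS_KEY": b64(secret_access_key),
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only; substitute your own.
    manifest = build_secret("my-user-namespace", "EXAMPLEKEYID", "examplesecret")
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`. Avoid committing the output, since base64 is an encoding and not encryption.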

2] Launch a TensorBoard from the UI with an S3 object storage link. Eg:

Name : <name_for_your_tensorboard>
Object Storage Link : s3://<your_bucket_name>

The current KF deployment uses TensorBoard version 2.1.0.

3] Edit the deployment for the tensorboard pod that was just created

kubectl edit deployment <name_for_your_tensorboard> -n <your_kubeflow_user_namespace>

Then add the following environment variables to the container (at the same level as args, command and image):

env:
- name: AWS_REGION
  value: <your_s3_bucket_region>
- name: AWS_ACCESS_KEY_ID
  valueFrom:
    secretKeyRef:
      name: aws-secret
      key: AWS_ACCESS_KEY_ID
- name: AWS_SECRET_ACCESS_KEY
  valueFrom:
    secretKeyRef:
      name: aws-secret
      key: AWS_SECRET_ACCESS_KEY

Now, if you open the UI of the TensorBoard that you created, it should be working.
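
Since step 3 has to be repeated for every TensorBoard, it may help to generate the change as a patch instead of editing by hand. The sketch below builds a strategic merge patch containing the same three environment variables. `build_env_patch` is a hypothetical helper, and the container name is an assumption: check the actual name in your deployment before using it, because a strategic merge patch matches containers by name.

```python
import json


def build_env_patch(container_name: str, region: str, secret_name: str = "aws-secret") -> dict:
    """Build a strategic merge patch adding the AWS env vars to one container."""

    def from_secret(key: str) -> dict:
        # Reference a key of the Secret created in step 1.
        return {
            "name": key,
            "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
        }

    env = [
        {"name": "AWS_REGION", "value": region},
        from_secret("AWS_ACCESS_KEY_ID"),
        from_secret("AWS_SECRET_ACCESS_KEY"),
    ]
    return {
        "spec": {
            "template": {
                "spec": {"containers": [{"name": container_name, "env": env}]}
            }
        }
    }


if __name__ == "__main__":
    # "tensorboard" and "us-west-2" are assumed example values.
    print(json.dumps(build_env_patch("tensorboard", "us-west-2"), indent=2))
```

The output can be saved to a file and passed to `kubectl patch deployment <name_for_your_tensorboard> -n <your_kubeflow_user_namespace> --patch-file <file>` on kubectl versions that support `--patch-file`.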

Actual solution :

Make code changes in the TensorBoard controller

1] Modify the TensorBoard controller and provide a configMap input so that users can specify environment variables

2] Mount AWS credentials just like they are currently doing for GCS

Need to work on this PR

surajkota commented 1 year ago

Upstream issue: https://github.com/kubeflow/kubeflow/issues/6493