prcastro closed this issue 3 years ago.
Hi. You are absolutely right. I missed this.
I haven't tried out this set of manifests, because the actual manifests I use for our clusters are inherited from the main Kubeflow manifests with significant cluster-specific modifications.
I haven't switched ml-pipeline to IAM yet because of our strict bucket policy (the pipeline folder is hardcoded, and it doesn't meet our policies). I am still running the minio service to store the pipeline templates until my PR to update ml-pipeline is approved and merged. But IAM should work for the rest.
Note that there is a bug in ml-pipeline-ui which I just fixed in v0.1.40, where the IAM session token is not refreshed.
But otherwise it should work, unless I missed something.
This is a good catch. I will update my PR to use the credential provider chain instead.
I can probably provide a forked image with the change if you need it, because I think this will take some time; they seem to be quite busy to review my PR.
You can track my PR: https://github.com/kubeflow/pipelines/pull/2080
Can this bug on the ml-pipeline-ui prevent the metrics to appear on the interface?
@prcastro
It should work the first time, but will fail after the session expires. This applies to any S3 artifacts retrieved from the UI (i.e. Argo artifacts and pod logs).
However, if you mean metrics from the metadata server, those are not affected, as they are stored in a database, not in S3. Only objects stored in S3 (i.e. minio) use my modified client.
But this should be fixed in v0.1.40. We have added unit tests and refactored the code a bit to be cleaner.
You can check out https://hub.docker.com/repository/docker/e2forks/ml-pipeline, which is my forked build of ml-pipeline with the new fix.
I have updated the PR to use a chained credential provider. It tries, in order: the API key in config.json → the MinIO env vars → the AWS env vars → IAM.
I have set up an automated build for this forked branch (for the PR). You should see the build soon.
If you try it, please tell me whether this fixes the issue. I also added flags for region and secure to the MinIO client.
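The lookup order described above can be sketched as follows. This is only an illustration of the fallback chain, not the actual implementation (which is in Go); the function and variable names here are made up:

```python
# Illustrative sketch of the credential provider chain described above.
# Names are hypothetical; the real implementation lives in the Go backend.
import os

def resolve_credentials(config):
    """Return the first credential source that is actually set."""
    # 1. An explicit access key from config.json has the highest priority.
    if config.get("AccessKey"):
        return ("config.json", config["AccessKey"], config.get("SecretAccessKey"))
    # 2. Next, MinIO-style environment variables.
    if os.environ.get("MINIO_ACCESS_KEY"):
        return ("minio-env", os.environ["MINIO_ACCESS_KEY"], os.environ.get("MINIO_SECRET_KEY"))
    # 3. Then AWS environment variables.
    if os.environ.get("AWS_ACCESS_KEY_ID"):
        return ("aws-env", os.environ["AWS_ACCESS_KEY_ID"], os.environ.get("AWS_SECRET_ACCESS_KEY"))
    # 4. Finally, fall back to the IAM instance/pod role (no static keys).
    return ("iam", None, None)

# With a non-empty key in config.json, that key always wins:
print(resolve_credentials({"AccessKey": "minio", "SecretAccessKey": "minio123"})[0])  # -> config.json
```

This is why, later in this thread, the access key has to be emptied out before the chain can fall through to IAM.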
@prcastro
e2forks/ml-pipeline:iam should work now. However, because config.json has priority over environment variables, you need to change the manifest slightly: create a ConfigMap that overwrites the default config.json, so the access key is not used (and the client can fall back to IAM).
I will make a commit later to update the manifest.
I made some changes to the kustomization (changed the MINIO_SERVICE_SERVICE_PORT env var, the image, and the config.json file), and the final ml-pipeline manifest was this:
```yaml
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  annotations:
    iam.amazonaws.com/role: my-role
  labels:
    app: ml-pipeline
  name: ml-pipeline
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: ml-pipeline
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: my-role
      labels:
        app: ml-pipeline
    spec:
      containers:
      - env:
        - name: OBJECTSTORECONFIG_BUCKETNAME
          value: my-bucket
        - name: MINIO_SERVICE_SERVICE_HOST
          value: s3.amazonaws.com
        - name: MINIO_SERVICE_SERVICE_PORT
          value: "80"
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: e2forks/ml-pipeline:iam
        imagePullPolicy: IfNotPresent
        name: ml-pipeline-api-server
        ports:
        - containerPort: 8888
        - containerPort: 8887
        volumeMounts:
        - mountPath: /config/config.json
          name: config-volume
          subPath: config.json
      serviceAccountName: ml-pipeline
      volumes:
      - configMap:
          name: ml-pipeline-config
        name: config-volume
```
While the ml-pipeline-config is a ConfigMap defined as:
```yaml
apiVersion: v1
data:
  config.json: |
    {
      "DBConfig": {
        "DriverName": "mysql",
        "DataSourceName": "",
        "DBName": "mlpipeline",
        "GroupConcatMaxLen": "4194304"
      },
      "ObjectStoreConfig": {
        "AccessKey": "minio",
        "SecretAccessKey": "minio123",
        "BucketName": "mlpipeline",
        "PipelineFolder": "pipelines"
      },
      "InitConnectionTimeout": "6m",
      "DefaultPipelineRunnerServiceAccount": "pipeline-runner"
    }
kind: ConfigMap
metadata:
  name: ml-pipeline-config
  namespace: kubeflow
```
The result was basically the same problem:
```
I0116 22:03:31.753685       7 client_manager.go:136] Initializing client manager
I0116 22:03:31.753815       7 config.go:45] Config DBConfig.ExtraParams not specified, skipping
[mysql] 2020/01/16 22:03:31 packets.go:427: busy buffer
[mysql] 2020/01/16 22:03:31 packets.go:408: busy buffer
E0116 22:03:31.888743       7 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
[mysql] 2020/01/16 22:03:31 packets.go:427: busy buffer
[mysql] 2020/01/16 22:03:31 packets.go:408: busy buffer
E0116 22:03:31.891798       7 db_status_store.go:71] Failed to commit transaction to initialize database status table
[mysql] 2020/01/16 22:03:31 packets.go:427: busy buffer
[mysql] 2020/01/16 22:03:31 packets.go:408: busy buffer
E0116 22:03:31.894787       7 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
F0116 22:03:31.911792       7 client_manager.go:342] Failed to create Minio bucket. Error: The AWS Access Key Id you provided does not exist in our records.
```
Am I missing something?
The config.json takes precedence over environment variables.
You need to set the access key to an empty string before you can use IAM, because if the accessKey is set, it will be used instead.
IAM is the last fallback.
You can now also set the port to an empty string, and set the protocol via the MINIO_SERVICE_SECURE flag.
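Concretely, the ObjectStoreConfig section of the ConfigMap shown earlier would be changed to something like this (a sketch only; the surrounding fields stay as they were):

```json
"ObjectStoreConfig": {
  "AccessKey": "",
  "SecretAccessKey": "",
  "BucketName": "mlpipeline",
  "PipelineFolder": "pipelines"
}
```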
Sorry about that! I fixed those issues and now it is working. The only problem I'm getting now: when I open a successfully executed operation in a pipeline and click an S3 link in the Inputs/Outputs tab, I get the following error:
```
Failed to get object in bucket my-bucket at path runs/743f72cc-c331-4548-81b8-3fcd612c552a/my-pipeline/my-op-metrics.tgz: S3Error: Access Denied
```
Who makes this request? The ml-pipeline-ui? I found the following code:
https://github.com/kubeflow/pipelines/blob/master/frontend/server/handlers/artifacts.ts#L133
But it seems the UI is already using IAM roles to authenticate.
Checking the ml-pipeline-ui logs, it is indeed receiving the request:
```
...
GET /pipeline/artifacts/get?source=s3&bucket=my-bucket&key=runs%2F743f72cc-c331-4548-81b8-3fcd612c552a%2Fmy-pipeline%2Fmy-op-metrics.tgz
Getting storage artifact at: s3: my-bucket/runs/743f72cc-c331-4548-81b8-3fcd612c552a/my-pipeline/my-op-metrics.tgz
```
Did you set the access key for the UI to be empty? It follows the same behavior: if a MinIO access key is provided, it will be used.
```
MINIO_ACCESS_KEY = ''
MINIO_SECRET_KEY = ''
```
By default, it is provided.
I ssh'ed into the UI container, and these are the MINIO_* env vars that I found:
```
/server # env | grep MINIO
MINIO_SERVICE_PORT_9000_TCP=...
MINIO_SERVICE_PORT=...
MINIO_SERVICE_SERVICE_PORT=9000
MINIO_NAMESPACE=kubeflow
MINIO_SERVICE_PORT_9000_TCP_ADDR=...
MINIO_SERVICE_PORT_9000_TCP_PORT=...
MINIO_SERVICE_PORT_9000_TCP_PROTO=...
MINIO_SERVICE_SERVICE_HOST=...
```
I also didn't find any AWS_* env vars there. I checked the file produced by Kustomize as well, and none of those env vars appear in the UI Deployment. Anyway, I'll try to set them manually and see how it goes.
I tested setting the env vars (both the MinIO ones and the AWS ones) and the problem persists. The IAM role the UI is using is the same one I used to write the artifacts there, so it should work.
Sorry, I was confused. The UI handles it differently: there are separate MinIO and AWS configs, and yes, by default the AWS config is empty, so it falls back to IAM.
Can you check whether the file is actually saved to the bucket?
And does your IAM role have getObject permission?
Although the key for the artifact looks suspiciously wrong: it usually should have a folder prefix before it, instead of starting with a run id.
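For reference, a minimal IAM policy granting the read access being discussed might look like this (a sketch; the bucket name is a placeholder, and your policy may need more actions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
```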
Ok, I think I found the bug. It is in ml-pipeline: I introduced some changes which broke the path resolution for another part of the code, so the paths are resolved wrongly.
Let me see if I can fix it. Meanwhile, can you look in your S3 bucket and try querying the UI with the correct key?
Ok I think I fixed it. Building the image and trying again.
Meanwhile, can you look in your S3 bucket and try querying the UI with the correct key?
The path in the S3 bucket seems to be just fine:
```
$ aws s3 ls s3://my-bucket/runs/743f72cc-c331-4548-81b8-3fcd612c552a/my-container/my-op-metrics.tgz
2020-01-16 20:06:38        160 my-op-metrics.tgz
```
This is exactly the same path that appears in the UI.
Testing the new image
The same error is happening. I tried to check whether the kube2iam pod on the same node logged anything when ml-pipeline-ui requested credentials, but that doesn't seem to happen (even for the ml-pipeline requests, which do work).
I also checked, and the role seems to allow getObject on this bucket.
Tomorrow I'll try to debug it further.
@prcastro I just updated the manifest to v0.2.3 of Kubeflow Pipelines.
IAM should work now, as my PR has gone in.
I tried to test, but I'm struggling with #7 . I'll wait for a fix and then I'll try again.
Sorry for the trouble. This is what happens when you code 24 hours straight. Finally fixed everything.
Bugs fixed:
- ml-pipeline API endpoint is properly set in the UI
- metadata envoy endpoint is set in the UI
- Argo is configured in namespace mode instead of cluster mode
- Docker entry point is fixed
- MySQL service selector is fixed
- The default namespace is now also set to kubeflow
We have configured this setup using kube2iam for multiple namespaces on our EKS cluster. Let us know if you still face any issues.
@lzuwei why do you use kube2iam on EKS? Doesn't the service account IAM support meet your need?
Mostly because IAM service accounts for EKS were only introduced a few months ago.
We probably should migrate, but it is not a high priority at the moment.
It should be trivial for you to adapt this repo for IAM service accounts. You just need to annotate the `ml-pipeline-ui` and `ml-pipeline` service accounts, and update the tensorboard pod definition template with the appropriate service account.
If I can find the time, I will add one more overlay for IAM service accounts. You are welcome to make a PR too.
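For the annotation step mentioned above, a sketch of what the service account would look like with IAM roles for service accounts (the role ARN and account id are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-pipeline
  namespace: kubeflow
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-pipeline-role
```

The same annotation would go on the `ml-pipeline-ui` service account.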
I used the IAM overlay to configure my Kubeflow deployment to use my S3 bucket. Just configuring the bucket, prefix, region, and role, I get this for the ml-pipeline deployment:

I also tried changing the port to 80, but the following error appeared:

The default configuration seems to work without problems. My guess is that ml-pipeline doesn't support IAM auth. I was led to believe this since it instantiates the MinIO client explicitly passing the accessKey and the secretKey:
https://github.com/kubeflow/pipelines/blob/dc34a3568d79dd96c908703869596dcf6514bf52/backend/src/apiserver/client/minio.go#L29-L30
Do you know if ml-pipelines supports IAM? If so, how did you achieve IAM authentication in your cluster?