NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Unable to add AWS S3 remote backend or attach other AIS clusters #137

Closed jyuwei closed 1 year ago

jyuwei commented 1 year ago

Setup

Environment: K8s
Deploy method: ais-k8s operator
Operator image used: aistorage/ais-operator:0.94
AIStore cluster custom resource:

apiVersion: ais.nvidia.com/v1beta1
kind: AIStore
metadata:
  name: aistore
spec:
  # Add fields here
  awsSecretName: "aws-s3-secret"
  configToUpdate:
    backend:
      conf:
        aws: {}
          #cloud_region: "us-east-1"
          #endpoint: "s3://694596843551.s3-control.us-east-1.amazonaws.com"
  size: 1
  proxySpec:
    servicePort: 51080
    portPublic: 51080
    portIntraControl: 51081
    portIntraData: 51082

  targetSpec:
    servicePort: 51081
    portPublic: 51081
    portIntraControl: 51082
    portIntraData: 51083

    mounts:
      - path: "/ais/sdc"
        size: 100Gi
      - path: "/ais/sdd"
        size: 100Gi
      - path: "/ais/sde"
        size: 100Gi
      - path: "/ais/sdf"
        size: 100Gi
      - path: "/ais/sdg"
        size: 100Gi

    # In certain environments (e.g. minikube), storage volumes attached to AIS targets may not have associated block devices.
    # Alternatively, AIS targets may "see" multiple mountpath directories sharing a single given block device.
    # In both of those cases, set allowSharedNoDisks = true (but note that this setting is **not recommended** for production).
    allowSharedNoDisks: true

  nodeImage: "aistorage/aisnode:3.17"
  initImage: "aistorage/ais-init:latest"
  hostpathPrefix: "/etc/ais"

  # To be able to access the AIS deployment outside kubernetes cluster, set:
  # enableExternalLB: true
  # NOTE: For external access to work, the kubernetes cluster should have the capability to create LoadBalancer services with valid external IP.
  # For local testing with `minikube` run `minikube tunnel` in background for emulation. ref: https://minikube.sigs.k8s.io/docs/commands/tunnel/
  enableExternalLB: false

Notes

Issues

Unable to add S3 remote backend

Using the aisnode-debug pod, I tried to add S3 as a remote backend, but received the following error: Error: aws-error[MissingRegion: could not find region configuration]

root@60d9ee33e789:/ # kubectl --kubeconfig /root/.kube/ais1-us-west -n ais exec -it aisnode-debug -- /bin/bash
root@aisnode-debug:/# export AIS_ENDPOINT=http://aistore-proxy:51080
root@aisnode-debug:/# ais cluster show
PROXY        MEM USED(%)     MEM AVAIL   LOAD AVERAGE    UPTIME  K8s POD         STATUS  VERSION     BUILD TIME
p[rlMBSvXn][P]   0.17%       29.57GiB    [0.1 0.1 0.2]   7d11h   aistore-proxy-0     online  3.17.094eb3d    2023-04-14T14:44:50+0000

TARGET       MEM USED(%)     MEM AVAIL   CAP USED(%)     CAP AVAIL   LOAD AVERAGE    REBALANCE   UPTIME  K8s POD         STATUS  VERSION     BUILD TIME
t[SpeBQnwq]  0.18%       29.57GiB    0%      496.114GiB  [0.1 0.1 0.2]   -       7d11h   aistore-target-0    online  3.17.094eb3d    2023-04-14T14:44:50+0000

Summary:
   Proxies:     1
   Targets:     1
   Cluster Map:     version 4, UUID rhdJDckM8, primary p[rlMBSvXn]
   Deployment:      K8s
   Status:      2 online
   Rebalance:       n/a
   Authentication:  disabled
   Version:     3.17.094eb3d
   Build:       2023-04-14T14:44:50+0000
root@aisnode-debug:/# ais config cluster backend.conf='{"aws":{"cloud_region": "us-east-1", "endpoint": "s3://694596843551.s3-control.us-east-1.amazonaws.com"}}'
PROPERTY     VALUE
backend.conf     map[aws:map[cloud_region:us-east-1 endpoint:s3://694596843551.s3-control.us-east-1.amazonaws.com]]

Cluster config updated
root@aisnode-debug:/# ais config cluster backend.conf -j

    "backend": {"aws":{"cloud_region":"us-east-1","endpoint":"s3://694596843551.s3-control.us-east-1.amazonaws.com"}}

root@aisnode-debug:/# ais ls --all
Error: aws-error[MissingRegion: could not find region configuration]
root@aisnode-debug:/#

aistore-proxy pod log message:

E 18:26:21.946674 err.go:854 aws-error[MissingRegion: could not find region configuration]: GET /v1/buckets (called by p[rlMBSvXn]) (p[rlMBSvXn]: htrun.go:1106 <- proxy.go:2040 <- proxy.go:549 <- proxy.go:378])

Unable to attach a remote AIS cluster

I used the same method to set up another AIStore cluster on K8s, ais2. But when trying to attach it as a remote cluster on ais1, the remote cluster was not added, even though the CLI returned a success message:

root@60d9ee33e789:/ # kubectl --kubeconfig /root/.kube/ais1-us-west -n ais exec -it aisnode-debug -- /bin/bash
root@aisnode-debug:/# export AIS_ENDPOINT=http://aistore-proxy:51080
root@aisnode-debug:/#
root@aisnode-debug:/#
root@aisnode-debug:/# curl https://ais2.xxxxxx.com/v1/cluster?what=stats
{"proxy":{"snode":{"public_net":{"node_ip_addr":"10.244.0.14","daemon_port":"51080","direct_url":"http://10.244.0.14:51080"},"intra_data_net":{"node_ip_addr":"aistore-proxy-0.aistore-proxy.ais.svc.cluster.local","daemon_port":"51082","direct_url":"http://aistore-proxy-0.aistore-proxy.ais.svc.cluster.local:51082"},"intra_control_net":{"node_ip_addr":"aistore-proxy-0.aistore-proxy.ais.svc.cluster.local","daemon_port":"51081","direct_url":"http://aistore-proxy-0.aistore-proxy.ais.svc.cluster.local:51081"},"daemon_type":"proxy","daemon_id":"lCQwgIIk","flags":0},"tracker":{"up.ns.time":736250000763562,"get.ns":0,"get.n":6,"kalive.ns":1161980,"put.n":8,"lst.n":9,"lst.ns":89865190},"capacity":{"Mountpaths":null,"pct_max":0,"pct_avg":0,"cs_err":""}},"target":{"OfdqNTnw":{"snode":{"public_net":{"node_ip_addr":"10.244.0.15","daemon_port":"51081","direct_url":"http://10.244.0.15:51081"},"intra_data_net":{"node_ip_addr":"aistore-target-0.aistore-target.ais.svc.cluster.local","daemon_port":"51083","direct_url":"http://aistore-target-0.aistore-target.ais.svc.cluster.local:51083"},"intra_control_net":{"node_ip_addr":"aistore-target-0.aistore-target.ais.svc.cluster.local","daemon_port":"51082","direct_url":"http://aistore-target-0.aistore-target.ais.svc.cluster.local:51082"},"daemon_type":"target","daemon_id":"OfdqNTnw","flags":0},"tracker":{"disk.sde.write.bps":0,"disk.sdd.write.bps":0,"disk.sdd.avg.wsize":0,"disk.sdg.util":0,"put.ns":3927378,"disk.sde.read.bps":0,"lst.ns":63196837,"disk.sdg.read.bps":0,"get.ns":0,"disk.sdd.util":0,"disk.sdf.util":0,"append.ns":0,"disk.sdd.read.bps":0,"disk.sde.util":0,"disk.sdg.avg.wsize":0,"disk.sdc.read.bps":0,"disk.sde.avg.rsize":0,"disk.sdc.write.bps":0,"put.bps":5200,"disk.sdf.write.bps":0,"disk.sdc.avg.wsize":0,"disk.sdg.write.bps":0,"get.bps":0,"dl.ns":0,"kalive.ns":111297415399,"dsort.creation.req.ns":0,"disk.sdf.read.bps":0,"disk.sdd.avg.rsize":0,"get.size":0,"put.redir.ns":1450570,"put.n":2,"put.size":5200,"up.ns.time":736240000722673,"
disk.sdf.avg.rsize":0,"lst.n":9,"disk.sdc.avg.rsize":0,"disk.sdc.util":0,"dsort.creation.resp.ns":0,"get.redir.ns":0,"disk.sdg.avg.rsize":0,"disk.sde.avg.wsize":0,"disk.sdf.avg.wsize":0},"capacity":{"Mountpaths":{"/ais/sdc":{"used":"782204928","avail":"106539548672","pct_used":0,"disks":["sdd"],"fs":"/dev/sdd(xfs)"},"/ais/sdd":{"used":"782192640","avail":"106539560960","pct_used":0,"disks":["sdf"],"fs":"/dev/sdf(xfs)"},"/ais/sde":{"used":"782192640","avail":"106539560960","pct_used":0,"disks":["sde"],"fs":"/dev/sde(xfs)"},"/ais/sdf":{"used":"782196736","avail":"106539556864","pct_used":0,"disks":["sdg"],"fs":"/dev/sdg(xfs)"},"/ais/sdg":{"used":"782188544","avail":"106539565056","pct_used":0,"disks":["sdc"],"fs":"/dev/sdc(xfs)"}},"pct_max":0,"pct_avg":0,"cs_err":""}}
}}
root@aisnode-debug:/# ais cluster remote-attach ais2=https://ais2.xxxxxx.com
Remote cluster (ais2=https://ais2.xxxxxx.com) successfully attached
root@aisnode-debug:/# ais config cluster backend.conf -j

    "backend": {"ais":{"ais2":["https://ais2.xxxxx.xom"]},"aws":{"cloud_region":"us-east-1","endpoint":"s3://xxxxxxxx.s3-control.us-east-1.amazonaws.com"}}

root@aisnode-debug:/# ais show remote-cluster
UUID  URL  Alias  Primary  Smap  Targets  Uptime
root@aisnode-debug:/# ais ls ais

root@aisnode-debug:/#

aistore-proxy pod log message:

I 18:35:43.896885 prxclu.go:1328 p[rlMBSvXn]: attach remote cluster [alias ais2 => https://ais2.xxxxxx.com]
W 18:35:43.898171 config.go:1247 control and data share one intra-cluster network (aistore-proxy-0.aistore-proxy.ais.svc.cluster.local)
I 18:35:43.899204 metasync.go:420 p[rlMBSvXn]: sync Conf v9, action attach
E 18:35:45.903347 proxy.go:3033 p[rlMBSvXn]: retrying remais ver=0 (4 attempts)
E 18:35:47.904997 proxy.go:3033 p[rlMBSvXn]: retrying remais ver=0 (3 attempts)
E 18:35:49.907620 proxy.go:3033 p[rlMBSvXn]: retrying remais ver=0 (2 attempts)
E 18:35:51.910150 proxy.go:3033 p[rlMBSvXn]: retrying remais ver=0 (1 attempts)
E 18:35:53.912354 proxy.go:3033 p[rlMBSvXn]: retrying remais ver=0 (0 attempts)
I 18:35:53.912377 proxy.go:3038 p[rlMBSvXn]: remais v0 => v0
saiprashanth173 commented 1 year ago

K8s secret aws-s3-secret was created beforehand with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION

@jyuwei do you mind sharing how your AWS secret looks? Currently, we only support secrets in the AWS credentials file format. Something that looks like:

[default]
aws_access_key_id=<>
aws_secret_access_key=<>
region=<>

Once you have your credentials file in this format, you can create your secret using:

$ kubectl create secret generic aws-s3-secret --from-file=credentials=<path to your creds file in above format>

Unfortunately, we don't support providing/overriding the region and endpoint using ais config: ais config cluster backend.conf='{"aws":{"cloud_region": "us-east-1", "endpoint": "s3://694596843551.s3-control.us-east-1.amazonaws.com"}}'. One option is to provide an AWS config file along with the credentials, if you plan on having additional AWS-specific config:

$ kubectl create secret generic aws-s3-secret --from-file=credentials=<path to your creds file in above format> --from-file=config=<path to file containing your aws config>
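The two commands above can be sketched end-to-end. A minimal sketch; file paths and placeholder values are assumptions, not taken from the issue:

```shell
# Hedged sketch: write the AWS shared credentials and config files in the
# INI format the operator expects, then pack both into the secret.
mkdir -p /tmp/aws-secret

cat > /tmp/aws-secret/credentials <<'EOF'
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
region = us-east-1
EOF

cat > /tmp/aws-secret/config <<'EOF'
[default]
region = us-east-1
output = json
EOF

# Requires a live cluster; shown here for illustration only:
# kubectl create secret generic aws-s3-secret \
#   --from-file=credentials=/tmp/aws-secret/credentials \
#   --from-file=config=/tmp/aws-secret/config
```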
jyuwei commented 1 year ago

@saiprashanth173 - Thanks for the response. I had created an Opaque-type K8s secret with kubectl apply -f <path_to_aws_s3_secret_resource>:

apiVersion: v1
data:
  AWS_ACCESS_KEY_ID: QUxxxxxxxx=
  AWS_DEFAULT_REGION: dXMtd2VzdC0x
  AWS_SECRET_ACCESS_KEY: a3xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx==
kind: Secret
metadata:
  name: aws-s3-secret
type: Opaque
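As a sanity check, the values of an existing Opaque secret can be verified by base64-decoding them. Against a live cluster this would go through kubectl (a sketch, assuming the secret name and namespace above); here the unmasked region value from the manifest is decoded directly:

```shell
# Against a live cluster (illustrative):
#   kubectl -n ais get secret aws-s3-secret \
#     -o jsonpath='{.data.AWS_DEFAULT_REGION}' | base64 -d
# Decoding the value from the manifest above directly:
printf 'dXMtd2VzdC0x' | base64 -d; echo   # -> us-west-1
```

Note that this decodes to us-west-1, while the backend config set earlier via ais config uses cloud_region us-east-1; it may be worth double-checking which region is intended.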

I will try the kubectl create secret generic aws-s3-secret --from-file=credentials= approach as you suggested.

A couple of follow-up questions:

Thanks again.

alex-aizman commented 1 year ago

in re remote AIS:

HTTP or HTTPS is a global choice - if the cluster listens to HTTP it'll use HTTP for all external and intra-cluster comm-s. And vice versa.

$ ais config cluster net --json

    "net": {
        "l4": {
            "proto": "tcp",
            "sndrcv_buf_size": 131072
        },
        "http": {
            "server_crt": "server.crt",
            "server_key": "server.key",
            "write_buffer_size": 65536,
            "read_buffer_size": 65536,
            "use_https": false,
            "skip_verify": false,
            "chunked_transfer": true
        }
    }
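Given that the scheme is a global choice, a likely explanation for the `retrying remais` errors shown earlier is that ais1 itself runs plain HTTP (`use_https: false` above) while the remote was attached via https://. For an HTTPS attachment to work, the local cluster's `net.http` section would need to look roughly like the following sketch (certificate paths are assumptions):

```
"http": {
    "server_crt": "/var/certs/tls.crt",
    "server_key": "/var/certs/tls.key",
    "write_buffer_size": 65536,
    "read_buffer_size": 65536,
    "use_https": true,
    "skip_verify": false,
    "chunked_transfer": true
}
```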
alex-aizman commented 1 year ago

Separately, it'll help if you start using the latest master, which is almost ready for v3.18 - we will release it soon.

jyuwei commented 1 year ago

@alex-aizman - Thank you for the response!

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
  labels:
    app: aistore
    component: proxy
    function: gateway
  name: aistore-proxy
  namespace: ais
spec:
  rules:
  - host: ais1.xxxxxx.com
    http:
      paths:
      - backend:
          service:
            name: aistore-proxy
            port:
              number: 51080
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - ais1.xxxxxx.com
    secretName: example-tls-cert
# cat /root/.config/ais/cli/cli.json
{
  "cluster": {
    "url": "https://ais1.xxxxxx.com",
    "default_ais_host": "https://ais1.xxxxxx.com",
    "default_docker_host": "http://172.50.0.2:8080",
    "skip_verify_crt": false
  },
  "timeout": {
    "tcp_timeout": "60s",
    "http_timeout": "0s"
  },
  "auth": {
    "url": "http://127.0.0.1:52001"
  },
  "aliases": {
    "cp": "bucket cp",
    "create": "bucket create",
    "get": "object get",
    "ls": "bucket ls",
    "put": "object put",
    "start": "job start",
    "stop": "job stop",
    "wait": "job wait"
  },
  "default_provider": "ais",
  "no_color": false
}
# ais cluster remote-attach ais2=https://ais2.xxxxxx.com
Remote cluster (ais2=https://ais2.xxxxxx.com) successfully attached
# ais config cluster backend.conf -j

    "backend": {"ais":{"ais2":["https://ais2.xxxxxx.com"]},"aws":{"cloud_region":"us-east-1","endpoint":"s3://69xxxxxxxx.s3-control.us-east-1.amazonaws.com"}}

I'll try with the latest master as well. Thanks for letting me know.

saiprashanth173 commented 1 year ago

@jyuwei I could reproduce your issue on a local deployment (non-K8s). Creating the secret in the correct credentials format should fix your issue with AWS.


$ cat ~/.aws/credentials
AWS_ACCESS_KEY_ID=<key>
AWS_DEFAULT_REGION=<>
AWS_SECRET_ACCESS_KEY=<>

$ make deploy 
Enter number of storage targets:
1
Enter number of proxies (gateways):
1
Number of local mountpaths (enter 0 for preconfigured filesystems):
1
Select backend providers:
Amazon S3: (y/n) ?
y
Google Cloud Storage: (y/n) ?
n
Azure: (y/n) ?
n
HDFS: (y/n) ?
n
Loopback device size, e.g. 10G, 100M (creating loopbacks first time may take a while, press Enter to skip): 

Building aisnode 1a6eafa73 [build tags: aws mono]
go: downloading github.com/aws/aws-sdk-go v1.44.264
done.
Proxy is listening on port: 8080

$ ais ls aws:// --all
E 09:54:04.959094 t[iQit8081]: failed to list buckets s3://, err: aws-error[MissingRegion: could not find region configuration]: GET /v1/buckets (called by p[akvp8080]) (stack: [htrun.go:1155 <- tgtbck.go:149 <- tgtbck.go:70 <- target.go:537])
E 09:54:04.962679 t[iQit8081]: failed to list buckets s3://, err: aws-error[MissingRegion: could not find region configuration]: GET /v1/buckets (called by p[akvp8080]) (p[akvp8080]: htrun.go:1155 <- proxy.go:2055 <- proxy.go:564 <- proxy.go:372])
Error: t[iQit8081]: failed to list buckets s3://, err: aws-error[MissingRegion: could not find region configuration]
bboychev commented 1 year ago

Hello @alex-aizman, @saiprashanth173, guys,

Sorry for hijacking this issue for something not exactly related to it. I am trying to configure a specific AWS IAM policy and attach it to a specific AWS IAM user, so that we allow access to one specific AWS S3 bucket only. However, I am receiving an error when using ais ls s3://test-bucket-aisdev, while it works fine with aws s3 ls s3://test-bucket-aisdev (see details below). Of course, we use the same AWS credentials and region in both cases. Any idea what I am doing wrong?

user@aisbox-1:~/go/src/github.com/NVIDIA/aistore$ ais ls --all
NAME                                     PRESENT
s3://bucket_1                                        no
s3://bucket_2                                        no
...
s3://test-bucket-aisdev                                  no
Total: [AWS buckets: 100 (0 present)] ========

user@aisbox-1:~/go/src/github.com/NVIDIA/aistore$
user@aisbox-1:~/go/src/github.com/NVIDIA/aistore$ ais ls s3://test-bucket-aisdev
E 10:49:08.985605 t[GJDt8088]: failed to HEAD remote bucket s3://test-bucket-aisdev, err: aws-error[AccessDenied: Access Denied]: HEAD /v1/buckets/test-bucket-aisdev (called by p[mgSp8080]) (stack: [htrun.go:1155 <- tgtbck.go:518 <- target.go:548])
E 10:49:08.986266 t[GJDt8088]: failed to HEAD remote bucket s3://test-bucket-aisdev, err: aws-error[AccessDenied: Access Denied]: HEAD /v1/buckets/test-bucket-aisdev (called by p[mgSp8080]) (p[mgSp8080]: htrun.go:1155 <- prxbck.go:249 <- prxbck.go:238 <- proxy.go:597 <- proxy.go:372])
Error: t[GJDt8088]: failed to HEAD remote bucket s3://test-bucket-aisdev, err: aws-error[AccessDenied: Access Denied]
user@aisbox-1:~/go/src/github.com/NVIDIA/aistore$
user@aisbox-1:~$ aws s3 ls s3://test-bucket-aisdev
                           PRE test-dir-aisdev
user@aisbox-1:~$

My AWS IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowRWBucketAndObjects",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:ListBucketVersions",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:GetObjectTagging",
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::test-bucket-aisdev",
                "arn:aws:s3:::test-bucket-aisdev/*"
            ]
        },
        {
            "Sid": "AllowListAllBuckets",
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}

By the way: can I have more than one AWS profile configured, e.g. to access multiple S3 buckets from different AWS organizations?

I am using Version: 3.17.3cf1d5271 at the moment.

Best Regards, bboychev

alex-aizman commented 1 year ago

maybe add

"Action": [
    "s3:GetBucketVersioning"
]
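A quick way to verify the permission outside AIS is the standard AWS CLI call for bucket versioning (bucket name taken from the listing above; requires live AWS credentials, so shown for illustration only): if it succeeds for the IAM user in question, the s3:GetBucketVersioning permission is in place.

```shell
# Illustrative check against live AWS; not runnable offline:
#   aws s3api get-bucket-versioning --bucket test-bucket-aisdev
```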

It is easy to check - here's the piece of code and the two operations it executes using aws-sdk-go:

bboychev commented 1 year ago

Hello @alex-aizman,

Thank you! It works fine after appending the "s3:GetBucketVersioning" action to the AWS IAM policy above:

user@aisbox-1:~/go/src/github.com/NVIDIA/aistore$ ais ls s3://test-bucket-aisdev
NAME                     SIZE    CACHED  
test-dir-aisdev/                     0B  no  
user@aisbox-1:~/go/src/github.com/NVIDIA/aistore$

I have tried to configure more than one profile (besides the default) in ~/.aws/credentials and ~/.aws/config, but I was not able to make it work with the ais CLI. I do not see an option to select a specific AWS profile either. Can I have more than one AWS profile configured, e.g. to access multiple S3 buckets from different AWS organizations?

Best Regards, bboychev