ESIPFed / esiphub-dev

Development JupyterHub on AWS targeting pangeo environment for National Water Model exploration
MIT License

Build cluster in us-east using kops #23

Closed rsignell-usgs closed 6 years ago

rsignell-usgs commented 6 years ago

We would like to have a cluster built using kops so that we have a better chance of using autoscaling, and because the AWS Landsat data is in us-east-1.

rsignell-usgs commented 6 years ago

Yesterday I took a one-day AWS class on Kubernetes (they suggested kops over EKS for now). They used this guide: https://github.com/aws-samples/aws-workshop-for-kubernetes. Following the initial steps resulted in creating this IAM role (screenshot 2018-08-22_10-50-23), which seems to meet the requirements of the first step of the Zero to JupyterHub guide.

rsignell-usgs commented 6 years ago

Following the rest of the z2jh guide's step zero, I assigned the above role to the small (t2.small) instance we are using as the CI host, which we can access via:

ssh -i "kops.pem" ec2-user@ec2-18-208-141-112.compute-1.amazonaws.com
rsignell-usgs commented 6 years ago

I created a cluster with these specs based on some advice from Jacob:

kops create cluster kopscluster.k8s.local \
  --zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f  \
  --authorization RBAC \
  --master-size t2.small \
  --master-volume-size 10 \
  --node-size m5.2xlarge \
  --master-count 3 \
  --node-count 2 \
  --node-volume-size 120 \
  --yes

I then followed the z2jh guide up through "Setting up Helm".
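For reference, the "Setting up Helm" step in the z2jh guide at the time used Helm 2, so it amounted to creating a service account for tiller and initializing Helm with it. A sketch of the commands from that era of the guide:

```shell
# Create a service account for tiller and give it cluster-admin,
# as in the Helm 2-era z2jh "Setting up Helm" instructions:
kubectl --namespace kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller \
    --clusterrole cluster-admin \
    --serviceaccount=kube-system:tiller

# Initialize Helm and install tiller into the cluster:
helm init --service-account tiller --wait

# Verify that the client and server versions match:
helm version
```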

rsignell-usgs commented 6 years ago

Wow, I finally overcame the struggles by destroying and recreating the cluster using m4 instances instead of m5. What a nightmare!

rsignell-usgs commented 6 years ago

So this is what finally worked:

kops create cluster kopscluster.k8s.local \
  --zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f  \
  --authorization RBAC \
  --master-size t2.small \
  --master-volume-size 10 \
  --node-size m4.2xlarge \
  --master-count 3 \
  --node-count 2 \
  --node-volume-size 120 \
  --yes
rsignell-usgs commented 6 years ago

According to @jacobtomlinson here: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/870#issuecomment-416883355 it appears we could run m5 instances by just adding:

--image kope.io/k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-02-08

to the above kops create cluster command.

rsignell-usgs commented 6 years ago

Summary of Pangeo install on AWS

The Pangeo helm chart layers on top of the JupyterHub helm chart, so the instructions for Pangeo and JupyterHub are the same up to the helm install step. So the recipe is:

Follow the Zero to JupyterHub guide (https://zero-to-jupyterhub-with-kubernetes.readthedocs.io/en/latest/) for deployment on AWS, until you get to the "Setting up JupyterHub" page (https://zero-to-jupyterhub-with-kubernetes.readthedocs.io/en/latest/setup-jupyterhub.html)

We used kops to create the cluster:

kops create cluster kopscluster.k8s.local \
  --zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f  \
  --authorization RBAC \
  --master-size t2.small \
  --master-volume-size 10 \
  --node-size m4.2xlarge \
  --master-count 3 \
  --node-count 2 \
  --node-volume-size 120 \
  --yes

and enabled autoscaling following these instructions: https://akomljen.com/kubernetes-cluster-autoscaling-on-aws/ with these settings:

helm install --name autoscaler \
    --namespace kube-system \
    --set image.tag=v1.2.1 \
    --set autoDiscovery.clusterName=kopscluster.k8s.local \
    --set extraArgs.balance-similar-node-groups=false \
    --set extraArgs.expander=random \
    --set rbac.create=true \
    --set rbac.pspEnabled=true \
    --set awsRegion=us-east-1 \
    --set nodeSelector."node-role\.kubernetes\.io/master"="" \
    --set tolerations[0].effect=NoSchedule \
    --set tolerations[0].key=node-role.kubernetes.io/master \
    --set cloudProvider=aws \
    stable/cluster-autoscaler
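After the install, it's worth checking that the autoscaler pod actually came up. A quick sanity check (release name as above; exact pod and deployment names depend on the chart version):

```shell
# The autoscaler runs as a deployment in kube-system; look for a pod
# in Running state:
kubectl --namespace kube-system get pods | grep autoscaler

# The autoscaler logs its scaling decisions, which is useful when a
# scale-up doesn't happen (deployment name may differ by chart version):
kubectl --namespace kube-system logs deployment/autoscaler-aws-cluster-autoscaler
```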

I set up autoscaling groups in each zone (us-east-1a, us-east-1b, ... us-east-1f) by running commands like:

kops edit ig nodes-us-east-1a-m4-2xlarge.kopscluster.k8s.local

and dropping in info like this:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-10T13:27:48Z
  labels:
    kops.k8s.io/cluster: kopscluster.k8s.local
  name: nodes-us-east-1a-m4-2xlarge.kopscluster.k8s.local
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/node-template/label: ""
    kubernetes.io/cluster/kopscluster.k8s.local: owned
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  machineType: m4.2xlarge
  maxSize: 50
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-east-1a-m4-2xlarge.kopscluster.k8s.local
  role: Node
  rootVolumeSize: 120
  subnets:
  - us-east-1a

After this it's just configuring and installing the Pangeo helm chart, starting with step 4 of the Pangeo instructions here: https://github.com/pangeo-data/pangeo/blob/master/docs/setup_guides/cloud.rst

rsignell-usgs commented 5 years ago

I also installed the helm chart for the s3-fuse-flex-volume project from the Informatics Lab, which provides the pysssix and goofys flexVolumes, giving the ability to read any S3 bucket as /s3/<bucket> and to write to an S3 bucket at /scratch.

My worker-template.yaml looks like this:

metadata:
spec:
  restartPolicy: Never
  volumes:
    - flexVolume:
        driver: informaticslab/pysssix-flex-volume
        options:
          readonly: "true"
      name: s3
    - flexVolume:
        driver: informaticslab/goofys-flex-volume
        options:
          bucket: "esipfed-scratch"
          dirMode: "0777"
          fileMode: "0777"
      name: scratch
  containers:
  - args:
      - dask-worker
      - --nthreads
      - '2'
      - --no-bokeh
      - --memory-limit
      - 6GB
      - --death-timeout
      - '60'
    image: esip/pangeo-notebook:2018-09-21
    name: dask-worker
    securityContext:
      capabilities:
        add: [SYS_ADMIN]
      privileged: true

    volumeMounts:
    - mountPath: /s3
      name: s3
    - mountPath: /scratch
      name: scratch
    resources:
      limits:
        cpu: "1.75"
        memory: 6G
      requests:
        cpu: "1.75"
        memory: 6G
h4gen commented 5 years ago

@rsignell-usgs

and dropping in info like this:

Can you be a bit more precise on this? How do I incorporate this information? I am pretty new to the whole Kubernetes topic, so it is not (yet) obvious to me. Also, I guess this step has to be completed before setting up the groups with kops edit ig ..., correct? Otherwise I get the following error:

error reading InstanceGroup "nodes-eu-central-1a-m4-2xlarge.hive.k8s.local": InstanceGroup.kops "nodes-eu-central-1a-m4-2xlarge.hive.k8s.local" not found

Here is what I tried. Create a new file cluster_settings.yaml with:

...
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
 creationTimestamp: 2018-09-10T13:27:48Z
 labels:
   kops.k8s.io/cluster: hive.k8s.local
 name: nodes-eu-central-1c-m4-2xlarge.hive.k8s.local
spec:
 cloudLabels:
   k8s.io/cluster-autoscaler/enabled: ""
   k8s.io/cluster-autoscaler/node-template/label: ""
   kubernetes.io/cluster/hive.k8s.local: owned
 image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
 machineType: m4.2xlarge
 maxSize: 50
 minSize: 0
 nodeLabels:
   kops.k8s.io/instancegroup: nodes-eu-central-1c-m4-2xlarge.hive.k8s.local
 role: Node
 rootVolumeSize: 120
 subnets:
 - eu-central-1c

and then running:

helm upgrade  autoscaler \
    --namespace kube-system \
    --set image.tag=v1.2.1 \
    --set autoDiscovery.clusterName=hive.k8s.local \
    --set extraArgs.balance-similar-node-groups=false \
    --set extraArgs.expander=random \
    --set rbac.create=true \
    --set rbac.pspEnabled=true \
    --set awsRegion=eu-central-1 \
    --set nodeSelector."node-role\.kubernetes\.io/master"="" \
    --set tolerations[0].effect=NoSchedule \
    --set tolerations[0].key=node-role.kubernetes.io/master \
    --set cloudProvider=aws \
    stable/cluster-autoscaler \
    -f cluster_settings.yaml

Thank you!

rsignell-usgs commented 5 years ago

@h4gen, I meant that I typed kops edit ig <name> and then pasted that info into the editor that opens.
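For what it's worth, the "not found" error above suggests the instance group didn't exist yet: kops edit ig can only modify existing groups. A hedged sketch of creating one first, using @h4gen's names from above:

```shell
# Create the instance group first (this opens an editor pre-filled with
# a template; paste the spec there), then apply. You may also need
# --name hive.k8s.local if the cluster name isn't set in the environment.
kops create ig nodes-eu-central-1c-m4-2xlarge --subnet eu-central-1c
kops edit ig nodes-eu-central-1c-m4-2xlarge
kops update cluster --yes
```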

rsignell-usgs commented 5 years ago

One more piece: setting up "pangeo.esipfed.org" to be our endpoint.

  1. On the ESIP GitHub organization, under developer settings, under OAuth apps, we clicked "New OAuth App" and made note of the Client ID and Client Secret (screenshots 2018-10-30_11-49-34, 2018-10-30_11-50-45).

  2. Added those to our secret-config.yaml (screenshot 2018-10-30_11-51-16).

  3. On networksolutions.com (like godaddy), we set pangeo.esipfed.org to point to the Amazon URL (screen shot 2018-09-05 at 1 23 59 pm).
aolt commented 5 years ago

Thanks @rsignell-usgs! Was there anything special about setting up the load balancer on AWS and assigning it to the kops cluster?

rsignell-usgs commented 5 years ago

@aolt, you installed the autoscaler helm chart, right? We don't have any extra load balancer that I know of.

aolt commented 5 years ago

@rsignell-usgs yes I did; everything installed nicely and is in a running state, except for the external IP. In jupyter-config.yaml one has to specify an external IP with loadBalancerIP:. My kops cluster is installed in a private VPC, so I assume I have to allocate some external IP with AWS that I can use in the jupyter-config.yaml file.

kubectl describe service proxy-public --namespace pangeo
Name:                     proxy-public
Namespace:                pangeo
Labels:                   app=jupyterhub
                          chart=jupyterhub-0.7.0
                          component=proxy-public
                          heritage=Tiller
                          release=jupyter
Annotations:              <none>
Selector:                 component=proxy,release=jupyter
Type:                     LoadBalancer
IP:                       100.67.54.89
IP:                       35.175.192.236
Port:                     http  80/TCP
TargetPort:               8000/TCP
NodePort:                 http  32646/TCP
Endpoints:                100.96.2.5:8000
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                      Age                   From                Message
  ----     ------                      ----                  ----                -------
  Normal   EnsuringLoadBalancer        2m20s (x11 over 27m)  service-controller  Ensuring load balancer
  Warning  CreatingLoadBalancerFailed  2m20s (x11 over 27m)  service-controller  Error creating load balancer (will retry): failed to ensure load balancer for service pangeo/proxy-public: LoadBalancerIP cannot be specified for AWS ELB
aolt commented 5 years ago
  1. On networksolutions.com (like godaddy), we set pangeo.esipfed.org to point to the Amazon URL:
screen shot 2018-09-05 at 1 23 59 pm

Let me rephrase my question: where do you get this long *.elb.amazonaws.com address? Thanks!

rsignell-usgs commented 5 years ago

@aolt , I assume you figured this out, but when you do a helm install or upgrade, it prints out a statement like this:

You can find the public IP of the JupyterHub by doing:

 kubectl --namespace=esip-pangeo get svc proxy-public

It might take a few minutes for it to appear!
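On AWS the proxy-public service gets an ELB hostname rather than a numeric IP, so the long *.elb.amazonaws.com address shows up in the EXTERNAL-IP column of that command's output; it can also be pulled out directly:

```shell
# The ELB hostname appears under EXTERNAL-IP once provisioning finishes:
kubectl --namespace=esip-pangeo get svc proxy-public

# Or extract just the hostname:
kubectl --namespace=esip-pangeo get svc proxy-public \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```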
aolt commented 5 years ago

Thanks @rsignell-usgs, the issue was that I was missing the ingress helm chart: https://github.com/pangeo-data/pangeo/issues/71#issuecomment-435834926