futurewei-cloud / alcor-control-agent

Cloud native SDN platform - network control agent
MIT License
14 stars · 29 forks

[Zeta environment setup] Documentation on Zeta+ACA environment setup and test cases #163

Closed er1cthe0ne closed 3 years ago

er1cthe0ne commented 3 years ago

As we move into the next phase of the project, we need to design and set up an environment for Zeta+ACA validation. The request is to create a document that includes the following:

  1. Design of Zeta+ACA testing environment
  2. A picture view of the machine and component setup
  3. Workflow in this environment, e.g. set up the ZGC, send GoalState to ACA, confirm traffic sent through the ZGC, direct path
  4. Detailed test cases

The plan is to have automated tests running in this environment, based on the current ACA testing framework.

er1cthe0ne commented 3 years ago

@Zqy11 - Please provide an update before our next open source meeting. Thanks.

liangbin-pub commented 3 years ago

First integration goal in Lab environment:

liangbin-pub commented 3 years ago

Sample script code to issue a POST request to Zeta:

response=$(curl -H 'Content-Type: application/json' -X POST \
    -d '{"name":"zgc0",
          "description":"zgc0",
          "ip_start":"20.0.0.1",
          "ip_end":"20.0.0.15",
          "port_ibo":"8300"}' \
      172.16.62.247:8080/zgcs)
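A quick way to sanity-check the call is to pull a field out of the JSON response. The response shape below is inferred from the ZGC object shown later in this thread, and the POSIX-tools extraction is only an illustrative sketch (not part of the Zeta scripts):

```shell
# Sample response body (shape inferred from this thread; values illustrative).
response='{"name":"zgc0","description":"zgc0","ip_start":"20.0.0.1","ip_end":"20.0.0.15","port_ibo":"8300"}'

# Extract the "name" field with sed so the check works without jq installed.
name=$(printf '%s' "$response" | sed -n 's/.*"name":"\([^"]*\)".*/\1/p')

# Report success only if the ZGC came back with the expected name.
[ "$name" = "zgc0" ] && echo "ZGC created: $name"
```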
HuaqingTu commented 3 years ago

I installed ACA on a physical machine, but when I run `./build/bin/AlcorControlAgent`, it crashes with "Segmentation fault (core dumped)". Will this affect subsequent tests?

liangbin-pub commented 3 years ago

@HuaqingTu

Before Eric jumps in, can you provide some additional info: What OS is it, Ubuntu 18? What steps did you take from the beginning? Are you following the ACA build procedure? If it's one of the lab servers, please let us know its IP. Thanks,

Bin

HuaqingTu commented 3 years ago

> @HuaqingTu
>
> Before Eric jumps in, can you provide some additional info: What OS is it, Ubuntu 18? What steps did you take from the beginning? Are you following the ACA build procedure? If it's one of the lab servers, please let us know its IP. Thanks,
>
> Bin

  1. OS is Ubuntu 18.04.4 LTS.
  2. I copied the ACA files from my own computer to Computer 18 and Computer 19 to reduce download time. I also changed the shell file to use "https://hib.fastgit.org" to speed up downloads. 11 test cases failed, all related to the OVS bridge.
  3. Computer 17 (39.98.115.249:8247) and Computer 18 (39.98.115.249:8248) are for gateways. Computer 19 (39.98.115.249:8249) and Computer 20 (39.98.115.249:8250) are for compute nodes.
liangbin-pub commented 3 years ago

In item 2 above, you mean 19 and 20, right? On 19 & 20, do you have OVS installed? It is required for running alcor-control-agent and the tests. Install OVS on Ubuntu (18.04) if needed:

sudo apt install openvswitch-switch

If you start a new container, you may need the commands below after installing OVS:

sudo /etc/init.d/openvswitch-switch restart

sudo ovs-vswitchd --pidfile --detach

Follow the build and test procedure in the getting started guide.

er1cthe0ne commented 3 years ago

@HuaqingTu - after following the getting started guide and setting up OVS, are you able to run ./build/bin/AlcorControlAgent and ./build/tests/aca_tests now?

HuaqingTu commented 3 years ago

> @HuaqingTu - after following the getting started guide and setting up OVS, are you able to run ./build/bin/AlcorControlAgent and ./build/tests/aca_tests now?

It worked!

PikaPikaW commented 3 years ago

When I installed Zeta, I executed the `./deploy/full_deploy.sh -d kind` command, and an error occurred while creating the k8s cluster with kind. The error message is as follows:

TASK [Setting up Kind cluster] ***********************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "../kind/create_cluster.sh development 2 3 &>> /tmp/ansible_debug.log", "delta": "0:00:00.001796", "end": "2020-11-25 21:29:50.701291", "msg": "non-zero return code", "rc": 126, "start": "2020-11-25 21:29:50.699495", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
liangbin-pub commented 3 years ago

Can you run `cat /tmp/ansible_debug.log` and post the output?

PikaPikaW commented 3 years ago

Hello, the following problem occurred when I installed Zeta:

TASK [Deploy zeta-manager service] ************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "../install/deploy_zeta_manager.sh &>>/tmp/ansible_debug.log", "delta": "0:05:53.159439", "end": "2020-11-27 16:52:04.018989", "msg": "non-zero return code", "rc": 1, "start": "2020-11-27 16:46:10.859550", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP ************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

Running `cat /tmp/ansible_debug.log` shows the following log:

Deleting cluster "kind" ...
Deleting existing zeta-node containers
Rebuild and publish zeta_node image to localhost:5000...
Rebuild and publish zeta_droplet image to localhost:5000...
Creating zeta-node-1
44ea45b3388c3e97f8b5aa0cd0f64b7deb2ff06515a4f9e416e59e47b525b01b
Creating zeta-node-2
e624b0db694b62e88fbde6e52e684e63a7013292f53886576349c9b059bf2d3f
Creating zeta-node-3
bf83a6223f0a37f546fa5abe718185c5120affb75ea02f82fc2062170003dc45
Creating cluster "kind" ...
 • Ensuring node image (localhost:5000/zeta_node:latest) 🖼  ...
 ✓ Ensuring node image (localhost:5000/zeta_node:latest) 🖼
 • Preparing nodes 📦   ...
 ✓ Preparing nodes 📦 
 • Writing configuration 📜  ...
 ✓ Writing configuration 📜
 • Starting control-plane 🕹️  ...
 ✓ Starting control-plane 🕹️ 
 • Installing CNI 🔌  ...
 ✓ Installing CNI 🔌
 • Installing StorageClass 💾  ...
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind --kubeconfig /root/.kube/config.kind

Not sure what to do next? 😅  Check out https://kind.sigs.k8s.io/docs/user/quick-start/
configmap/local-registry-hosting created
Rebuild zeta-operator image...
Rebuild zeta-manager image...
customresourcedefinition.apiextensions.k8s.io/chains.zeta.com created
customresourcedefinition.apiextensions.k8s.io/dfts.zeta.com created
customresourcedefinition.apiextensions.k8s.io/droplets.zeta.com created
customresourcedefinition.apiextensions.k8s.io/ftns.zeta.com created
customresourcedefinition.apiextensions.k8s.io/fwds.zeta.com created
Creating the zeta-operator deployment and pod...
serviceaccount/zeta-operator created
clusterrolebinding.rbac.authorization.k8s.io/zeta-operator created
deployment.apps/zeta-operator created
Creating the zeta-manager deployment and service...
deployment.apps/zeta-manager created
service/zeta-manager created
pod/zeta-manager-8d97bc4dc-cl8r2 condition met
Waiting for postgres service ready for connection......................
...............................................Time out after 300s
liangbin-pub commented 3 years ago

Are you using 172.16.62.247 and 172.16.62.248? I can't access them. 249 & 250 seem to be for ACA only.

PikaPikaW commented 3 years ago

Yes, I installed Zeta on 247 and 248, but I am not sure whether Zeta was actually installed. After running ./deploy/full_deploy.sh -d kind, the problems mentioned above were printed out, and a POST to http://172.16.62.247:8080/zgcs got no response. I copied down some of the information after I ran full_deploy.sh.

The running container information on host 247 is as follows:

IMAGE                                NAMES                 PORTS
localhost:5000/zeta_node:latest      kind-control-plane    0.0.0.0:443->443/tcp, 0.0.0.0:8080->80/tcp, 127.0.0.1:45417->6443/tcp

localhost:5000/zeta_droplet:latest   zeta-node-3           
localhost:5000/zeta_droplet:latest   zeta-node-2           
localhost:5000/zeta_droplet:latest   zeta-node-1           
registry:2                           local-kind-registry   0.0.0.0:5000->5000/tcp
zeta_build:latest                    zb

Image information on host 247:

REPOSITORY                    TAG                 SIZE
localhost:5000/zeta_opr       latest              1.11GB
localhost:5000/zeta_droplet   latest              1.98GB
localhost:5000/zeta_node      latest              1.75GB
localhost:5000/zeta_manager   latest              247MB
zeta_build                    latest              1.92GB
fwnetworking/zeta_dev         latest              1.92GB

Output of `lsof -i:8080`:

COMMAND      PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
docker-pr 452036 root    4u  IPv6 6083581      0t0  TCP *:http-alt (LISTEN)

I think the installation ran as far as deploy_zeta_manager.sh and stopped at the section below:

REGISTRY="$REG" \
envsubst '$REGISTRY' < $DEPLOYMENTS_PATH/zeta-manager-deployment.yml > $DEPLOYMENTS_PATH/.zeta-manager-deployment.yml
kubectl apply -f $DEPLOYMENTS_PATH/.zeta-manager-deployment.yml
kubectl apply -f $DEPLOYMENTS_PATH/zeta-manager-service.yml
kubectl wait --for=condition=ready pod -l app=zeta-manager --timeout=300s

echo -n "Waiting for postgres service ready for connection..."
POD_ZM="$(kubectl get pod --field-selector status.phase=Running -l app=zeta-manager -o jsonpath='{.items[0].metadata.name}')"
end=$((SECONDS + 300))
ready="Not Ready"
while [[ $SECONDS -lt $end ]]; do
    ready="$(kubectl exec $POD_ZM -- cat /tmp/healthy 2>&1 | head -n1)"
    if [ -z "$ready" ]; then
        ready="ready"
        break
    fi
    echo -n "."
    sleep 2
done
if [ "$ready" != "ready" ]; then
    echo "Time out after 300s"
    exit 1
fi
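The readiness check above is a generic poll-until-deadline loop. The helper below restates that pattern in plain shell as a sketch; the function name and the trivial probe are illustrative, not part of the Zeta scripts (a real probe would be the `kubectl exec $POD_ZM -- cat /tmp/healthy` call from the excerpt):

```shell
# Poll a probe command until it succeeds or a deadline passes,
# mirroring the $SECONDS-based loop in deploy_zeta_manager.sh.
wait_ready() {
  timeout=$1; shift
  end=$((SECONDS + timeout))
  while [ "$SECONDS" -lt "$end" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0          # probe succeeded within the deadline
    fi
    printf '.'          # progress dots, as in the original script
    sleep 1
  done
  return 1              # deadline passed without a successful probe
}

# Illustrative usage with a trivially succeeding probe:
wait_ready 5 true && echo "ready"
```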

So what should I do next? And how do I confirm that Zeta is installed?

liangbin-pub commented 3 years ago

Please check why the remote connection to 247 and 248 is not working; I need to ssh onto these two: 39.98.115.249:8247 and 39.98.115.249:8248. SSH access is not working according to the instructions sent to me before. Please don't send the username/password here; if they changed, send them to me through email. Also, the problem seems to be on 39.98.115.249: connections to 8247 and 8248 are rejected.

PikaPikaW commented 3 years ago

Sorry, some wrong operations on my part caused the remote login failure. 8247 should be restored now, but 8248 is still not available.

liangbin-pub commented 3 years ago

@PikaPikaW There are a few issues in 247 environment:

  1. You should NOT use root to build and deploy; just use a normal user (sdn). I have fixed the access issues this caused.
  2. There are problems fetching the container images for postgres and nginx, which causes the zeta services deployment to time out:

default pod/postgres-7875689b5-q4cpz       0/1   ContainerCreating   0   8m55s
default pod/zeta-manager-8d97bc4dc-qlkql   0/1   ContainerCreating   0   8m48s

Is this related to the issue with downloading from the mirror site?

liangbin-pub commented 3 years ago

I did a manual image pull for postgres; it's super slow, so the deployment will certainly fail. Since all Zeta services are locally built, they load fast, but postgres and ingress-nginx need to be accelerated: maybe manually pull them and push them to the local registry (the existing localhost:5000). Then the yaml files need to be modified to point the images at where you pushed them; see deploy/install/deploy_postgres.sh and deploy_ingress_nginx.sh.

PikaPikaW commented 3 years ago

OK. So I need to pull these images and push them to the local registry, change the scripts to use the locally pushed images, and avoid using root to build and deploy.

Some doubts: Which specific images do I need to pull locally? I found postgres:12.1-alpine, but the other one isn't named in the deploy_ingress_nginx.sh script. There is a website, but I can't open it; do I need a VPN?

# deploy_ingress_nginx.sh
echo "Create Nginx Ingress Controller..."
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/kind/deploy.yaml >/dev/null

# Remove unnecessary validation through webhooks when new ingresses are added
kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission &>/dev/null

Are the Zeta service images already pushed to the local registry, and is that why they load quickly? Can I lengthen the 300s timeout to ensure the images load successfully? I don't know how to change the YAML file, so I wonder if changing this timeout is OK.
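One way to organize the manual pull/retag/push workflow discussed above is to derive the local-registry tag from the upstream reference, then tag and push. The helper below is only an illustrative sketch (the function name and tag scheme are assumptions, not part of the Zeta scripts), and the docker commands are shown as comments since they need a running daemon:

```shell
# Derive a localhost:5000 tag from an upstream image reference.
local_tag() {
  ref="${1%%@*}"       # drop any @sha256:... digest suffix
  name="${ref##*/}"    # strip the registry and namespace components
  echo "localhost:5000/${name%%:*}:latest"
}

# Illustrative workflow (requires a docker daemon, so shown as comments):
#   docker pull postgres:12.1-alpine
#   docker tag  postgres:12.1-alpine "$(local_tag postgres:12.1-alpine)"
#   docker push "$(local_tag postgres:12.1-alpine)"

local_tag "k8s.gcr.io/ingress-nginx/controller:v0.41.2@sha256:1f4f402b9c14"
```

After the push, the deployment yaml's `image:` field is pointed at the printed tag, as done later in this thread for postgres and ingress-nginx.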

liangbin-pub commented 3 years ago

I found a mirror site for the above yaml (from https://blog.csdn.net/networken/article/details/105122778): https://raw.sevencdn.com/kubernetes/ingress-nginx/master/deploy/static/provider/kind/deploy.yaml

You can try extending the timeout in deploy_zeta_manager.sh, but that may not address the issue if the deployment yaml or image is not accessible.

PikaPikaW commented 3 years ago

Hello. In deploy_postgres.sh, of all the yml files used I found only one image, postgres:12.1-alpine, that needs to be pulled (in postgres-deployment.yml). So I pulled the postgres image locally, tagged it localhost:5000/postgres, and pushed it so that it is pulled locally. In postgres-deployment.yml I changed the image to point to localhost:5000/postgres:latest:

 spec:
            containers:
      - name: postgres
        image: localhost:5000/postgres:latest
        env:
          - name: POSTGRES_USER
            valueFrom:
              secretKeyRef:
                name: postgres-credentials
                key: user
          - name: POSTGRES_PASSWORD
            valueFrom:

Similarly for deploy_ingress_nginx.sh, I saved the yml file from the URL locally in ./deploy/etc/deployments, named it ingress-nginx-deployment.yml, then registered locally the images the file needs to pull, as follows:

k8s.gcr.io/ingress-nginx/controller:v0.41.2@sha256:1f4f402b9c14f3ae92b11ada1dfe9893a88f0faeb0b2f4b903e2c67a0c3bf0de
registered as
localhost:5000/ingress-nginx-controller:latest

docker.io/jettech/kube-webhook-certgen:v1.5.0
registered as
localhost:5000/kube-webhook-certgen:latest

Then I changed the image references in the file, but at the end the installation still printed a timeout. I don't know if I'm doing this right?

liangbin-pub commented 3 years ago

I checked 247; the problem is only postgres now. The pod is not there because there is a small error in postgres-deployment.yml: the indent of "containers" was changed incorrectly, causing:

sdn@computer17:~/Zeta/zeta$ ./deploy/install/deploy_postgres.sh
Creating the volume...
persistentvolume/postgres-pv unchanged
persistentvolumeclaim/postgres-pvc unchanged
Creating the database credentials...
secret/postgres-credentials unchanged
Creating the postgres deployment and service...
error: error parsing /home/sdn/Zeta/zeta/deploy/install/../etc/deployments/postgres-deployment.yml: error converting YAML to JSON: yaml: line 28: did not find expected key
service/postgres unchanged

I fixed this part and deployed again; zeta-manager is still not up. Checking the log shows:

sdn@computer17:~/Zeta/zeta$ kubectl logs zeta-manager-8d97bc4dc-gc2r4
standard_init_linux.go:211: exec user process caused "no such file or directory"

Since we have never hit this kind of error, I checked the diff in your repo and noticed all files were modified with Windows-style line/file endings. I will check which one caused the problem.

liangbin-pub commented 3 years ago

It seems all files were affected by Windows-style line endings. I fixed them with

find . -type f -print0 | xargs -0 dos2unix --

and did a full deploy; it deploys successfully now. You can access the Zeta NBI API through port 8080:

sdn@computer17:~/Zeta/zeta$ curl http://localhost:8080/zgcs
[
  {
    "description": "zgc0",
    "id": 1,
    "ip_end": "20.0.0.255",
    "ip_start": "20.0.0.0",
    "name": "zgc0",
    "nodes": [],
    "overlay_type": "vxlan",
    "port_ibo": 8300,
    "vpcs": [],
    "zgc_id": "5b2e21d3-9418-4468-8d51-c513861bfdf5"
  }
]

liangbin-pub commented 3 years ago

So, mainly, there were three issues deploying in your 247 environment:

  1. Avoid using the root user to build and deploy
  2. Fix accessibility to some yaml files/images using mirror sites (good work!)
  3. Avoid adding Windows-style file/line endings; docker has issues with them
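To keep issue 3 from recurring, line endings can be normalized in git itself. The fragment below is a common `.gitattributes` convention, not something taken from the Zeta repo:

```
# .gitattributes — keep text files LF in the repo and working tree
* text=auto eol=lf
*.sh text eol=lf
```

With this in place, git converts CRLF to LF on commit and checkout, so shell scripts baked into docker images keep Unix line endings even when edited on Windows.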
PikaPikaW commented 3 years ago

Now that Zeta has been installed and Zeta's interface is available, I think I need to read the ACA gtest and learn how the RPC works. Is there any RPC script already written to access ACA? Can you send a link, please?

er1cthe0ne commented 3 years ago

@PikaPikaW - great progress to have Zeta installed. For the ACA gtest, take a look at /test/gtest/aca_test_ovs_l2.cpp, specifically DISABLED_2_ports_CREATE_test_traffic_PARENT and DISABLED_2_ports_CREATE_test_traffic_CHILD, to see how to set up the goal state and do traffic testing. Execution instructions are at the top of the file or at https://github.com/futurewei-cloud/alcor-control-agent/wiki/How-to-run-the-full-suite-of-aca_tests. @zhangml has already started modifying and running the gtest.
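For reference, GoogleTest skips tests whose names start with `DISABLED_` unless explicitly asked to run them. An invocation along these lines selects just the traffic tests; this is a sketch, so check the wiki page above for the project's exact procedure:

```shell
# gtest needs --gtest_also_run_disabled_tests before it will execute
# tests whose names start with DISABLED_; the filter narrows the run
# to the two traffic tests mentioned above.
./build/tests/aca_tests --gtest_also_run_disabled_tests \
    --gtest_filter='*DISABLED_2_ports_CREATE_test_traffic_*'
```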

er1cthe0ne commented 3 years ago

Current documentation has been merged with #173.