MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Updating Polykey infrastructure to replace Fargate with EC2 instances #486

Closed: tegefaulkes closed this issue 1 year ago

tegefaulkes commented 2 years ago

Specification

To simplify the infrastructure on AWS, we're looking at replacing our Fargate usage with EC2 instances to host nodes. This won't require code changes within the Polykey repo; this issue is just to document the process and progress.

Additional context

Tasks

  1. Set up an EC2 instance and configure it to work with the ECS cluster we're using to spawn node containers.
  2. Verify it's working by controlling the spawned agent over a client connection.
  3. Verify functionality by joining the Polykey network over agent-to-agent communication.
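As a rough illustration of task 1: on an ECS-optimized AMI, the ECS container agent joins whichever cluster is named by `ECS_CLUSTER` in `/etc/ecs/ecs.config`, and that file is usually written by the instance's user data at launch. The sketch below only builds that user data script; the helper name `ecsUserData` and the cluster name `polykey-testnet` are hypothetical, not taken from the Polykey infrastructure repository.

```typescript
// Hypothetical helper: builds EC2 user data that registers the instance with
// an ECS cluster. Assumes an ECS-optimized AMI, where the ECS container agent
// reads /etc/ecs/ecs.config on startup.
function ecsUserData(clusterName: string): string {
  // ECS cluster names allow only letters, digits, hyphens and underscores
  // (max 255), which also keeps the value safe to embed in the script below.
  if (!/^[A-Za-z0-9_-]{1,255}$/.test(clusterName)) {
    throw new Error(`invalid ECS cluster name: ${clusterName}`);
  }
  return [
    "#!/bin/bash",
    `echo "ECS_CLUSTER=${clusterName}" >> /etc/ecs/ecs.config`,
  ].join("\n") + "\n";
}

// Example (placeholder cluster name):
// const userData = ecsUserData("polykey-testnet");
```

The resulting script would be passed as the instance's user data (for example through the console or `aws ec2 run-instances --user-data`); the instance also needs an instance profile that grants the ECS agent its permissions before it shows up under the cluster's container instances.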
tegefaulkes commented 2 years ago

Reference material

tegefaulkes commented 2 years ago

This can be completed in stages. First, we can set up the EC2 instance with ECS manually. After that, we can look at updating the infrastructure as code in the Polykey infrastructure repository.

tegefaulkes commented 2 years ago

Looks like AWS lets you create a new ECS cluster backed by EC2 instances. We can specify the number of instances and the instance type. It should do all the setup for us and be much quicker for testing than doing the whole setup through the CLI-based IaC.

@CMCDragonkai have you seen this already? Is there a reason you went with the manual setup? Do we want to use this method to quickly set up the testnet for testing?

For reference https://docs.aws.amazon.com/AmazonECS/latest/developerguide/launch_types.html?icmpid=docs_ecs_/chapter/docs_ecs_hp-create-cluster-infrastructure

tegefaulkes commented 2 years ago

Here's a run sheet for setting up an ECS cluster on EC2 instances using the AWS CLI. It should be a very useful reference for implementing the setup as infrastructure as code.

tegefaulkes commented 1 year ago

In terms of registering instance IP addresses (or container IP addresses; see the host network mode usage): at some point the instance or container will have a PUBLIC IP, so we want "dynamic DNS" for the container IP (even if it is just the IP inherited from the host). A couple of ways of doing this:

  1. Dynamic DNS at the EC2 instance level (but this isn't entirely robust)
  2. AWS Lambda triggered by CloudWatch Events on ECS cluster events
  3. AWS Lambda triggered by CloudWatch Events on EC2 instance events
  4. Using Route 53, then a CNAME to Cloudflare
  5. Directly to Cloudflare, with a script that calls the API using curl; a userdata script could set up the boot scripts for systemd
  6. Some other system that understands ECS cluster events
  7. Worth checking out "ECS service discovery", if all the container IPs (even if inherited from the host) can be registered under some public DNS
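For options 2 and 3, the Lambda would receive CloudWatch Events (EventBridge) events: "ECS Task State Change" from `aws.ecs`, or "EC2 Instance State-change Notification" from `aws.ec2`. A minimal sketch of just the filtering step, assuming only the event fields typed below; the `dnsTriggerFor` function and `DnsTrigger` type are hypothetical names.

```typescript
// Hypothetical Lambda-side filter for options 2 and 3: decide whether an
// incoming event means a node may have come up with a public IP, and return
// the identifiers needed to look that IP up. Only the fields used are typed.
interface AwsEvent {
  source: string;
  "detail-type": string;
  detail: Record<string, unknown>;
}

type DnsTrigger =
  | { kind: "ecs-task"; clusterArn: string; containerInstanceArn: string; taskArn: string }
  | { kind: "ec2-instance"; instanceId: string };

function dnsTriggerFor(event: AwsEvent): DnsTrigger | undefined {
  const d = event.detail;
  if (event.source === "aws.ecs" && event["detail-type"] === "ECS Task State Change") {
    // Only act once the task is running on an EC2-backed container instance.
    const containerInstanceArn = d["containerInstanceArn"];
    if (d["lastStatus"] === "RUNNING" && typeof containerInstanceArn === "string") {
      return {
        kind: "ecs-task",
        clusterArn: String(d["clusterArn"]),
        containerInstanceArn,
        taskArn: String(d["taskArn"]),
      };
    }
  } else if (
    event.source === "aws.ec2" &&
    event["detail-type"] === "EC2 Instance State-change Notification"
  ) {
    const instanceId = d["instance-id"];
    if (d["state"] === "running" && typeof instanceId === "string") {
      return { kind: "ec2-instance", instanceId };
    }
  }
  // Anything else (other sources, non-running states) needs no DNS update.
  return undefined;
}
```

From an `ecs-task` trigger, the Lambda would still need `DescribeContainerInstances` (to map the container instance to its EC2 instance ID) and then `DescribeInstances` (to read the public IP) before updating DNS.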

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/dynamic-dns.html

Service discovery won't work. It's for private IPs.

Someone went to great lengths to make this possible: https://github.com/aws/aws-cdk/pull/10646
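Option 5 (direct to Cloudflare) could look roughly like this: a boot script reads the instance's public IPv4 address from the instance metadata service (`http://169.254.169.254/latest/meta-data/public-ipv4`) and sends a `PUT` to Cloudflare's API v4 DNS records endpoint. This minimal sketch only builds the request; the helper `cloudflareARecordUpdate` is hypothetical, the zone ID, record ID, token and hostname are placeholders, and it assumes the A record already exists.

```typescript
// Hypothetical sketch of option 5: describe the Cloudflare API v4 call that
// points an existing A record at this instance's current public IP. It only
// builds the request; a boot script would send it with curl or fetch.
interface DnsUpdateRequest {
  method: "PUT";
  url: string;
  headers: Record<string, string>;
  body: string;
}

function cloudflareARecordUpdate(
  zoneId: string,
  recordId: string,
  apiToken: string,
  hostname: string,
  publicIp: string,
  ttl = 60,
): DnsUpdateRequest {
  // Reject anything that is not a dotted-quad IPv4 address.
  const octets = publicIp.split(".");
  const valid =
    octets.length === 4 &&
    octets.every((o) => /^\d{1,3}$/.test(o) && Number(o) <= 255);
  if (!valid) throw new Error(`not an IPv4 address: ${publicIp}`);
  return {
    method: "PUT",
    url: `https://api.cloudflare.com/client/v4/zones/${zoneId}/dns_records/${recordId}`,
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    // proxied: false keeps the record "DNS only", so clients connect to the
    // node's IP directly instead of through Cloudflare's HTTP proxy.
    body: JSON.stringify({ type: "A", name: hostname, content: publicIp, ttl, proxied: false }),
  };
}

// Example with placeholders and the IP from the client test in this thread:
// cloudflareARecordUpdate("ZONE_ID", "RECORD_ID", "TOKEN", "node1.example.com", "54.253.185.58");
```

`proxied: false` matters here because Cloudflare's proxy only forwards HTTP(S) traffic, and the node's client and proxy ports are not plain HTTP. If IMDSv2 is enforced on the instance, the metadata read first needs a session token from `PUT /latest/api/token`.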

tegefaulkes commented 1 year ago

I can connect on a client connection!

[nix-shell:~/matrixcode/polykey/js-polykey]$ npm run polykey -- agent status --client-host 54.253.185.58 --client-port 1315 --node-id vcsbapctn5pljtgj97pd6i5eu33ub2foc9be8jfnfa1q3q76il820

> polykey@1.0.1-alpha.0 polykey
> ts-node src/bin/polykey.ts "agent" "status" "--client-host" "54.253.185.58" "--client-port" "1315" "--node-id" "vcsbapctn5pljtgj97pd6i5eu33ub2foc9be8jfnfa1q3q76il820"

✔ Please enter the password … ******
status  LIVE
pid     1
nodeId  vcsbapctn5pljtgj97pd6i5eu33ub2foc9be8jfnfa1q3q76il820
clientHost      0.0.0.0
clientPort      1315
proxyHost       0.0.0.0
proxyPort       1314
agentHost       127.0.0.1
agentPort       46463
forwardHost     127.0.0.1
forwardPort     36081
rootPublicKeyPem        -----BEGIN PUBLIC KEY-----
        MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAgN4HRp8DDjzXmZPiusyR
        5vkY+4D5ejOJQtf/pLliszCMdls50BNyED+1nDV/gVoP5pCA0JN2P0vlJBkPg9Rn
        GWD5qNBsvo9lCijBLsXLSoiqw3Z4tAP3/1C5tnbbS/wFA9STlYJa3kxSesWbaUE7
        Cv67OOE3LASH5ifMNHI2JWauRGEWOw0zTJdEG3iJcd1WGdvtk30Kp8kBNMbLUbN+
        Wt0wnl9QCjE16OqAb+3B6RLWYGkr96evVuVAjuNZVYMYhlyi41KCnOiEndc1ORr9
        /NhIWja/+Z7pief+mFR/4tTDxOIp1xnkCnK+ifuBHoyzMvHDlq7adWIvVR3VV0RV
        ROV+e4PWEt/srec0qN2KS6smh+2WpPbLHE+iDz5pCoFw0/KnH4o/PNgt7lIClS0D
        zhZ/lYkcEhbO31/81okPv4C2tpfM0E5GBChSGV2UMrYfhGhja50TPxVml2m7LZzJ
        WjyFsER76zInHEr1qYthSxZXjerINYbMmZ1yMkdhGTVMTepMFBaYsXWktD1zrDAz
        GvSx1T2RQgy9X50CuTpOybMxSlXeIS6h/UpVGyQj92UCYyf5O0wLCcSmhLgr0pqC
        n6uIbSUmJqHy14svKv1CMHYqZCIcgBamsn6imrvMlbIuUtyd5ageuWgf5efy47sn
        TJ3MtHGs0d1x+xHjSjGRsT8CAwEAAQ==
        -----END PUBLIC KEY-----
rootCertPem     -----BEGIN CERTIFICATE-----
        MIIIKDCCBhCgAwIBAgIFFmYlJpcwDQYJKoZIhvcNAQELBQAwQDE+MDwGA1UEAxM1
        dmNzYmFwY3RuNXBsanRnajk3cGQ2aTVldTMzdWIyZm9jOWJlOGpmbmZhMXEzcTc2
        aWw4MjAwHhcNMjIxMDIwMDc1ODE3WhcNMjMxMDIwMDc1ODE3WjBAMT4wPAYDVQQD
        EzV2Y3NiYXBjdG41cGxqdGdqOTdwZDZpNWV1MzN1YjJmb2M5YmU4amZuZmExcTNx
        NzZpbDgyMDCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAIDeB0afAw48
        15mT4rrMkeb5GPuA+XoziULX/6S5YrMwjHZbOdATchA/tZw1f4FaD+aQgNCTdj9L
        5SQZD4PUZxlg+ajQbL6PZQoowS7Fy0qIqsN2eLQD9/9QubZ220v8BQPUk5WCWt5M
        UnrFm2lBOwr+uzjhNywEh+YnzDRyNiVmrkRhFjsNM0yXRBt4iXHdVhnb7ZN9CqfJ
        ATTGy1GzflrdMJ5fUAoxNejqgG/twekS1mBpK/enr1blQI7jWVWDGIZcouNSgpzo
        hJ3XNTka/fzYSFo2v/me6Ynn/phUf+LUw8TiKdcZ5Apyvon7gR6MszLxw5au2nVi
        L1Ud1VdEVUTlfnuD1hLf7K3nNKjdikurJoftlqT2yxxPog8+aQqBcNPypx+KPzzY
        Le5SApUtA84Wf5WJHBIWzt9f/NaJD7+AtraXzNBORgQoUhldlDK2H4RoY2udEz8V
        Zpdpuy2cyVo8hbBEe+syJxxK9amLYUsWV43qyDWGzJmdcjJHYRk1TE3qTBQWmLF1
        pLQ9c6wwMxr0sdU9kUIMvV+dArk6TsmzMUpV3iEuof1KVRskI/dlAmMn+TtMCwnE
        poS4K9Kagp+riG0lJiah8teLLyr9QjB2KmQiHIAWprJ+opq7zJWyLlLcneWoHrlo
        H+Xn8uO7J0ydzLRxrNHdcfsR40oxkbE/AgMBAAGjggMnMIIDIzAMBgNVHRMEBTAD
        AQH/MAsGA1UdDwQEAwIC9DA7BgNVHSUENDAyBggrBgEFBQcDAQYIKwYBBQUHAwIG
        CCsGAQUFBwMDBggrBgEFBQcDBAYIKwYBBQUHAwgwEQYJYIZIAYb4QgEBBAQDAgD3
        MFgGA1UdEQRRME+CNXZjc2JhcGN0bjVwbGp0Z2o5N3BkNmk1ZXUzM3ViMmZvYzli
        ZThqZm5mYTFxM3E3NmlsODIwhwR/AAABhxAAAAAAAAAAAAAAAAAAAAABMB0GA1Ud
        DgQWBBQtHU/7Cs0Xq1bvNiQw3uvjNotcADAhBgsrBgEEAYO+TwICAQEB/wQPVg0x
        LjAuMS1hbHBoYS4wMIICGAYLKwYBBAGDvk8CAgIBAf8EggIERIICAASSyrCDWeT3
        T/y/VIjzqjqIV/qbe1n0Ilycww11NALOVwGY3wPV3qayMDkJ/kDimf7g1jKkq+XK
        FXb2p96jqBImHCvzH+RycHxq+fKNSxt3r+yXLdMPl5/D1eFDtU1uHXpqdUCGDMgT
        J3i1fzH8JmOB+mzP3TURfEMehwXG3wxxw96wQbBo4LxENkHW6Bu+fqVjLdDvtDq1
        a5HumBF02LyLpYHrIcCJSC9K7Fv2K0ActbX7EPapPSona9O4Nznv36Y/31F3SCiH
        rqC/zjX5OqDl/jf0umqiXA9Ze46d5K2TU2pQvul5eAA4Ke/opmtIYTt4kvzyF1vV
        J9XdkZsrRZKn53OxEDwhCJqqoVBbEbq2c5Wmzvuv6YjI+COIC/kReQp/kW4sya8I
        EEVNXz65JIgZzsuwoicq+ZN0vCvh6x/nAr9IFLDc22w8S81L3X1A6P+AaHWsGfEm
        wSkn42nGvL32ltWwCjwPSoaEF73yBuRzWL2wL9BDVhvoq6f5eMf8SsinM7IeAgTo
        ftMP6gLjeOcPva5fIPFkQM7x4ybDMYtNyjsKycdHA0rs1oPis4qOKI0Uvzqaxx2t
        jySIVZ22ZentVLbuSDivoj3DjWrWORWqpKgz2Qn6mJYAgPFFPS6NE9rIKEjCAC/J
        dNh/H5D1DoVcyBZV17zMo1vU45Xao7l8MA0GCSqGSIb3DQEBCwUAA4ICAQAfegLf
        c+rZ2AVZOO2g0pO4z1BpgAc0/HQdsPZxGZX963FsO06xw+CNqwGjPxCTMGOohf/L
        snWu1Cq73Z8RBzWtH3BMb/4PeD2H5UdHzlp3MVac4tT9X8IYjYeOsZL3axuWynt0
        4vwXUE7+yBkMnDsxJHZzBX8pfSpoVPc7aZ1/7RMJB0mCoHXydF7dgJngFJmrBfb7
        qMhaegmbOEC9og1x+Kfa8pAqynWrkcXgTS1mOezHGItJg2v1+YPjslxcZqqliPwK
        28QPcH9xI6pdgne6zzz5f8dTTSpRbiVK2K3m25jPqYiPklBYfQ1NrBEStVFGieMj
        whRpJqEA1iqBCgJCJi4Qe0GaMM6sx+pf4Dd7MlYrFHndqQS7Ty7oMv46GPhwhh/W
        mGAnU690Buhvr8U9CEm6BD8WHlmLtMtPWMgp/73HNK13nONXN4Skbq9U8wYxLR70
        oijphtKM7zPjn5DOP098JUW/6LBKOVZnOJihU+v/dW+7S2vfjniUCdXgGfZoovIi
        kb8DDnDhc/S5y3ZuGlBE66E166g3264qpDJ3zSCNscQ8DTto8gdJQ+TXMrgt9sE9
        1/w9/ad/8UA0mZeiOtIg+/DSXIOx+bW2kg8/Ms6QNElzkazJXoVq2D20btwO6XO7
        8ohGoCFgXoO2H9Az+95yWqvuKt9930oKVDV4rg==
        -----END CERTIFICATE-----
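To make task 2 (controlling the spawned agent over a client connection) scriptable, the human-readable status output above can be parsed into key/value pairs. Note that `clientHost` and `proxyHost` are `0.0.0.0` (all interfaces), which is presumably what lets them be reached through the instance's public IP, whereas `agentHost` and `forwardHost` are on loopback only. A minimal sketch; `parseAgentStatus` and `isReachableLiveAgent` are hypothetical helpers, and the format is inferred from this single output sample.

```typescript
// Hypothetical helper: parse the human-readable `agent status` output into
// key/value pairs. Format inferred from the sample above: "key   value" lines,
// with indented lines (the PEM blocks) continuing the previous value.
function parseAgentStatus(output: string): Map<string, string> {
  const fields = new Map<string, string>();
  let lastKey: string | undefined;
  for (const line of output.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    const match = /^(\w+)\s+(.*)$/.exec(line);
    if (match) {
      lastKey = match[1];
      fields.set(lastKey, match[2].trim());
    } else if (lastKey !== undefined) {
      // Continuation line: append to the previous key's value.
      fields.set(lastKey, `${fields.get(lastKey) ?? ""}\n${line.trim()}`);
    }
  }
  return fields;
}

// A live agent whose client service listens on all interfaces.
function isReachableLiveAgent(fields: Map<string, string>): boolean {
  return fields.get("status") === "LIVE" && fields.get("clientHost") === "0.0.0.0";
}
```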
CMCDragonkai commented 1 year ago

This should be closed upon merge of https://gitlab.com/MatrixAI/Engineering/Polykey/Polykey-Infrastructure/-/merge_requests/2

tegefaulkes commented 1 year ago

That has been merged. I'm closing this.

CMCDragonkai commented 1 year ago

Great, now we have 8x the performance at 2.5x lower cost. Further savings are possible by reserving the t3a.nano instances, but we will do that after we observe their performance characteristics.
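Reading "2.5x cheaper" as the EC2 setup costing 1/2.5 of the Fargate setup (an interpretation, not stated explicitly above), the two figures combine into roughly a 20x improvement in performance per dollar:

```latex
\frac{\text{perf}_{\text{EC2}} / \text{cost}_{\text{EC2}}}{\text{perf}_{\text{Fargate}} / \text{cost}_{\text{Fargate}}}
  = \frac{8}{1/2.5} = 8 \times 2.5 = 20
```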

CMCDragonkai commented 1 year ago

Note that our EC2 instances are not connected to CloudWatch Logs. We have the agent installed, but it isn't hooked up yet. This will be important for watching storage, CPU usage, network, and startup logs. That will come later.
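When this gets hooked up, the CloudWatch agent is driven by a JSON configuration with `metrics` and `logs` sections. A minimal sketch, assuming an ECS-optimized Amazon Linux AMI for the log file paths; the helper `cloudWatchAgentConfig` and the `polykey-testnet` log group prefix are hypothetical.

```typescript
// Hypothetical sketch: a CloudWatch agent configuration covering the signals
// mentioned above (storage, CPU, network, startup logs). The log group prefix
// is a placeholder; file paths assume an ECS-optimized Amazon Linux AMI.
function cloudWatchAgentConfig(logGroupPrefix: string) {
  // One entry per log file; "{instance_id}" is expanded by the agent itself.
  const logFile = (filePath: string, name: string) => ({
    file_path: filePath,
    log_group_name: `${logGroupPrefix}/${name}`,
    log_stream_name: "{instance_id}",
  });
  return {
    agent: { metrics_collection_interval: 60 },
    metrics: {
      // Tag every metric with the instance ID so nodes can be told apart.
      append_dimensions: { InstanceId: "${aws:InstanceId}" },
      metrics_collected: {
        cpu: { measurement: ["cpu_usage_idle", "cpu_usage_user", "cpu_usage_system"], totalcpu: true },
        mem: { measurement: ["mem_used_percent"] },
        disk: { measurement: ["used_percent"], resources: ["/"] },
        net: { measurement: ["bytes_sent", "bytes_recv"], resources: ["*"] },
      },
    },
    logs: {
      logs_collected: {
        files: {
          collect_list: [
            logFile("/var/log/cloud-init-output.log", "cloud-init"), // boot and userdata output
            logFile("/var/log/ecs/ecs-init.log", "ecs-init"),
            logFile("/var/log/ecs/ecs-agent.log", "ecs-agent"),
          ],
        },
      },
    },
  };
}
```

The serialized JSON would typically be written to `/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json` and loaded with `amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:<path> -s`; the instance role also needs permissions such as those in the `CloudWatchAgentServerPolicy` managed policy.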

CMCDragonkai commented 1 year ago

[screenshot]

This is if we reserve t3a.nano instances, one each, for 1 year.

But before committing to reservations, we need to benchmark the instances to see how many are needed, and compare using more small instances against fewer larger ones.
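Once benchmarks exist, the small-versus-large comparison reduces to simple arithmetic: instances needed is the target load divided by per-instance capacity (rounded up), times the hourly rate. A minimal sketch; `estimate`, `cheapest` and the load-unit notion are hypothetical, and every number is an input, not a real AWS price or Polykey benchmark result.

```typescript
// Hypothetical sketch for the small-vs-large comparison: given how much load
// one instance handled in a benchmark and its hourly price, estimate how many
// instances a target load needs and what that costs per month.
interface InstanceOption {
  name: string;                // e.g. "t3a.nano"
  capacityPerInstance: number; // benchmarked load units one instance sustains
  hourlyUsd: number;           // on-demand or effective reserved hourly rate
}

interface Estimate extends InstanceOption {
  instances: number;
  monthlyUsd: number;
}

// AWS pricing pages conventionally use 730 hours per month.
const HOURS_PER_MONTH = 730;

function estimate(option: InstanceOption, targetLoad: number): Estimate {
  if (option.capacityPerInstance <= 0) {
    throw new Error(`${option.name}: capacity must be positive`);
  }
  // Round up: a partially used instance still has to be paid for in full.
  const instances = Math.ceil(targetLoad / option.capacityPerInstance);
  return { ...option, instances, monthlyUsd: instances * option.hourlyUsd * HOURS_PER_MONTH };
}

function cheapest(options: InstanceOption[], targetLoad: number): Estimate {
  if (options.length === 0) throw new Error("no instance options given");
  return options
    .map((o) => estimate(o, targetLoad))
    .reduce((best, cur) => (cur.monthlyUsd < best.monthlyUsd ? cur : best));
}
```

Running the comparison once with on-demand rates and once with effective reserved rates would show whether a reservation changes which instance size wins.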