NASA-PDS / devops

Parent repo for PDS DevOps activities
Apache License 2.0

B13.0 Implement Continuous Deployment Strategy #11

Closed jordanpadams closed 1 year ago

jordanpadams commented 2 years ago

📖 Additional Details

Follow-on to https://github.com/NASA-PDS/devops/issues/3 design

jordanpadams commented 2 years ago

:calendar: November status: Delayed start date due to delay in #13

jordanpadams commented 2 years ago

📆 December status: Delayed start due to work on PLAID and PDS Deep Archive tasks. System build is not dependent upon task completion.

jordanpadams commented 2 years ago

📆 January status: Delayed start due to work on PLAID and PDS Deep Archive tasks. Beta dev task, so no impact on build deliverables.

tloubrieu-jpl commented 2 years ago

@nutjob4life will start with an analysis of that (a ticket will be created for it)

nutjob4life commented 2 years ago

Continuous Deployment

Implementation

Future Stages

👉 Note: a true "demo" may be something we can capture as a story (for the future). Developer staging is something else.

[Screenshot: Screen Shot 2022-02-15 at 4.16.45 PM.png]

See also: https://github.com/NASA-PDS/pds-registry-app/issues/187

tloubrieu-jpl commented 2 years ago

@ramesh-maddegoda and @nutjob4life need to meet to get the actual registry started by Jenkins.

The dev deployment should not run on pds-int, so as not to conflict with I&T deployments. We want to deploy it on one of the pds-devX machines.

jpl-jengelke commented 2 years ago

@nutjob4life An extra node can be added to PDS Jenkins for any of the pds-devX machines. Please contact Rojeh in DSIO.

nutjob4life commented 2 years ago

Request filed

nutjob4life commented 2 years ago

Request satisfied. However, we've got a new issue:

On pds-dev, this command succeeds:

docker container run --rm busybox nslookup -type=A pds-gamma.jpl.nasa.gov

and produces:

Server:     172.16.8.55
Address:    172.16.8.55:53

Non-authoritative answer:
Name:   pds-gamma.jpl.nasa.gov
Address: 128.149.124.6

But using this docker-compose.yaml:

---
version: '3.9'
services:
    demo:
        image: busybox
        command: nslookup -type=A pds-gamma.jpl.nasa.gov.
...

and running docker-compose up, it fails with:

tmp-demo-1  | ;; connection timed out; no servers could be reached
tmp-demo-1  | 
tmp-demo-1 exited with code 1

The difference? The compose version also creates its own network for the services and sets up its own name resolver, but that resolver is then unreachable and cannot forward any requests.

Try it: it'll work on your desktop and on other systems like edrn-docker.jpl.nasa.gov just fine. But on pds-dev.jpl.nasa.gov and on pds-int.jpl.nasa.gov, no dice.

As a result, the Registry API cannot retrieve the initial set of data to harvest and load into Elasticsearch (the curl to pds-gamma times out since it can't resolve the name).

Rojeh says this is because the way Docker is configured on pds-dev (and pds-int), the system is not allowed to alter firewall rules which is needed to make the containerized networks needed by the applications. So Docker is just "sort of" supported on pds-dev, but anything using a Docker Composition is pretty much out. Docker is set up this way because pds-dev (and pds-int) are on "JPLNet", and modifications to firewall rules are forbidden on such systems.
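If the restriction is specifically on Docker creating per-composition bridge networks (and the iptables rules behind them), one possible workaround, offered only as an untested sketch, is host networking, which skips network creation entirely at the cost of container network isolation:

```yaml
# Hypothetical sketch, not a confirmed fix: with host networking the
# container reuses the host's network stack directly, so Docker does not
# need to create a bridge network or alter any iptables rules.
services:
    demo:
        image: busybox
        command: nslookup -type=A pds-gamma.jpl.nasa.gov
        network_mode: host   # no per-composition network is created
```

Note that with `network_mode: host`, any `ports:` mappings are ignored since the container already shares the host's ports; services would then have to avoid port collisions with each other and with the host.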

One alternative is to set up a new non-JPLNet host, say pds-deploy, which can run all the continuously deployed services. It would only be accessible from within JPL, which is fine I think. And it would not be on JPLNet, so it'd have the freedom to actually run things the way they were meant to run. And since it'd be off of pds-dev, it'd be better isolated from all the other crazy things that go on there.

Thoughts @jordanpadams @tloubrieu-jpl @jpl-jengelke?

jordanpadams commented 2 years ago

@nutjob4life I don't necessarily know enough about this stuff, but regarding the containerized networks, is there no way to have the SAs preconfigure that in some capacity? Or is it automated by Docker Compose?

I guess I don't understand how this isn't possible since people deploy docker containers all the time in operations, with public access no less. So I think I'm just lacking knowledge of why this is so different.

That being said, we can try AWS for this, but at that point we should probably be using Terraform for the registries, which kind of misses the point of a lot of what we are trying to do here.

nutjob4life commented 2 years ago

The difference is: pds-dev (and pds-int) are hamstrung in such a way that it's impossible for Docker to fully work. The system admins prevent Docker from altering the iptables rules that create the virtualized networks containers expect.

Yes, people deploy containers in operations all the time, just not on such crippled hosts as pds-dev. (Over on EDRN, we deploy containers on non-JPLNet hosts and then provide public access via reverse-proxying. This isolates the iptables modifications.)

How about we deploy a non-JPLNet host, call it pds-deploy, and enjoy a non-enfeebled Docker environment there?

jpl-jengelke commented 2 years ago

Request filed

Is there some way to add me as a watcher, please? I'd like to see what was reported by the sysadmin crew. ...

jpl-jengelke commented 2 years ago

So something happened between the time that I declared it working and today. Although our regression tests continue to run successfully, the test job I created, docker-test-pipeline, no longer functions. It looks like permissions changed on the machine in a bad way, or something is simply broken. If they hardened it, then they probably broke it. ...

https://pds-jenkins.jpl.nasa.gov/job/docker-test-pipeline/19/console

I can't even pick up a Docker image to test, but I'm not sure that's the real issue. (We can always pick up images from CAE Jenkins.) I guess I need to chat with them.

nutjob4life commented 2 years ago

Is there some way to add me as a watcher, please? I'd like to see what was reported by the sysadmin crew. ...

@jpl-jengelke I shared DSIO-1481 with you; but I also shared the more interesting one, DSIO-1495.

If they hardened it, then they probably broke it.

Looking at the log for build 19 of docker-test-pipeline, the failure doesn't seem to be related to Unix socket permissions, but rather to when they added pds-dev as a job agent. Your pipeline is running on pds-dev instead of pds-int, and pds-dev isn't set up right.

If you go to docker-test-pipeline, go to "Configure", add this to the agent block:

pipeline {
    agent {
        docker { … }
        label 'pds-int'
    }
    …
}

and that might fix it! 😉

jpl-jengelke commented 2 years ago

Just committing an example working configuration here for posterity: https://pds-jenkins.jpl.nasa.gov/job/docker-test-pipeline/27/console

pipeline {
    agent {
        label 'pds-int'
    }
    stages {
        stage('Test') {
            agent {
                docker { 
                    image 'node:16.13.1-alpine' 
                    args '-u 0:0 -e USER=pds4 -e USERNAME=pds4 -e GROUP=pds -v /data:/data:ro'
                    reuseNode true
                }
            }
            steps {
                sh 'node --version'
                ...
jpl-jengelke commented 2 years ago

@nutjob4life Thank you. The node label must go into the docker code block. I think it picked up pds-devX first due to alpha ordering. That nslookup command is now working in a container in my test job on pds-int, see ... https://pds-jenkins.jpl.nasa.gov/job/docker-test-pipeline/28/console

pipeline {
    agent {
        docker { 
            label 'pds-int'
            image 'node:16.13.1-alpine' 
            args '-u 0:0 -e USER=pds4 -e USERNAME=pds4 -e GROUP=pds -v /data:/data:ro'
            reuseNode true
        }
    }
    stages {
        stage('Test') {
            steps {
                sh 'node --version'
                sh 'ls -laF /data/int/tools/'
                sh 'echo "This is a test" >> foo.tst'
                sh 'cat foo.tst'
                sh 'nslookup -type=A pds-gamma.jpl.nasa.gov'
            }
        }
        ...

Of course this doesn't say anything about using compose, but I will try to test that on pds-int next.

jpl-jengelke commented 2 years ago

OK, I think I got it to work by opening ports and specifying a bridge network mode in Docker Compose. See the working build here for the success message and the failed build for the error message.

Why did I specify the DNS servers? I vaguely recall working a similar issue with Jeff Liu before he left the Lab. External DNS access was disabled by ITS per a security directive which had some unintended consequences even when running services internally. So now we have to specify DNS in certain cases. Note that I have not tested with host network_mode which might work or be needed in certain circumstances.

If there's a mistake here or I'm way off base, please let me know. But again, this has only been tested on pds-int, and it should work on pds-devX. (They both seemed to have the same error when I was able to test.) So if it doesn't work, then there is a misconfiguration on the new node, I think.

version: '3.8'
services:
    demo:
        image: busybox
        command: nslookup -type=A pds-gamma.jpl.nasa.gov
        network_mode: "bridge"
        ports:
            - "80:80"
            - "53:53"
        dns: 
            - 137.78.160.9 
            - 137.78.160.19 

Test Repo (includes Jenkinsfile and Docker Compose YAML file): https://github.jpl.nasa.gov/jengelke/test-pdsen

nutjob4life commented 2 years ago

Okay, well, Ⓐ I didn't know you could do that 😇 and Ⓑ it does let a service resolve an external name 😮

But internal names (service names become hostnames in a Docker Composition) don't resolve. For example:

---
version: '3.9'
services:
dependent:  # `dependent` is a service name and a hostname
        image: busybox
        command: nc -l -v -p 4000 -s 0.0.0.0 -i 5
    demo:
        image: busybox
        entrypoint: /bin/sh
        command: -c 'echo hello | nc -v dependent 4000'  # `dependent` is a hostname and a service name
...         

The demo service calls nc, passing the hostname dependent, which Docker resolves to the address of the container also named dependent. I tried adding network_mode, ports, dns, and even included Docker's own internal DNS, 127.0.0.11 (the embedded DNS server that resolves service names to container addresses in a Docker Composition).

What you should get (after 10 seconds):

Starting tmp_dependent_1 ... done
Starting tmp_demo_1      ... done
Attaching to tmp_dependent_1, tmp_demo_1
demo_1       | dependent (172.23.0.2:4000) open
dependent_1  | listening on 0.0.0.0:4000 ...
dependent_1  | connect to 172.23.0.2:4000 from tmp_demo_1.tmp_default:45629 (172.23.0.3:45629)
dependent_1  | hello
tmp_demo_1 exited with code 0
tmp_dependent_1 exited with code 0

What you get on pds-dev:

tmp-demo-1       | nc: bad address 'dependent'

Can you figure out how to get internal hostname (service name) resolution working @jpl-jengelke?

@jordanpadams I still think we should have pds-deploy.jpl.nasa.gov as a new VM not on JPLNet.

jpl-jengelke commented 2 years ago

@nutjob4life @jordanpadams I second the plan to have a separate deploy server if they'll do it.

Can you figure out how to get internal hostname (service name) resolution working @jpl-jengelke?

Probably. But it might involve setting up a combination of an /etc/hosts file inside containers and/or using bridge networking to connect to each other by exposing ports (expose keyword). I suspect it can happen with some detailed setup that maybe assigns internal IPs. I'm happy to look into it if I can make some time later in the week. Maybe take a look at this. The details of the network are available using docker network ls and/or consider using host.docker.internal to connect.
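A sketch of that idea in Compose terms (hypothetical and untested on pds-int; `host-gateway` support requires Docker Engine 20.10 or later) might publish the dependent service's port on the host and inject a hosts entry so the client can resolve the name without Compose's internal DNS:

```yaml
# Hypothetical sketch: instead of relying on Compose's embedded DNS,
# publish the dependent service's port on the host and let the client
# reach it via an /etc/hosts entry injected with extra_hosts.
services:
    dependent:
        image: busybox
        command: nc -l -v -p 4000 -s 0.0.0.0 -i 5
        network_mode: "bridge"
        ports:
            - "4000:4000"              # expose the listener on the host
    demo:
        image: busybox
        network_mode: "bridge"
        extra_hosts:
            - "dependent:host-gateway" # resolve 'dependent' to the host
        entrypoint: /bin/sh
        command: -c 'echo hello | nc -v dependent 4000'
```

This routes container-to-container traffic through the host rather than a composition network, which is exactly the kind of detailed setup described above; whether it works on these hosts would still need to be verified.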

nutjob4life commented 2 years ago

Yeesh, that sounds like it could be kind of brittle: it would involve services figuring out what IPs got assigned to them and then sharing that (perhaps on a message queue) so each one could adjust its own /etc/hosts and then communicate with the other services it needs. (And then there's bootstrapping the message queue itself!)

And it's not in the spirit of a Docker Composition 😇

Ultimately, the sys admins have given us a hobbled Docker environment that doesn't support some of the most essential use cases advertised by Docker.

@jordanpadams I really want to push for a separate VM not on JPLNet that can be called pds-deploy or something. We use this pattern on the Early Detection Research Network: various services like edrn.jpl.nasa.gov/cancerdataexpo or edrn.jpl.nasa.gov/portal or mcl.jpl.nasa.gov/portal all reverse-proxy to a host edrn-docker which is not on JPLNet and therefore runs a fully-functional Docker environment. No bullcrap.

jpl-jengelke commented 2 years ago

... And it's not in the spirit of a Docker Composition 😇

Ultimately, the sys admins have given us a hobbled Docker environment that doesn't support some of the most essential use cases advertised by Docker. ...

Not arguing against a separate VM. I've seen environments locked down similarly which is why I think Docker evolved to provide extra controls. We might not need to specify /etc/hosts files if we use host.docker.internal. This could allow us to connect container-to-container running on the same node without a lot of customizations. I don't know if it would be possible to hop between containers outside of one node, but I'm not sure that's the use case here.

jordanpadams commented 2 years ago

@nutjob4life per:

@jordanpadams I really want to push for a separate VM not on JPLNet

Copy. This will have to happen in AWS then (hopefully not too expensive). As far as I understand it, a VM outside of JPLNet requires a DMZ and all kinds of approvals and such that we can't do.

Update: actually, thinking about this some more, even getting something set up outside of JPLNet on AWS is going to require some effort. we will need a new public URL waiver for pds-deploy.jpl.nasa.gov, so we will need to ping the SAs to come up with a checklist of the paperwork that would need to be filed to make this happen.

also, thinking about this, the public VM should probably only be able to perform CD of tested/operational tools? it seems like a possible security concern to continuously deploy in-development web services. i feel like there should at least be some "gate" before the service goes public, even if it is some sort of sign-off from I&T or the dev team or someone to trigger that deployment.

jordanpadams commented 2 years ago

📆 February status: Research has been completed, but the task will carry over into B13.0. Beta task. No impact on B12.1.

nutjob4life commented 2 years ago

Just a quick FYI, a non-JPLNet VM is trivial to set up, even easier than provisioning AWS.

jordanpadams commented 2 years ago

@nutjob4life awesome! let's do it then.

nutjob4life commented 2 years ago

Request filed

jordanpadams commented 2 years ago

:calendar: March status: design and implementation underway. going to defer the rest of the implementation to B13.0. consider this effectively complete, but going to keep it open for now and move it to B13.0 because of the detailed conversation above.

jordanpadams commented 2 years ago

moving this out of the Sprint Backlog. See sub-tasks for details

jordanpadams commented 2 years ago

📆 May status: In progress. On schedule

jordanpadams commented 1 year ago

:calendar: June status: Harness investigation ongoing. On schedule

jordanpadams commented 1 year ago

📆 July status: Harness investigation ongoing. Completion delayed, awaiting a decision on Harness + NGAP and resolution of issues with the subcontractor transitioning from Columbus to APR.

jordanpadams commented 1 year ago

Pilots of Harness and other tools were demoed. Going to close out this task as completed for B13.0 and will revisit in a future build when it makes sense to deploy Harness operationally in NGAP.