Closed jordanpadams closed 1 year ago
:calendar: November status: Delayed start date due to delay in #13
:calendar: December status: Delayed start due to work on PLAID and PDS Deep Archive tasks. System build is not dependent upon task completion.
:calendar: January status: Delayed start due to work on PLAID and PDS Deep Archive tasks. Beta dev task, so no impact on build deliverables.
@nutjob4life will start with an analysis of that (create a ticket for it)
- `pds-dev.jpl.nasa.gov`
- `pds.nasa.gov/demo`
- `pds-int`
@jpl-jengelke is ahead of the curve on this front. Note: a true "demo" may be something we can capture as a story (for the future). Developer staging is something else.
See also: https://github.com/NASA-PDS/pds-registry-app/issues/187
@ramesh-maddegoda and @nutjob4life need to meet to get the actual registry started by Jenkins.
The dev deployment should not run on pds-int, so as not to conflict with I&T deployments. We want to deploy it on one of the pds-devX machines.
@nutjob4life An extra node can be added to PDS Jenkins for any of the pds-devX machines. Please contact Rojeh in DSIO.
Request satisfied. However, we've got a new issue:
On `pds-dev`, this command succeeds:

```
docker container run --rm busybox nslookup -type=A pds-gamma.jpl.nasa.gov
```

and produces:

```
Server:     172.16.8.55
Address:    172.16.8.55:53

Non-authoritative answer:
Name:       pds-gamma.jpl.nasa.gov
Address:    128.149.124.6
```
But using this `docker-compose.yaml`:

```yaml
---
version: '3.9'
services:
  demo:
    image: busybox
    command: nslookup -type=A pds-gamma.jpl.nasa.gov.
...
```
and running `docker-compose up`, it fails with:

```
tmp-demo-1  | ;; connection timed out; no servers could be reached
tmp-demo-1  |
tmp-demo-1 exited with code 1
```
The difference? The Compose version also creates its own network for the services and sets up its own name resolver, but that resolver is then unreachable and cannot forward any requests.
Try it: it'll work on your desktop and on other systems like `edrn-docker.jpl.nasa.gov` just fine. But on `pds-dev.jpl.nasa.gov` and on `pds-int.jpl.nasa.gov`, no dice.
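One way to see the resolver difference directly (an illustrative fragment, not part of the original setup): a throwaway Compose service that just prints its resolver configuration. On a Compose-created network it should show Docker's embedded DNS at `127.0.0.11`, which is the resolver that cannot forward queries upstream on these hosts.

```yaml
---
# Illustrative only: print the resolver config a Compose service sees.
# On a user-defined Compose network this should show "nameserver 127.0.0.11".
version: '3.9'
services:
  resolver-check:
    image: busybox
    command: cat /etc/resolv.conf
...
```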
As a result, the Registry API cannot retrieve the initial set of data to harvest and load into Elasticsearch (the `curl` to `pds-gamma` times out since it can't resolve the name).
Rojeh says this is because of the way Docker is configured on `pds-dev` (and `pds-int`): the system is not allowed to alter firewall rules, which is needed to create the containerized networks the applications require. So Docker is just "sort of" supported on `pds-dev`, but anything using a Docker Composition is pretty much out. Docker is set up this way because `pds-dev` (and `pds-int`) are on "JPLNet", and modifications to firewall rules are forbidden on such systems.
One alternative is to set up a new non-JPLNet host, say `pds-deploy`, which can run all the continuously deployed services. It would only be accessible from within JPL, which is fine I think. And it would not be on JPLNet, so it'd have the freedom to actually run things the way they were meant to run. And since it'd be off of `pds-dev`, it'd be better isolated from all the other crazy things that go on there.
Thoughts @jordanpadams @tloubrieu-jpl @jpl-jengelke?
@nutjob4life I don't necessarily know enough about this stuff, but per the containerized networks, is there no way to have the SAs preconfigure that in some capacity? Is it automated by Docker Compose?
I guess I don't understand how this isn't possible since people deploy docker containers all the time in operations, with public access no less. So I think I'm just lacking knowledge of why this is so different.
That being said, we can try AWS for this, but at that point we should probably be using Terraform for the registries, which kind of misses the point of a lot of what we are trying to do here.
The difference is: `pds-dev` (and `pds-int`) are hamstrung in such a way that it's impossible for Docker to fully work. The system admins prevent Docker from altering the iptables rules that create the virtualized networks containers expect.

Yes, people deploy containers in operations all the time, just not on such crippled hosts as `pds-dev`. (Over on EDRN, we deploy containers on non-JPLNet hosts and then provide public access via reverse-proxying. This isolates the iptables modifications.)
How about we deploy a non-JPLNet host, call it `pds-deploy`, and enjoy a non-enfeebled Docker environment there?
Is there some way to add me as a watcher, please? I'd like to see what was reported by the sysadmin crew. ...
So something happened between the time that I declared it working and today. Although our regression tests continue to run successfully, the test job I created, called `docker-test-pipeline`, no longer functions. It looks like permissions changed on the machine in a bad way, or something is simply broken. If they hardened it, then they probably broke it. ...
https://pds-jenkins.jpl.nasa.gov/job/docker-test-pipeline/19/console
I can't even pick up a Docker image to test, but I'm not sure that's the real issue. (We can always pick up images from CAE Jenkins.) I guess I need to chat with them.
> Is there some way to add me as a watcher, please? I'd like to see what was reported by the sysadmin crew. ...
@jpl-jengelke I shared DSIO-1481 with you; but I also shared the more interesting one, DSIO-1495.
> If they hardened it, then they probably broke it.
Looking at the log for build 19 of `docker-test-pipeline`, the failure doesn't seem to be related to Unix socket permissions, but to when they added `pds-dev` as a job agent. Your pipeline is running on `pds-dev` instead of `pds-int`, and `pds-dev` isn't set up right.
If you go to `docker-test-pipeline`, then to "Configure", and add this to the `agent` block:

```groovy
pipeline {
    agent {
        docker { … }
        label 'pds-int'
    }
    …
}
```

that might fix it!
Just committing an example working configuration here for posterity: https://pds-jenkins.jpl.nasa.gov/job/docker-test-pipeline/27/console
```groovy
pipeline {
    agent {
        label 'pds-int'
    }
    stages {
        stage('Test') {
            agent {
                docker {
                    image 'node:16.13.1-alpine'
                    args '-u 0:0 -e USER=pds4 -e USERNAME=pds4 -e GROUP=pds -v /data:/data:ro'
                    reuseNode true
                }
            }
            steps {
                sh 'node --version'
                ...
```
@nutjob4life Thank you. The node label must go into the `docker` code block. I think it picked up `pds-devX` first due to alpha ordering. That `nslookup` command is now working in a container in my test job on `pds-int`, see ...
https://pds-jenkins.jpl.nasa.gov/job/docker-test-pipeline/28/console
```groovy
pipeline {
    agent {
        docker {
            label 'pds-int'
            image 'node:16.13.1-alpine'
            args '-u 0:0 -e USER=pds4 -e USERNAME=pds4 -e GROUP=pds -v /data:/data:ro'
            reuseNode true
        }
    }
    stages {
        stage('Test') {
            steps {
                sh 'node --version'
                sh 'ls -laF /data/int/tools/'
                sh 'echo "This is a test" >> foo.tst'
                sh 'cat foo.tst'
                sh 'nslookup -type=A pds-gamma.jpl.nasa.gov'
            }
        }
        ...
```
Of course this doesn't say anything about using Compose, but I will try to test that on `pds-int` next.
OK, I think I got it to work by specifying a bridge network mode and opening ports in Docker Compose. See the working build here for the success message and the failed build for the error message.
Why did I specify the DNS servers? I vaguely recall working a similar issue with Jeff Liu before he left the Lab. External DNS access was disabled by ITS per a security directive, which had some unintended consequences even when running services internally. So now we have to specify DNS in certain cases. Note that I have not tested with the `host` `network_mode`, which might work or be needed in certain circumstances.
If there's a mistake here or I'm way off base, please let me know. But again, this has only been tested on `pds-int` and should work on `pds-devX`. (They both seemed to have the same error when I was able to test.) So if it doesn't work, then there is a misconfiguration on the new node, I think.
```yaml
version: '3.8'
services:
  demo:
    image: busybox
    command: nslookup -type=A pds-gamma.jpl.nasa.gov
    network_mode: "bridge"
    ports:
      - "80:80"
      - "53:53"
    dns:
      - 137.78.160.9
      - 137.78.160.19
```
Test Repo (includes Jenkinsfile and Docker Compose YAML file): https://github.jpl.nasa.gov/jengelke/test-pdsen
Okay, well, (1) I didn't know you could do that, and (2) it does let a service resolve an external name!
But internal names (service names become hostnames in a Docker Composition) don't resolve. For example:
```yaml
---
version: '3.9'
services:
  dependent:  # `dependent` is a service name and a hostname
    image: busybox
    command: nc -l -v -p 4000 -s 0.0.0.0 -i 5
  demo:
    image: busybox
    entrypoint: /bin/sh
    command: -c 'echo hello | nc -v dependent 4000'  # `dependent` is a hostname and a service name
...
```
The `demo` service calls `nc`, passing the hostname `dependent`, which Docker resolves to the address of the container also named `dependent`. I tried adding `network_mode`, `ports`, `dns`, and even included Docker's own internal DNS, `127.0.0.11` (the embedded pseudo-DNS server that resolves service names inside a Docker Composition).
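As a sketch (values illustrative, not an exact copy of the file used), the `dns` variant of that attempt looked something like:

```yaml
---
# Sketch of the attempted `dns` override; on pds-dev the service name
# `dependent` still failed to resolve with this in place.
version: '3.9'
services:
  demo:
    image: busybox
    dns:
      - 127.0.0.11   # Docker's embedded DNS for Compose networks
    entrypoint: /bin/sh
    command: -c 'echo hello | nc -v dependent 4000'
...
```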
What you should get (after 10 seconds):
```
Starting tmp_dependent_1 ... done
Starting tmp_demo_1      ... done
Attaching to tmp_dependent_1, tmp_demo_1
demo_1       | dependent (172.23.0.2:4000) open
dependent_1  | listening on 0.0.0.0:4000 ...
dependent_1  | connect to 172.23.0.2:4000 from tmp_demo_1.tmp_default:45629 (172.23.0.3:45629)
dependent_1  | hello
tmp_demo_1 exited with code 0
tmp_dependent_1 exited with code 0
```
What you get on `pds-dev`:

```
tmp-demo-1  | nc: bad address 'dependent'
```
Can you figure out how to get internal hostname (service name) resolution working @jpl-jengelke?
@jordanpadams I still think we should have `pds-deploy.jpl.nasa.gov` as a new VM not on JPLNet.
@nutjob4life @jordanpadams I second the plan to have a separate deploy server if they'll do it.
> Can you figure out how to get internal hostname (service name) resolution working @jpl-jengelke?
Probably. But it might involve setting up a combination of an `/etc/hosts` file inside the containers and/or using bridge networking so services connect to each other through exposed ports (the `expose` keyword). I suspect it can happen with some detailed setup that maybe assigns internal IPs. I'm happy to look into it if I can make some time later in the week. Maybe take a look at this. The details of the network are available using `docker network ls`, and/or consider using `host.docker.internal` to connect.
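A minimal sketch of that `host.docker.internal`-style workaround (assuming Docker 20.10+ for the `host-gateway` alias; untested on the PDS hosts): publish the dependent service's port on the host, then point the other container at the host gateway instead of relying on Compose's internal DNS.

```yaml
version: '3.8'
services:
  dependent:
    image: busybox
    command: nc -l -v -p 4000 -s 0.0.0.0 -i 5
    network_mode: "bridge"
    ports:
      - "4000:4000"        # publish on the host so sibling containers can reach it
  demo:
    image: busybox
    network_mode: "bridge"
    extra_hosts:
      - "dependent:host-gateway"   # maps the name to the host's gateway IP
    entrypoint: /bin/sh
    command: -c 'echo hello | nc -v dependent 4000'
```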
Yeesh, that sounds like it could be kind of brittle, and would involve services figuring out what IPs got assigned to them and then sharing that (perhaps on a message queue) so each one could adjust its own `/etc/hosts` and then communicate with the other services it needs. (And then there's bootstrapping the message queue itself!)
And it's not in the spirit of a Docker Composition.

Ultimately, the sys admins have given us a hobbled Docker environment that doesn't support some of the most essential use cases advertised by Docker.
@jordanpadams I really want to push for a separate VM not on JPLNet that can be called `pds-deploy` or something. We use this pattern on the Early Detection Research Network: various services like `edrn.jpl.nasa.gov/cancerdataexpo`, `edrn.jpl.nasa.gov/portal`, and `mcl.jpl.nasa.gov/portal` all reverse-proxy to a host `edrn-docker`, which is not on JPLNet and therefore runs a fully-functional Docker environment. No bullcrap.
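For the record, the reverse-proxy half of that pattern is just ordinary proxying on the JPLNet-facing web server. A hypothetical nginx fragment (hostname, path, and port are made up for illustration) would look like:

```nginx
# Hypothetical: forward /portal on the JPLNet host to the off-JPLNet
# Docker host, which runs the actual containerized service.
location /portal/ {
    proxy_pass       http://edrn-docker.jpl.nasa.gov:8080/;
    proxy_set_header Host              $host;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}
```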
> ... And it's not in the spirit of a Docker Composition.
> Ultimately, the sys admins have given us a hobbled Docker environment that doesn't support some of the most essential use cases advertised by Docker. ...
Not arguing against a separate VM. I've seen environments locked down similarly, which is why I think Docker evolved to provide extra controls. We might not need to specify `/etc/hosts` files if we use `host.docker.internal`. This could allow us to connect container-to-container on the same node without a lot of customizations. I don't know if it would be possible to hop between containers outside of one node, but I'm not sure that's the use case here.
@nutjob4life per:
> @jordanpadams I really want to push for a separate VM not on JPLNet
Copy. This will have to happen in AWS then (hopefully not too expensive). As far as I understand it, a VM outside of JPLNet requires a DMZ and all kinds of approvals and such that we can't do.
Update: Actually, thinking about this some more, even getting something set up outside of JPLNet on AWS is going to require some effort. We will need a new public URL waiver for pds-deploy.jpl.nasa.gov, so we will need to ping the SAs to come up with a checklist of the paperwork that would need to be filed to make this happen.
Also, thinking about this, the public VM should probably only be allowed to continuously deploy tested/operational tools? It seems like a possible security concern to continuously deploy in-development web services. I feel like there should at least be some "gate" before a service goes public, even if it is some sort of sign-off from I&T or the dev team or someone to trigger that deployment.
:calendar: February status: Research has been completed, but the task will be ongoing into B13.0. Beta task; no impact on B12.1.
Just a quick FYI, a non-JPLNet VM is trivial to set up, even easier than provisioning AWS.
@nutjob4life awesome! let's do it then.
:calendar: March status: Design and implementation underway. Going to defer the rest of the implementation to B13.0. Consider this closed, but going to keep it open for now and move it to B13.0 because of the detailed conversation above.
Moving this out of the Sprint Backlog. See sub-tasks for details.
:calendar: May status: In progress. On schedule
:calendar: June status: Harness investigation ongoing. On schedule
:calendar: July status: Harness investigation ongoing. Completion delayed, awaiting a decision on Harness + NGAP, plus issues with the subcontractor transitioning from Columbus to APR.
Pilots of Harness and other tools were demoed. Going to close out this task as completed for B13.0 and will revisit in a future build when it makes sense to deploy Harness operationally in NGAP.
Additional Details
Follow-on to the design in https://github.com/NASA-PDS/devops/issues/3